I made some good progress on Day 3. After experimenting, I found a way to separate the keywords for each URL, though it required a counter and a few vectors to implement. It might be a good idea to analyze my code as I progress through the 100 days of code challenge.
Analysis
First, I iterate through each line of the CSV that contains the settings, in order to separate the URLs from the keywords. Since the values are separated by commas, the implementation was quite easy: I created a variable, assigned it the index of the comma, and used that value with std::string's substr() function.
CSV_Handler handler;
handler.ReadSettings();
// Separate URLs and keywords; each settings line is expected
// to look like "url,keyword"
for (const std::string &line : handler.csvLines)
{
    std::string::size_type index = line.find(',');
    getSettingsUrl.push_back(line.substr(0, index));
    getSettingsKeywords.push_back(line.substr(index + 1));
}
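To make the split concrete, here is a minimal, self-contained sketch using a made-up settings line (the URL and keyword are placeholders, and I assume every real line actually contains a comma):
#include <iostream>
#include <string>

int main()
{
    // One hypothetical settings line: URL first, keyword after the comma
    std::string line = "https://example.com,laptop";

    std::string::size_type index = line.find(',');
    if (index != std::string::npos) // guard against a line with no comma
    {
        std::cout << line.substr(0, index) << std::endl;  // https://example.com
        std::cout << line.substr(index + 1) << std::endl; // laptop
    }
    return 0;
}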
The getSettingsUrl vector now holds all the URLs from the settings, which means it contains duplicates: one entry per keyword. It is a good time to implement a way to remove the duplicates and to count the number of keywords each URL has.
The loop iterates through the getSettingsUrl vector and checks whether each URL is already present in the getUrls vector.
The first conditional statement handles the very first URL: since the running counter has nothing to compare against yet, its occurrences are counted directly with the std::count() function before moving on to the next line.
Next, the std::find() function is used to check whether the current URL already exists in the getUrls vector. If it exists, the counter is incremented and the loop moves to the next iteration. If it doesn't, the running count of the previous URL is stored, the new URL is added to the vector, and the counter restarts at 1. After the loop, the count of the last URL, which is still sitting in the counter, is stored as well.
The urlCounterHolder vector ends up holding the number of keywords for each URL, in the same order as getUrls; the scraping code later iterates through it.
When the process is complete, getSettingsUrl is no longer needed, so it is cleared to free some memory. The counter is then reset to 0, as it will be reused in the next part.
std::vector<std::string> getUrls;
std::vector<int> urlCounterHolder;
int counter = 0;
// Count how many keywords each URL has
// (std::count and std::find come from <algorithm>; the settings are
// assumed to keep each URL's lines grouped together)
for (const auto &url : getSettingsUrl)
{
    if (urlCounterHolder.empty())
    {
        // The running counter has nothing to compare against yet,
        // so the first URL is counted directly with std::count()
        int count = (int)std::count(getSettingsUrl.begin(),
                                    getSettingsUrl.end(),
                                    url);
        urlCounterHolder.push_back(count);
        getUrls.push_back(url);
        continue; // the first URL is already fully counted
    }
    if (std::find(getUrls.begin(), getUrls.end(), url) != getUrls.end())
    {
        counter++; // one more keyword for the current URL
        continue;
    }
    // A new URL appeared: store the previous URL's running count,
    // unless that URL was the first one (already counted above)
    if (getUrls.size() > 1)
    {
        urlCounterHolder.push_back(counter);
    }
    getUrls.push_back(url);
    counter = 1; // this line is the new URL's first keyword
}
// The last URL's count is still sitting in the counter
if (getUrls.size() > 1)
{
    urlCounterHolder.push_back(counter);
}
getSettingsUrl.clear();
counter = 0; // the counter is reused below
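As an aside, the same bookkeeping could be sketched with std::count() applied to every unique URL instead of a running counter. It is shorter and does not rely on the settings keeping each URL's lines grouped together, at the cost of rescanning the vector once per unique URL (the variable names match the ones above):
// Alternative sketch: rescan the settings once per unique URL
for (const auto &url : getSettingsUrl)
{
    if (std::find(getUrls.begin(), getUrls.end(), url) == getUrls.end())
    {
        getUrls.push_back(url);
        urlCounterHolder.push_back((int)std::count(getSettingsUrl.begin(),
                                                   getSettingsUrl.end(),
                                                   url));
    }
}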
The next part of the code is where the actual web scraping will happen. For now, the nested loops simply print each URL together with its keywords, to verify that the URLs and keywords are paired up properly.
// Start scraping: pair each URL with its keywords
int keywordIndex = 0;
for (int amount : urlCounterHolder)
{
    std::cout << amount << std::endl;
    for (int j = 0; j < amount; j++)
    {
        // keywordIndex keeps advancing across URLs, so each URL
        // gets its own slice of getSettingsKeywords
        std::cout << getSettingsKeywords[keywordIndex++] << std::endl;
        std::cout << getUrls[counter] << std::endl;
    }
    counter++;
}
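To illustrate the pairing, with a hypothetical settings file containing the lines https://example.com,laptop, https://example.com,phone, and https://another.example,tablet, the loop above would print:
2
laptop
https://example.com
phone
https://example.com
1
tablet
https://another.example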
In the next post, the whole scraping system should work.