What I currently do is parse text from a URL, clean it, split it on spaces, and save the resulting words to a file.
What I find hard is:
Saving only unique words when scraping multiple URLs.
Case: I scraped words from site.com/page1 and saved the unique words to a file. When scraping site.com/page2, I need to check whether each word is already in the file and save it only if it's not present.
What I have in mind is to take $word[0], read each line of the file with fgets, compare, and save the word if it's not found, then repeat for the next word. But that would mean thousands to hundreds of thousands of iterations.
I am not looking for code, just an idea of how to handle this efficiently and fast.
I’m assuming that you have already stored the unique words you got from site1 in a file called site1.txt, and that you’ve already scraped the words from site2 into an array called $site2. Now you’d like to store $site2 line by line in a file site2.txt, storing only unique words:
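One way this could look, as a minimal sketch rather than a definitive implementation: load the known words once into an array whose keys are the words, then test each new word with isset, which is a hash lookup rather than a rescan of the file. The function name saveUniqueWords and its parameters are hypothetical, and the sketch assumes that "unique" means "not already in the known-words file and not repeated within the new batch", that the word list fits in memory, and that comparison is case-sensitive.

```php
<?php
// Hypothetical helper (not from the original post): appends to $outFile the
// words from $newWords that are not already listed, one per line, in
// $knownFile, and returns the words it wrote.
function saveUniqueWords(array $newWords, string $knownFile, string $outFile): array
{
    // Read the known words once and turn them into array keys, so every
    // check below is a hash lookup instead of a pass over the file.
    $known = [];
    if (is_file($knownFile)) {
        $lines = file($knownFile, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) ?: [];
        $known = array_fill_keys($lines, true);
    }

    $fresh = [];
    foreach ($newWords as $word) {
        // Skip empty strings and anything already seen.
        if ($word === '' || isset($known[$word])) {
            continue;
        }
        $known[$word] = true; // also removes duplicates within $newWords
        $fresh[] = $word;
    }

    // Write all new words in a single call rather than one write per word.
    if ($fresh !== []) {
        file_put_contents($outFile, implode("\n", $fresh) . "\n", FILE_APPEND | LOCK_EX);
    }

    return $fresh;
}
```

Under those assumptions, a call such as saveUniqueWords($site2, 'site1.txt', 'site2.txt') would write only the unseen words to site2.txt. If everything should live in one file, as in the question, the same path could be passed for both file arguments.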