I’m trying to scrape data from IMDb, but naturally there are a lot of pages, and doing it serially takes way too long, even when I use multi-threaded cURL.
Is there a faster way of doing it?
Yes, I know IMDb offers text files, but they don’t offer everything in any sane fashion.
I’ve done a lot of brute-force scraping with PHP, and sequential processing seems to be fine. I’m not sure what “a long time” is to you, but I often do other stuff while it scrapes.
Typically nothing depends on my scraping in real time; it’s the data that counts, and I usually scrape it and massage it at the same time.
Other times I’ll use a crafty wget command to pull down a site and save it locally, then have a PHP script with some regex magic extract the data.
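To make the "save locally, then extract" idea concrete, here is a minimal sketch, written in Python rather than PHP purely for illustration. It assumes the pages were saved as .html files under one directory; the function name extract_titles and the title pattern are made up for the example, and a real script would use whatever patterns match the fields you care about.

```python
import re
from pathlib import Path

# Hypothetical pattern for illustration: captures the text inside <title>.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def extract_titles(mirror_dir):
    """Return {relative path: title} for every .html file under mirror_dir."""
    root = Path(mirror_dir)
    results = {}
    for page in sorted(root.rglob("*.html")):
        html = page.read_text(encoding="utf-8", errors="replace")
        match = TITLE_RE.search(html)
        if match:
            # Collapse line breaks and repeated spaces inside the title.
            title = " ".join(match.group(1).split())
            results[page.relative_to(root).as_posix()] = title
    return results
```

Because the pages are already on disk, the extraction step can be rerun as often as needed while tuning the patterns, without hitting the site again.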
I use the curl_* functions in PHP and they work very well.
You could have a parent job that forks child processes and hands them URLs to scrape, which they process and save locally (db, fs, etc.). The parent is responsible for making sure the same URL isn’t processed twice and that children don’t hang.
Easy to do on Linux (pcntl_*, fork, etc.), harder on Windows boxes.
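A rough sketch of that parent/child layout, in Python rather than PHP and only as one possible way to do it. In PHP the same roles would be played by pcntl_fork, pcntl_waitpid and posix_kill. The name run_parent, the worker callback, and the limits are assumptions for the example; the worker is whatever function fetches and stores a single URL.

```python
import multiprocessing
import time


def run_parent(urls, worker, max_children=4, timeout=30.0):
    """Fork one child per unique URL and return {url: status}.

    status is "ok", "failed" (non-zero exit) or "timeout" (child killed).
    """
    ctx = multiprocessing.get_context("fork")  # fork-based, so Unix only
    pending = list(dict.fromkeys(urls))  # drop duplicate URLs, keep order
    running = {}  # url -> (process, deadline)
    status = {}
    while pending or running:
        # Start children until the concurrency limit is reached.
        while pending and len(running) < max_children:
            url = pending.pop(0)
            child = ctx.Process(target=worker, args=(url,))
            child.start()
            running[url] = (child, time.monotonic() + timeout)
        # Reap finished children and kill any that ran past their deadline.
        for url, (child, deadline) in list(running.items()):
            if not child.is_alive():
                child.join()
                status[url] = "ok" if child.exitcode == 0 else "failed"
            elif time.monotonic() > deadline:
                child.terminate()
                child.join()
                status[url] = "timeout"
            else:
                continue
            del running[url]
        time.sleep(0.05)  # avoid a busy loop while children work
    return status
```

The parent never scrapes anything itself; it only deduplicates, enforces the child limit, and cleans up stuck children, so a bad page costs at most one timeout.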
You could also add some logic to look at the last-modified time (which you stored on a previous run) and skip scraping the page if the content hasn’t changed or you already have it. There are probably a bunch of optimization tricks like that you could do.
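One way to do the last-modified check is a conditional GET: send the stored timestamp back in an If-Modified-Since header and let the server answer 304 if nothing changed. A minimal sketch in Python, assuming the server actually sends Last-Modified (many dynamic pages don't); fetch_if_changed, the last_seen dict, and the opener parameter are names invented for the example.

```python
import urllib.error
import urllib.request


def fetch_if_changed(url, last_seen, opener=urllib.request.urlopen):
    """Return the page body, or None if the server reports 304 Not Modified.

    last_seen maps url -> Last-Modified value stored on a previous run,
    and is updated in place whenever a fresh copy is downloaded.
    """
    request = urllib.request.Request(url)
    stamp = last_seen.get(url)
    if stamp:
        request.add_header("If-Modified-Since", stamp)
    try:
        with opener(request) as response:
            body = response.read()
            new_stamp = response.headers.get("Last-Modified")
    except urllib.error.HTTPError as err:
        if err.code == 304:  # unchanged since the stored timestamp
            return None
        raise
    if new_stamp:
        last_seen[url] = new_stamp
    return body
```

If last_seen is persisted between runs (a small table or a JSON file would do), repeat scrapes only download pages the server says have changed.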