I am looking for a large amount of data (>100k at least) from web 2.0 sites for a research project. I am thinking of using the exposed API to get the data, but would scraping work better in this case?
The API is good (less work compared to writing a scraper), but I really have no idea how much time I would need to collect that much data, considering there is usually a time/call limit. I’m not saying scraping has no limits; I am just curious which is the better way of getting the job done.
If the site provides an API, then use it.
It’s much simpler, more generic, and legal. If the site is reasonably popular, you can often find wrappers for the language you’re using.
Of course, if you develop a scraper, you won’t be bound by the API’s call limits, but the site may not allow scraping, and that’s exactly why they offer an API for users/developers.
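As for how long the API route would take, a back-of-the-envelope estimate is easy to sketch. The numbers below (100 records per call, 150 calls per hour) are purely hypothetical placeholders, not the limits of any particular site; substitute whatever the API's documentation states.

```python
import math


def estimate_hours(total_records, records_per_call, calls_per_hour):
    """Rough lower bound on the hours needed to fetch total_records
    when the API caps the number of calls per hour."""
    calls_needed = math.ceil(total_records / records_per_call)
    return calls_needed / calls_per_hour


def min_delay_seconds(calls_per_hour):
    """Smallest pause between calls that stays within the hourly cap."""
    return 3600 / calls_per_hour


if __name__ == "__main__":
    # Hypothetical figures: 100k records, 100 per call, 150 calls per hour.
    hours = estimate_hours(100_000, 100, 150)
    delay = min_delay_seconds(150)
    print(f"about {hours:.1f} hours, waiting {delay:.0f} s between calls")
```

With those made-up limits the collection fits in well under a day, so a call limit alone is not necessarily a reason to scrape. This ignores retries, failed calls, and any per-day caps, so treat it as a lower bound.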
About Jeffrey04’s comment:
Let’s see… this is a moral issue. If you want, you can obtain that amount of data several times over without being blocked. You can always change User-Agents, change IP after N requests (all of this programmatically, of course), and do some tricks with cookies, but that’s not the point. What I mean is that the advice against scraping the website is not about the risk of getting banned.