Can somebody distinguish between a crawler and a scraper in terms of scope and functionality?
A crawler gets web pages — i.e., given a starting address (or set of starting addresses) and some conditions (e.g., how many links deep to go, types of files to ignore) it downloads whatever is linked to from the starting point(s).
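To make that concrete, here is a minimal sketch of a breadth-first crawler using only the Python standard library. The names `crawl`, `LinkParser`, `fetch`, `max_depth` and `skip_suffixes` are made up for illustration; `fetch` is assumed to be any callable that takes a URL and returns the page's HTML as a string (or `None` on failure), so the download step can be swapped out.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_urls, fetch, max_depth=2, skip_suffixes=(".jpg", ".png", ".pdf")):
    """Breadth-first crawl; returns {url: html} for every page downloaded."""
    pages = {}
    seen = set(start_urls)
    queue = deque((url, 0) for url in start_urls)
    while queue:
        url, depth = queue.popleft()
        html = fetch(url)  # hypothetical downloader supplied by the caller
        if html is None:
            continue
        pages[url] = html
        if depth >= max_depth:
            continue  # condition: do not follow links any deeper
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any #fragment part.
            link, _fragment = urldefrag(urljoin(url, href))
            if link in seen or link.lower().endswith(skip_suffixes):
                continue  # condition: already queued, or an ignored file type
            seen.add(link)
            queue.append((link, depth + 1))
    return pages
```

Note that the crawler only collects pages. It does not try to interpret what is on them.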
A scraper takes pages that have been downloaded or, in a more general sense, data that’s formatted for display, and (attempts to) extract data from those pages, so that it can (for example) be stored in a database and manipulated as desired.
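As a rough sketch of the scraping side, the following takes a page that has already been downloaded, pulls the cells out of an HTML table, and stores them in an SQLite database. The table layout (two columns, name and price), the `prices` table, and the names `RowParser` and `scrape_prices` are assumptions made up for this example; a real scraper has to be written against the actual markup of the pages in question.

```python
import sqlite3
from html.parser import HTMLParser


class RowParser(HTMLParser):
    """Collects the text of each <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            if self._row:  # header rows with only <th> cells stay empty
                self.rows.append(tuple(self._row))
            self._row = None
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()


def scrape_prices(html, db):
    """Extract (name, price) rows from html and insert them into db."""
    parser = RowParser()
    parser.feed(html)
    db.execute("CREATE TABLE IF NOT EXISTS prices (name TEXT, price REAL)")
    for row in parser.rows:
        if len(row) != 2:
            continue  # not the layout this sketch assumes
        name, raw_price = row
        try:
            price = float(raw_price)
        except ValueError:
            continue  # the "attempts to" part: skip cells that do not parse
        db.execute("INSERT INTO prices VALUES (?, ?)", (name, price))
    db.commit()
```

Once the rows are in the database they can be queried like any other data, e.g. `db.execute("SELECT name FROM prices WHERE price < 10")`.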
Depending on how you use the result, scraping may well violate the rights of the owner of the information and/or the user agreements governing use of a web site (crawling violates the latter in some cases as well).
Many sites include a file named robots.txt in their root (i.e., at the URL http://server/robots.txt) to specify whether and how crawlers should treat that site. In particular, it can list (partial) URLs that a crawler should not attempt to visit. These rules can be specified separately per crawler (user-agent) if desired.
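For illustration, Python's standard library includes `urllib.robotparser`, which a crawler can use to check a URL against these rules before fetching it. The robots.txt content below, the crawler names `AnyCrawler` and `ExampleBot`, and the helper `build_checker` are invented for this sketch. The rules are parsed from a string here; against a live site one would normally point the parser at the site's robots.txt with `set_url()` and `read()` instead.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt: everyone is kept out of /private/,
# and the crawler calling itself "ExampleBot" is kept out entirely.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: ExampleBot
Disallow: /
"""


def build_checker(robots_text):
    """Return a RobotFileParser loaded with the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


rules = build_checker(ROBOTS_TXT)

# A well-behaved crawler asks before each request.
for agent, url in [
    ("AnyCrawler", "http://server/index.html"),
    ("AnyCrawler", "http://server/private/data.html"),
    ("ExampleBot", "http://server/index.html"),
]:
    verdict = "allowed" if rules.can_fetch(agent, url) else "disallowed"
    print(agent, url, verdict)
```

Keep in mind that robots.txt is only a request: nothing in it technically stops a crawler that chooses to ignore it.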