I need to crawl a finite list of websites and store their contents locally for future analysis. Basically, I want to slurp in all the pages and follow all internal links so I capture each entire publicly available site.
Are there existing free libraries that would get me there? I've seen Chilkat, but it's paid. I'm just looking for baseline functionality here. Thoughts? Suggestions?
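For a sense of what "baseline functionality" involves, here is a minimal sketch of the core pieces (fetch a page, extract its links, keep only same-site ones) using just the Python standard library. The URLs are placeholders, and a real crawler would also need a fetch loop, politeness delays, and robots.txt handling, which this sketch omits.

```python
import urllib.parse
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href attributes from <a> tags in an HTML page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def internal_links(base_url, html):
    """Return the set of absolute URLs in `html` on the same host as base_url."""
    parser = LinkCollector()
    parser.feed(html)
    host = urllib.parse.urlparse(base_url).netloc
    found = set()
    for link in parser.links:
        # Resolve relative links against the page they appeared on.
        absolute = urllib.parse.urljoin(base_url, link)
        if urllib.parse.urlparse(absolute).netloc == host:
            found.add(absolute)
    return found
```

A crawl is then a loop that pops a URL from a queue, fetches it (e.g. with `urllib.request.urlopen`), saves the body to disk, and enqueues any result of `internal_links` not seen before.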
Exact duplicate: Does anyone know of a good Python-based web crawler that I could use?
Use Scrapy.
It is a Twisted-based web-crawling framework. It is still under heavy development, but it already works, and it has many nice features built in.
Here is example code to extract information about all torrent files added today on the Mininova torrent site, using an XPath selector on the returned HTML: