I need to crawl a finite list of websites and store their contents locally for future analysis. Basically, I want to slurp in all the pages and follow all internal links so I capture each entire publicly available site.
Are there existing free libraries that would get me there? I've seen Chilkat, but it's paid. I'm just looking for baseline functionality here. Thoughts? Suggestions?
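For a sense of what "baseline functionality" involves, here is a minimal sketch of the core pieces (fetch a page, extract its links, keep only same-site ones) using just the Python standard library. The URLs are placeholders, and a real crawler would also need a fetch loop, politeness delays, and robots.txt handling, which this sketch omits.

```python
import urllib.parse
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href attributes from <a> tags in an HTML page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def internal_links(base_url, html):
    """Return the set of absolute URLs in `html` on the same host as base_url."""
    parser = LinkCollector()
    parser.feed(html)
    host = urllib.parse.urlparse(base_url).netloc
    found = set()
    for link in parser.links:
        # Resolve relative links against the page they appeared on.
        absolute = urllib.parse.urljoin(base_url, link)
        if urllib.parse.urlparse(absolute).netloc == host:
            found.add(absolute)
    return found
```

A crawl is then a loop that pops a URL from a queue, fetches it (e.g. with `urllib.request.urlopen`), saves the body to disk, and enqueues any result of `internal_links` not seen before.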
Exact duplicate: Does anyone know of a good Python-based web crawler that I could use?
Use Scrapy.
It is a Twisted-based web-crawling framework. It is still under heavy development, but it already works, and it has many nice features built in.
Here is example code to extract information about all torrent files added today on the Mininova torrent site, using an XPath selector on the returned HTML: