I’m writing some Python code to scrape websites, and what I’m going to end up with is a growing collection of custom scrapers, each about 50 lines long and tailored to extract specific information from a specific website.
My first iteration of the program is one giant file that takes a website as an argument, and scrapes that website if it recognizes it and has custom code for it (using a giant case statement to check if it recognizes the website).
Obviously, this isn’t a great design, so what I’d like to do is pull the custom scrape functions into their own files/classes, and have a small script that I can use to call them by name. For example:
scrape.py --site google
And I’d like to have a file structure similar to:
scrape.py
sites/
    google.py
    yahoo.py
    ...
    bing.py
I haven’t mastered object orientation yet, but I recognize that this is calling out for it, and that what I’m looking for is probably a common OO pattern.
Any help getting this code refactored properly?
PS – I’ve looked at Scrapy, and it’s not really what I need for various reasons.
PPS – I’m not actually scraping search websites, I’m scraping U.S. court websites.
You can put the code in a class with an `__init__` method to get everything configured, a `_download` method to connect to the site and download it, a `_store` method to save the results and a `run` method to tie it all together, like so:
This class can live in your `parser.py` file.
In each one of your site-specific files, put two things.
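As a rough illustration of the class described above, here is a minimal sketch. The constructor arguments (`parser`, `url`, `out_path`), the JSON output format, and the idea that each site module supplies a `Parser` class with a `parse(html)` method are all assumptions made for the example, not details taken from your code.

```python
import json
import urllib.request


class Scraper:
    """Generic driver: download a page, hand it to a parser, store the result."""

    def __init__(self, parser, url, out_path):
        self.parser = parser      # site-specific object with a parse(html) method
        self.url = url            # page to fetch
        self.out_path = out_path  # file the parsed records are written to
        self.html = None
        self.records = None

    def _download(self):
        # Fetch the raw page; real sites may need headers, cookies or retries.
        with urllib.request.urlopen(self.url) as response:
            self.html = response.read().decode("utf-8", errors="replace")

    def _store(self):
        # Persist whatever the parser returned as JSON (assumed format).
        with open(self.out_path, "w", encoding="utf-8") as fh:
            json.dump(self.records, fh, indent=2)

    def run(self):
        # Tie the steps together: download, parse, store.
        self._download()
        self.records = self.parser.parse(self.html)
        self._store()


# What a site module such as sites/google.py might contain (hypothetical).
class Parser:
    """Site-specific extraction logic; only this part changes per site."""

    def parse(self, html):
        # Placeholder rule: keep non-empty lines. Replace with real extraction.
        return [line.strip() for line in html.splitlines() if line.strip()]
```

Because `Scraper` only ever calls `parser.parse(...)`, adding a new site means writing a new `Parser` and nothing else.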
Then you can set up your `python.py` file with the following function:
You can then use it like
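One way to do the lookup by name is `importlib.import_module` from the standard library. The sketch below assumes that `sites/` is an importable package and that every site module defines a class called `Parser`; the function names `load_parser` and `main` are made up for the example.

```python
import argparse
import importlib


def load_parser(site, package="sites"):
    """Import <package>/<site>.py by name and return an instance of its Parser."""
    # Raises ModuleNotFoundError if there is no module for that site.
    module = importlib.import_module(f"{package}.{site}")
    return module.Parser()


def main(argv=None):
    """Handle a command line such as: scrape.py --site google"""
    arg_parser = argparse.ArgumentParser(description="Run one site scraper by name.")
    arg_parser.add_argument("--site", required=True, help="module name under sites/")
    args = arg_parser.parse_args(argv)
    # The returned parser would then be handed to the generic scraper class.
    return load_parser(args.site)
```

Calling `main(["--site", "google"])` would import `sites.google` and return its parser; in the real script you would call `main()` under an `if __name__ == "__main__":` guard so that it reads `sys.argv`.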
Doing it this way has the advantage of not requiring you to make any changes to the `Scraper` class. If you need to do different tricks to get the servers to talk to your scraper, then you can create a `Downloader` class in each module and use it just like the `Parser` class. If you have two or more parsers that do the same thing, just define them as a generic parser in a separate module and import that into the module of each site that requires it. Or subclass it to make tweaks. Without knowing how you’re downloading and parsing the sites, it’s hard to be more specific.
My feeling is that you might have to ask several questions to get all of the details ironed out, but it will be a good learning experience.
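To make the "generic parser, then subclass to tweak" idea concrete, here is a small sketch. The class name `TableParser`, the regex-based row extraction, and the header-row tweak are invented for illustration; a real parser would depend on how the court sites are laid out.

```python
import re


class TableParser:
    """Hypothetical generic parser, e.g. kept in a shared module under sites/."""

    row_pattern = re.compile(r"<tr>(.*?)</tr>", re.S)

    def parse(self, html):
        # Return the stripped contents of every table row on the page.
        return [row.strip() for row in self.row_pattern.findall(html)]


class Parser(TableParser):
    """In a site module: reuse the generic logic, adjust one detail."""

    def parse(self, html):
        rows = super().parse(html)
        # Assumed quirk of this site: the first row is a header, so drop it.
        return rows[1:]
```

Sites that need no tweak can simply do `from sites.common import TableParser as Parser` (module name assumed), so the loader still finds a `Parser` in every module.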