I’m a bit new to Python dev — I’m creating a larger project for some web scraping. I want to approach this as “Pythonically” as possible, and would appreciate some help with the project structure. Here’s how I’m doing it now:
Basically, I have a base class for an object whose purpose is to go to a website and parse some specific data on it into its own array, jobs[]
minion.py
class minion:
# Empty getJobs() function to be defined by object pre-instantiation
def getJobs(self):
pass
# Constructor for a minion that requires site authorization
# Ex: minCity1 = minion('http://portal.com/somewhere', 'user', 'password')
# or minCity2 = minion('http://portal.com/somewhere')
def __init__(self, title, URL, user='', password=''):
self.title = title
self.URL = URL
self.user = user
self.password = password
self.jobs = []
if (user == '' and password == ''):
self.reqAuth = 0
else:
self.reqAuth = 1
def displayjobs(self):
for j in self.jobs:
j.display()
I’m going to have about 100 different data sources. The way I’m doing it now is to just create a separate module for each “Minion”, which defines (and binds) a more tailored getJobs() function for that object
Example: minCity1.py
from minion import minion
from BeautifulSoup import BeautifulSoup
import urllib2
from job import job
# MINION CONFIG
minTitle = 'Some city'
minURL = 'http://www.somewebpage.gov/'
# Here we define a function that will be bound to this object's getJobs function
def getJobs(self):
page = urllib2.urlopen(self.URL)
soup = BeautifulSoup(page)
# For each row
for tr in soup.findAll('tr'):
tJob = job()
span = tr.findAll(['span', 'class="content"'])
# If row has 5 spans, pull data from span 2 and 3 ( [1] and [2] )
if len(span) == 5:
tJob.title = span[1].a.renderContents()
tJob.client = 'Some City'
tJob.source = minURL
tJob.due = span[2].div.renderContents().replace('<br />', '')
self.jobs.append(tJob)
# Don't forget to bind the function to the object!
minion.getJobs = getJobs
# Instantiate the object
mCity1 = minion(minTitle, minURL)
I also have a separate module which simply contains a list of all the instantiated minion objects (which I have to update each time I add one):
minions.py
from minion_City1 import mCity1
from minion_City2 import mCity2
from minion_City3 import mCity3
from minion_City4 import mCity4
minionList = [mCity1,
mCity2,
mCity3,
mCity4]
main.py references minionList for all of its activities for manipulating the aggregated data.
This seems a bit chaotic to me, and was hoping someone might be able to outline a more Pythonic approach.
Thank you, and sorry for the long post!
Instead of creating functions and assigning them to objects (or whatever
minionis, I’m not really sure), you should definitely use classes instead. Then you’ll have one class for each of your data sources.If you want, you can even have these classes inherit from a common base class, but that isn’t absolutely necessary.