I am trying to parse a library website to obtain information from a specific publisher. Here is the link to the website.
http://hollis.harvard.edu/?q=publisher:%22sonzogno%22+ex-Everything-7.0:%221700-1943%22+
So far, using Beautiful Soup, I can grab the data that I need from this page. The problem is that my script grabs only the first 25 entries (a single page's worth) from the entire result set, which has a lot more.
What am I missing here?
Here is the small snippet of code.
import urllib2
from bs4 import BeautifulSoup

def url_parse(name):
    if(name == " "):
        print 'Invalid Error'
    else:
        # Fetch the search results page and parse it
        response = urllib2.urlopen(name)
        html_doc = response.read()
        soup = BeautifulSoup(html_doc)
        print soup.title
        print soup.find_all("a",{"class":"classiclink"})
        #print soup.find("a",{"class":"classiclink"})
        aleph_li = [] # creates an empty list
        aleph_li = soup.find_all("a",{"class":"classiclink"})
After this I plan to use the information available in these tags. So far, as I said, I can grab only 25 of them.
I am unable to iterate through each page, as the URL (containing some sort of query) doesn’t seem to have any page information. I am not sure how to make recurring requests to the server.
Thanks.
Maybe this won’t be so hard:
If you look at the request made to get another page of results, which goes to result.ashx, you can see the parameters it sends. So try adding a curpage parameter to your own request. It’s likely that you’re going to have to use a loop to go through all the results, but this seems very doable.
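A rough sketch of that loop follows. It is written for Python 3 (urllib.request rather than urllib2) and rests on assumptions I have not verified against the site: that curpage starts at 1, that the same parameter is accepted on the URL you are already requesting, and that a page past the end contains no classiclink anchors. The helper names page_url, collect_links, fetch and extract are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def page_url(base_url, page, param="curpage"):
    """Return base_url with the page parameter (assumed name: curpage) set."""
    parts = urlsplit(base_url)
    # Keep every existing query parameter except an old page number.
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != param]
    query.append((param, str(page)))
    return urlunsplit(parts._replace(query=urlencode(query)))


def collect_links(base_url, fetch, extract, max_pages=50):
    """Request pages 1, 2, ... and gather links until a page yields none."""
    links = []
    for page in range(1, max_pages + 1):
        found = extract(fetch(page_url(base_url, page)))
        if not found:
            # Assumption: an empty page means we ran past the last result.
            break
        links.extend(found)
    return links


def fetch(url):
    """Download one page; this performs a real network request."""
    from urllib.request import urlopen
    with urlopen(url) as response:
        return response.read()


def extract(html_doc):
    """Pull the same anchors the question's script looks for."""
    from bs4 import BeautifulSoup  # imported here so the helpers above work without bs4
    soup = BeautifulSoup(html_doc, "html.parser")
    return soup.find_all("a", {"class": "classiclink"})
```

Calling collect_links(your_url, fetch, extract) would then return the anchors from every page it managed to walk. The max_pages cap is there because, if the server ignores curpage or keeps returning the last page, the "empty page" stop condition never fires; in that case you would need to compare pages or read the total result count from the HTML instead.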