I am trying to parse a library website to obtain information from a specific

Question

Asked: June 13, 2026 (2026-06-13T10:12:55+00:00)

I am trying to parse a library website to obtain information from a specific publisher. Here is the link to the website.

http://hollis.harvard.edu/?q=publisher:%22sonzogno%22+ex-Everything-7.0:%221700-1943%22+

So far, using Beautiful Soup, I can grab the data I need from this page. The problem is that my script grabs only the first 25 entries (a single page's worth) from the entire result set, which has many more.

What am I missing here?

Here is the small snippet of code.

import urllib2
from bs4 import BeautifulSoup

def url_parse(name):
    # Reject an empty or whitespace-only URL
    if not name.strip():
        print 'Invalid URL'
    else:
        response = urllib2.urlopen(name)
        html_doc = response.read()
        soup = BeautifulSoup(html_doc, "html.parser")
        print soup.title
        # Collect every result link on the page (only 25 come back)
        aleph_li = soup.find_all("a", {"class": "classiclink"})
        print aleph_li

After this I plan to use the information available in these tags. So far, as mentioned, I can grab only 25 of them.

I am unable to iterate through each page, as the URL (which contains some sort of query) doesn't seem to carry any page information. I am not sure how to make repeated requests to the server.

Thanks.

1 Answer

Editorial Team · Answer 1 · 2026-06-13T10:12:56+00:00

Maybe this won’t be so hard:

If you look at the request that fetches another page of results, which goes to result.ashx, you can see the following parameters:

inlibrary:false
noext:false
debug:
lastquery:publisher:"sonzogno" ex-Everything-7.0:"1700-1943"
lsi:user
uilang:en
searchmode:assoc
hardsort:def
skin:harvard
rctx:AAMAAAABAAAAAwAAABJ/AAAHaGFydmFyZDJwdWJsaXNoZXI6InNvbnpvZ25vIiBleC1FdmVyeXRoaW5nLTcuMDoiMTcwMC0xOTQzIjJwdWJsaXNoZXI6InNvbnpvZ25vIiBleC1FdmVyeXRoaW5nLTcuMDoiMTcwMC0xOTQzIhJzb256b2dubyAxNzAwLTE5NDMAAAAAA25hdgR1c2VyAAAAA2RlZgpyZXN1bHRsaXN0BWFzc29jBQAAAAAAAAACZW4AAP////9AEAAAAAAAAAIAAAAGY19vdmVyATEEaV9mawAAAAAA
c_over:1
curpage:3
concept:sonzogno 1700-1943
branch:
ref:
i_fk:
mxdk:-1
q:publisher:"sonzogno" ex-Everything-7.0:"1700-1943"
si:user
cs:resultlist
cmd:nav

So try adding a curpage parameter to your own request. You will likely need a loop to go through all the results, but this seems very doable:

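import urllib
import urllib2
params = urllib.urlencode({"curpage": NUMBER})  # NUMBER is the page to fetch
urllib2.urlopen(YOUR_PAGE, params)  # passing data makes this a POST request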
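
The loop mentioned above might look roughly like the sketch below. It is only an illustration under several assumptions, not a tested client for this site: it is written for Python 3 (where urllib2 became urllib.request), it uses the standard library's HTMLParser in place of Beautiful Soup so that it runs without extra packages, and it assumes pages are numbered from 1 and that a page past the end comes back either empty or as a repeat of the previous page. The names ClassicLinkParser, extract_links, fetch_page and collect_all_links are made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode
from urllib.request import urlopen


class ClassicLinkParser(HTMLParser):
    """Collects the href of every <a> tag whose class includes 'classiclink'."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and "classiclink" in classes:
            self.links.append(attrs.get("href"))


def extract_links(html_doc):
    """Return the classiclink hrefs found in one page of HTML."""
    parser = ClassicLinkParser()
    parser.feed(html_doc)
    return parser.links


def fetch_page(base_url, page):
    """Request one page of results (hypothetical helper).

    Assumption: the server accepts curpage as POSTed form data,
    as in the snippet above. It may need other parameters as well.
    """
    data = urlencode({"curpage": page}).encode("ascii")
    with urlopen(base_url, data) as response:
        return response.read().decode("utf-8", errors="replace")


def collect_all_links(base_url, fetch=fetch_page, max_pages=100):
    """Walk pages 1, 2, 3, ... and gather links until the results run out."""
    all_links = []
    previous = None
    for page in range(1, max_pages + 1):
        links = extract_links(fetch(base_url, page))
        # Stop on an empty page, or when the server repeats the last page.
        if not links or links == previous:
            break
        all_links.extend(links)
        previous = links
    return all_links
```

Calling collect_all_links(YOUR_PAGE) would then return the links from every page rather than just the first 25. If the server also expects the other fields from the captured request (q, cmd, cs and so on), they could be added to the dictionary passed to urlencode; whether it needs them is not something this thread establishes.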
