I’m trying to make a web scraper that will parse a web-page of publications

Asked: May 12, 2026

I’m trying to make a web scraper that will parse a web-page of publications and extract the authors. The skeletal structure of the web-page is the following:

    <html>
    <body>
    <div id="container">
    <div id="contents">
    <table>
    <tbody>
    <tr>
    <td class="author">####I want whatever is located here ###</td>
    </tr>
    </tbody>
    </table>
    </div>
    </div>
    </body>
    </html>

I’ve been trying to use BeautifulSoup and lxml thus far to accomplish this task, but I’m not sure how to handle the two div tags and td tag because they have attributes. In addition to this, I’m not sure whether I should rely more on BeautifulSoup or lxml or a combination of both. What should I do?

At the moment, my code looks like what is below:

    import re
    import urllib2,sys
    import lxml
    from lxml import etree
    from lxml.html.soupparser import fromstring
    from lxml.etree import tostring
    from lxml.cssselect import CSSSelector
    from BeautifulSoup import BeautifulSoup, NavigableString

    address='http://www.example.com/'
    html = urllib2.urlopen(address).read()
    soup = BeautifulSoup(html)
    html=soup.prettify()
    html=html.replace('&nbsp', '&#160')
    html=html.replace('&iacute','&#237')
    root=fromstring(html)

I realize that a lot of the import statements may be redundant, but I just copied whatever I currently had in my source file.

EDIT: I suppose that I didn’t make this quite clear, but there are multiple such td tags in the page that I want to scrape.

1 Answer

Editorial Team · Answer 1 · 2026-05-12T12:06:23+00:00

It’s not clear to me from your question why you need to worry about the div tags — what about doing just:

    soup = BeautifulSoup(html)
    thetd = soup.find('td', attrs={'class': 'author'})
    print thetd.string

On the HTML you give, running this emits exactly:

####I want whatever is located here ###

which appears to be what you want. Perhaps you can specify more precisely what you need that this super-simple snippet doesn’t do: multiple td tags of class author that you need to consider (all? just some? which ones?), the tag possibly being missing (what do you want to do in that case?), and the like. It’s hard to infer your exact specs from just this simple example and overabundant code ;-).

Edit: if, as per the OP’s latest comment, there are multiple such td tags, one per author:

    thetds = soup.findAll('td', attrs={'class': 'author'})
    for thetd in thetds:
        print thetd.string

…i.e., not much harder at all!-)
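
For readers on Python 3, where the `print` statements above won't run unchanged, here is a minimal sketch of the same idea using only the standard library's `html.parser`. This is a swapped-in technique, not the BeautifulSoup calls shown above. The names `AuthorCellParser` and `extract_authors` and the sample author names are made up for illustration, and the sketch assumes author cells don't contain nested tables.

```python
from html.parser import HTMLParser


class AuthorCellParser(HTMLParser):
    """Collect the text of every <td class="author"> cell (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.authors = []    # finished cell texts, in document order
        self._chunks = None  # text pieces of the cell being read, else None

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            # The class attribute may hold several space-separated names.
            classes = (dict(attrs).get("class") or "").split()
            if "author" in classes:
                self._chunks = []

    def handle_data(self, data):
        if self._chunks is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        # Assumes no nested <td> inside an author cell.
        if tag == "td" and self._chunks is not None:
            self.authors.append("".join(self._chunks).strip())
            self._chunks = None


def extract_authors(html):
    """Return a list of author-cell texts; empty if the page has none."""
    parser = AuthorCellParser()
    parser.feed(html)
    parser.close()
    return parser.authors


# Illustrative input modelled on the skeleton in the question.
sample = """
<div id="container"><div id="contents"><table><tbody>
<tr><td class="author">A. Author</td></tr>
<tr><td class="author">B. Writer &amp; C. Editor</td></tr>
<tr><td class="title">Not an author</td></tr>
</tbody></table></div></div>
"""
print(extract_authors(sample))             # ['A. Author', 'B. Writer & C. Editor']
print(extract_authors("<p>no table</p>"))  # []
```

Returning an empty list when no matching cell exists covers the "missing tag" case raised above without a special branch: a loop over the result simply does nothing.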
