In answer to a previous question, several people suggested that I use BeautifulSoup

Asked: May 13, 2026

In answer to a previous question, several people suggested that I use BeautifulSoup for my project. I’ve been struggling with their documentation and I just cannot parse it. Can somebody point me to the section where I should be able to translate this expression to a BeautifulSoup expression?

hxs.select('//td[@class="altRow"][2]/a/@href').re('/.a\w+')

The above expression is from Scrapy. I’m trying to apply the regex re('/.a\w+') to the links inside td cells with class altRow.

I would also appreciate pointers to any other tutorials or documentation. I couldn’t find any.

Thanks for your help.

Edit:
I am looking at this page:

>>> soup.head.title
<title>White & Case LLP - Lawyers</title>
>>> soup.find(href=re.compile("/cabel"))
>>> soup.find(href=re.compile("/diversity"))
<a href="/diversity/committee">Committee</a>

Yet, if you look at the page source "/cabel" is there:

 <td class="altRow" valign="middle" width="34%"> 
 <a href='/cabel'>Abel, Christian</a>

For some reason, search results are not visible to BeautifulSoup, but they are visible to XPath because hxs.select('//td[@class="altRow"][2]/a/@href').re('/.a\w+') catches “/cabel”

Edit:
cobbal: It is still not working. But when I search this:

>>> soup.findAll(href=re.compile(r'/.a\w+'))
[<link href="/FCWSite/Include/styles/main.css" rel="stylesheet" type="text/css" />, <link rel="shortcut icon" type="image/ico" href="/FCWSite/Include/main_favicon.ico" />, <a href="/careers/northamerica">North America</a>, <a href="/careers/middleeastafrica">Middle East Africa</a>, <a href="/careers/europe">Europe</a>, <a href="/careers/latinamerica">Latin America</a>, <a href="/careers/asia">Asia</a>, <a href="/diversity/manager">Diversity Director</a>]
>>>

it returns all the links with second character “a” but not the lawyer names. So for some reason those links (such as “/cabel”) are not visible to BeautifulSoup. I don’t understand why.
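To check the regex separately from the parser, a minimal sketch like the following applies the same pattern to a few of the href values quoted above. The list `hrefs` is a hand-copied sample for illustration, not the page's full output, and `pattern.search` is assumed to be a fair stand-in for how the href filter is applied.

```python
import re

# Same pattern as in the findAll call: a slash, any one character,
# then "a" followed by one or more word characters
pattern = re.compile(r'/.a\w+')

# Hand-copied sample of href values mentioned in this question
hrefs = [
    "/cabel",
    "/diversity/committee",
    "/careers/europe",
    "/FCWSite/Include/styles/main.css",
]

# Keep only the hrefs in which the pattern occurs somewhere
matching = [h for h in hrefs if pattern.search(h)]
print(matching)
```

Since "/cabel" passes this check, the pattern itself does not appear to be what excludes the lawyer links; that is consistent with the links never reaching the parsed tree.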

1 Answer

Editorial Team · Answer 1 · 2026-05-13T06:17:21+00:00

I know BeautifulSoup is the canonical HTML parsing module, but sometimes you just want to scrape out some substrings from some HTML, and pyparsing has some useful methods to do this. Using this code:

from pyparsing import makeHTMLTags, withAttribute, SkipTo
from urllib.request import urlopen

# get the HTML from your URL
url = "http://www.whitecase.com/Attorneys/List.aspx?LastName=&FirstName="
page = urlopen(url)
# decode the raw bytes to text (UTF-8 is an assumption; adjust if the
# page declares a different encoding)
html = page.read().decode("utf-8", errors="replace")
page.close()

# define opening and closing tag expressions for <td> and <a> tags
# (makeHTMLTags also comprehends tag variations, including attributes, 
# upper/lower case, etc.)
tdStart, tdEnd = makeHTMLTags("td")
aStart, aEnd = makeHTMLTags("a")

# only interested in tdStarts if they have "class=altRow" attribute
tdStart.setParseAction(withAttribute(("class", "altRow")))

# compose total matching pattern (add trailing tdStart to filter out 
# extraneous <td> matches)
patt = tdStart + aStart("a") + SkipTo(aEnd)("text") + aEnd + tdEnd + tdStart

# scan input HTML source for matching refs, and print out the text and 
# href values
for ref, s, e in patt.scanString(html):
    print(ref.text, ref.a.href)

I extracted 914 references from your page, from Abel to Zupikova.

Abel, Christian /cabel
Acevedo, Linda Jeannine /jacevedo
Acuña, Jennifer /jacuna
Adeyemi, Ike /igbadegesin
Adler, Avraham /aadler
...
Zhu, Jie /jzhu
Zídek, Aleš /azidek
Ziółek, Agnieszka /aziolek
Zitter, Adam /azitter
Zupikova, Jana /jzupikova
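If installing a third-party parser is not an option, the same idea can be roughed out with the standard library. The sketch below uses `html.parser` instead of pyparsing or BeautifulSoup, and runs on a hand-written sample modeled on the snippet in the question (the closing `</td>` and the second cell are added here for illustration). The class name `AltRowLinks` is hypothetical, and the sketch does not reproduce the trailing-`tdStart` filter or the `[2]` position test from the XPath.

```python
from html.parser import HTMLParser


class AltRowLinks(HTMLParser):
    """Collect (text, href) pairs for <a> tags inside <td class="altRow">."""

    def __init__(self):
        super().__init__()
        self.in_alt_row = False  # True while inside a <td class="altRow">
        self.href = None         # href of the <a> currently open, if any
        self.text = []           # text fragments of the current <a>
        self.links = []          # collected (text, href) pairs

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "td":
            self.in_alt_row = attrs.get("class") == "altRow"
        elif tag == "a" and self.in_alt_row and "href" in attrs:
            self.href = attrs["href"]
            self.text = []

    def handle_data(self, data):
        # Only keep text that appears inside a link being collected
        if self.href is not None:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.href is not None:
            self.links.append(("".join(self.text).strip(), self.href))
            self.href = None
        elif tag == "td":
            self.in_alt_row = False


# Hand-written sample; not the real page
sample = """
<td class="altRow" valign="middle" width="34%">
<a href='/cabel'>Abel, Christian</a>
</td>
<td class="other"><a href="/diversity/committee">Committee</a></td>
"""

parser = AltRowLinks()
parser.feed(sample)
print(parser.links)
```

This handles the single-quoted `href='/cabel'` attribute from the page source, but it makes no attempt to cope with nested tables or badly broken markup.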
