In answer to a previous question, several people suggested that I use BeautifulSoup for my project. I’ve been struggling with their documentation and I just cannot parse it. Can somebody point me to the section where I should be able to translate this expression to a BeautifulSoup expression?
hxs.select('//td[@class="altRow"][2]/a/@href').re('/.a\w+')
The above expression is from Scrapy. I’m trying to apply the regex re('/.a\w+') to the second td with class altRow to get the links from there.
I would also appreciate pointers to any other tutorials or documentation. I couldn’t find any.
Thanks for your help.
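For reference, a rough BeautifulSoup counterpart of the XPath expression above might look like the sketch below. It assumes the modern bs4 package (find_all rather than the older findAll) and the built-in "html.parser" backend. The sample markup, apart from the '/cabel' cell, is made up for illustration. Note that class_="altRow" matches any td whose class list contains altRow, which is slightly looser than the exact @class="altRow" test in XPath.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup; only the second cell mirrors the page source in the question.
html = """
<table><tr>
<td class="altRow"><a href="/x">skip</a></td>
<td class="altRow" valign="middle" width="34%"><a href='/cabel'>Abel, Christian</a></td>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")
pattern = re.compile(r"/.a\w+")  # same regex as in the Scrapy expression

matches = []
for row in soup.find_all("tr"):
    # td[@class="altRow"][2]: the second matching td among the row's direct children
    cells = row.find_all("td", class_="altRow", recursive=False)
    if len(cells) >= 2:
        # /a/@href: href attributes of <a> elements directly inside that cell
        for a in cells[1].find_all("a", href=True, recursive=False):
            m = pattern.search(a["href"])
            if m:
                matches.append(m.group())

print(matches)
```

The loop over rows is there because the [2] predicate counts cells within each parent, not across the whole document.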
Edit:
I am looking at this page:
>>> soup.head.title
<title>White & Case LLP - Lawyers</title>
>>> soup.find(href=re.compile("/cabel"))
>>> soup.find(href=re.compile("/diversity"))
<a href="/diversity/committee">Committee</a>
Yet if you look at the page source, "/cabel" is there:
<td class="altRow" valign="middle" width="34%">
<a href='/cabel'>Abel, Christian</a>
For some reason, the search results are not visible to BeautifulSoup, but they are visible to XPath, because hxs.select('//td[@class="altRow"][2]/a/@href').re('/.a\w+') catches “/cabel”.
Edit:
cobbal: It is still not working. But when I search this:
>>> soup.findAll(href=re.compile(r'/.a\w+'))
[<link href="/FCWSite/Include/styles/main.css" rel="stylesheet" type="text/css" />, <link rel="shortcut icon" type="image/ico" href="/FCWSite/Include/main_favicon.ico" />, <a href="/careers/northamerica">North America</a>, <a href="/careers/middleeastafrica">Middle East Africa</a>, <a href="/careers/europe">Europe</a>, <a href="/careers/latinamerica">Latin America</a>, <a href="/careers/asia">Asia</a>, <a href="/diversity/manager">Diversity Director</a>]
>>>
it returns all the links with second character “a” but not the lawyer names. So for some reason those links (such as “/cabel”) are not visible to BeautifulSoup. I don’t understand why.
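One way to check whether the links are present in the raw markup, independent of how any parser builds its tree, is to run a regular expression over the page source directly. The sketch below uses only the standard library and the fragment quoted earlier. The variable names are illustrative, and a regex over raw HTML is a diagnostic shortcut, not a general-purpose parser.

```python
import re

# Fragment copied from the page source quoted in the question.
html = """<td class="altRow" valign="middle" width="34%">
<a href='/cabel'>Abel, Christian</a>"""

# Capture href values written with either single or double quotes.
href_re = re.compile(r"""href=["']([^"']+)["']""")
# The pattern from the question: slash, any character, "a", then word characters.
wanted = re.compile(r"/.a\w+")

hrefs = [h for h in href_re.findall(html) if wanted.match(h)]
print(hrefs)
```

If this finds the href in the raw text while soup.find does not, the link is being lost during parsing and is not missing from the download.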
I know BeautifulSoup is the canonical HTML parsing module, but sometimes you just want to scrape out some substrings from some HTML, and pyparsing has some useful methods to do this. Using this code:
I extracted 914 references from your page, from Abel to Zupikova.