I am trying to parse an HTML page which, simplified, looks like this:
<div class="anotherclass part">
<a href="http://example.com" >
<div class="column abc"><strike>£3.99</strike><br>£3.59</div>
<div class="column def"></div>
<div class="column ghi">1 Feb 2013</div>
<div class="column jkl">
<h4>A title</h4>
<p>
<img class="image" src="http://example.com/image.jpg">A, List, Of, Terms, To, Extract - 1 Feb 2013</p>
</div>
</a>
</div>
I am a beginner at coding in Python, and I have read and re-read the BeautifulSoup documentation at http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html
I have got this code:
import re

from BeautifulSoup import BeautifulSoup

with open("file.html") as fp:
    html = fp.read()
soup = BeautifulSoup(html)
# re.IGNORECASE is a flag for re.compile, not an entry in the attrs dict
parts = soup.findAll('a', attrs={"class": re.compile('part', re.IGNORECASE)})
for part in parts:
    mypart = {}
    # ghi
    mypart['ghi'] = part.find(attrs={"class": re.compile('ghi')}).string
    # def
    mypart['def'] = part.find(attrs={"class": re.compile('def')}).string
    # h4
    mypart['title'] = part.find('h4').string
    # jkl
    mypart['other'] = part.find('p').string
    # abc
    pattern = re.compile(r'\&\#163\;(\d{1,}\.?\d{2}?)')
    theprices = re.findall(pattern, str(part))
    if len(theprices) == 2:
        mypart['price'] = theprices[1]
        mypart['rrp'] = theprices[0]
    elif len(theprices) == 1:
        mypart['price'] = theprices[0]
        mypart['rrp'] = theprices[0]
    else:
        mypart['price'] = None
        mypart['rrp'] = None
I want to extract any text from the classes def and ghi, which I think my script does correctly.
I also want to extract the two prices from abc, which my script does in a rather clunky fashion at the moment. Sometimes there are two prices, sometimes one, and sometimes none in this part.
Finally, I want to extract the "A, List, Of, Terms, To, Extract" part from class jkl, which my script fails to do. I thought getting the string part of the p tag would work, but I cannot understand why it does not. The date in this part always matches the date in class ghi, so it should be easy to replace/remove it.
Any advice? Thank you!
First, if you add convertEntities=bs.BeautifulSoup.HTML_ENTITIES to the BeautifulSoup(html) call (bs here stands for the BeautifulSoup module, e.g. via import BeautifulSoup as bs), then the HTML entities such as &#163; will be converted to their corresponding Unicode character, such as £. This will allow you to use a simpler regex to identify the prices.
Now, given part, you can find the text content in the <div> with the prices using its contents attribute. All we need to do is extract the number from each item, or skip it if there is no number.
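That extraction step can be sketched roughly as below. This is only an illustration using the standard library: items is a stand-in list for the stringified contents of the price <div> on the sample page, and parse_price is a hypothetical helper name, not something provided by BeautifulSoup.

```python
import re


def parse_price(text):
    """Return the first decimal number found in text as a float, or None."""
    match = re.search(r'\d+(?:\.\d+)?', text)
    return float(match.group()) if match else None


# Assumption: stand-in for the items of the price <div>'s contents,
# each already converted to a plain string.
items = [u'<strike>£3.99</strike>', u'<br />', u'£3.59']

# Keep only the items that actually contain a number.
price = [p for p in (parse_price(item) for item in items) if p is not None]
print(price)  # [3.99, 3.59]
```

Because the entities have already been converted, the pattern only has to look for digits and does not need to mention the pound sign at all.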
At this point price (the list of extracted numbers) will be a list of 0, 1, or 2 floats.
We would like to unpack it directly into mypart['rrp'] and mypart['price'], but that would not work if price is [] or contains only one item.
Your method of handling the three cases with if..else is okay; it is the most straightforward and arguably the most readable way to proceed. But it is also a bit mundane. If you'd like something a little more terse you could do the following:
Since we want to repeat the same price if price contains only one item, you might be led to think about itertools.cycle.
In the case where price is the empty list, [], we want itertools.cycle([None]), but otherwise we could use itertools.cycle(price).
So to combine both cases into one expression, we could use itertools.cycle(price or [None]).
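A minimal sketch of this idea, covering all three cases, might look like the following. The function name split_prices is made up for the example; the order (rrp first, then price) follows the question's own code.

```python
import itertools


def split_prices(price):
    """Return (rrp, price) from a list holding 0, 1, or 2 prices."""
    # An empty list is falsy, so it falls back to [None];
    # cycle() then repeats whatever values it was given, forever.
    price = itertools.cycle(price or [None])
    return next(price), next(price)


print(split_prices([3.99, 3.59]))  # (3.99, 3.59)
print(split_prices([3.59]))        # (3.59, 3.59)
print(split_prices([]))            # (None, None)
```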
The next function peels off the values in the iterator price one by one. Since price is cycling through its values, it will never end; it will just keep yielding the values in sequence and then starting over again if necessary, which is just what we want.
The A, List, Of, Terms, To, Extract - 1 Feb 2013 text could be obtained again through the use of the contents attribute. The <p> tag holds more than one child (the <img> tag and the text node that follows it), so its string attribute is None, but the text is still there as the last item of contents.
So, the full runnable code would look like:
which yields
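For the last step, removing the date from the <p> text, a minimal sketch using plain string handling might look like this. It assumes the text and the ghi date have already been pulled out of the page (raw and date below are stand-ins for those values), and strip_date is a hypothetical helper name.

```python
def strip_date(text, date):
    """Remove a trailing ' - <date>' from text, if it is present."""
    suffix = u' - ' + date
    if text.endswith(suffix):
        text = text[:-len(suffix)]
    return text.strip()


# Assumption: stand-ins for the last item of the <p> tag's contents
# and for the text of the ghi <div>.
raw = u'A, List, Of, Terms, To, Extract - 1 Feb 2013'
date = u'1 Feb 2013'

terms = [term.strip() for term in strip_date(raw, date).split(',')]
print(terms)  # ['A', 'List', 'Of', 'Terms', 'To', 'Extract']
```

Checking for the suffix before slicing means the text is left alone if the date is ever missing or formatted differently.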