I am reading some product pages with Python/BS4 and have found that one piece of the HTML, the price of the item, comes in two different forms.
Sometimes the HTML is:
<span class="currency">$<span id="product_price">0.00</span></span>
And other times it will be:
<span class="currency">$17.95</span></b>
Using price = soup.find('span', {'class' : 'currency'})
I can isolate the span, but when I try to get just the text, using
priceStr = price.findAll(text=re.compile(r''))
and then write it to the output file with
divpage.write('Price = ' + str(priceStr) + '\n')
I get (for the first example):
Price = [u'$', u'0.00']
My question is: is there a way to read JUST the price, without the '$', and how do I translate the encoding from "u'0.00'" to just "0.00"?
I know I can do this using the Python find & replace functions, but I'd like to stick to BS4 as much as possible, without having to write a check for one form or the other…
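I would use get_text() instead of find_all(). It returns all the text inside the span, nested tags included, as a single string, so both forms of the markup come out the same way.
Then you can use lstrip('$') to get rid of the dollar sign.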
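Here is a minimal sketch of that, assuming BeautifulSoup 4 is installed and using the built-in 'html.parser'. The samples list and the read_price helper are names I made up for illustration; the two HTML strings are the ones from the question (without the stray </b>).

```python
from bs4 import BeautifulSoup

# The two markup variants from the question.
samples = [
    '<span class="currency">$<span id="product_price">0.00</span></span>',
    '<span class="currency">$17.95</span>',
]

def read_price(html):
    """Hypothetical helper: return the price text without the dollar sign."""
    soup = BeautifulSoup(html, 'html.parser')
    price = soup.find('span', {'class': 'currency'})
    # get_text() joins the span's own text and the text of any nested tags.
    text = price.get_text().strip()
    # lstrip('$') removes the leading dollar sign, if there is one.
    return text.lstrip('$')

prices = [read_price(html) for html in samples]
for p in prices:
    print('Price = ' + p)
```

As for the u'0.00': that prefix is not really an encoding problem. It shows up because str() was applied to a list, which prints the repr of each unicode string inside it. Writing the single string that get_text() gives you does not show the prefix.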
And you’re done!