I am using Beautiful Soup to scrape some data from a website, but I am not able to remove the HTML tags from the data when printing it. The code in question is:
import csv
import urllib2
import sys
from bs4 import BeautifulSoup
page = urllib2.urlopen('http://www.att.com/shop/wireless/devices/smartphones.html').read()
soup = BeautifulSoup(page)
soup.prettify()
for anchor1 in soup.findAll('div', {"class": "listGrid-price"}):
    print anchor1
for anchor2 in soup.findAll('div', {"class": "gridPrice"}):
    print anchor2
for anchor3 in soup.findAll('div', {"class": "gridMultiDevicePrice"}):
    print anchor3
The output I get from this looks like:
<div class="listGrid-price">
$99.99
</div>
<div class="listGrid-price">
$0.01
</div>
<div class="listGrid-price">
$0.01
</div>
I want only the prices in the output, without any HTML tags around them. Pardon my ignorance; I am new to programming.
You are printing the found tag. To only print the contained text, use the .string attribute:
The .string value is a NavigableString instance; to use it like a normal unicode object, convert it first. Then you could use strip() to remove the extra whitespace:
Adjusting this a little to allow for empty values:
which gives me:
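A minimal sketch of the steps described above (read .string, convert it to a plain string, strip the whitespace, and tolerate empty values) might look like the following. It is not the answerer's exact code: the inline HTML sample, the "html.parser" choice, and the extract_prices helper are assumptions made for illustration, and it uses Python 3 syntax rather than the Python 2 print statement from the question.

```python
from bs4 import BeautifulSoup

# Inline sample mirroring the output shown in the question (assumption:
# the live page uses the same structure). The last div is deliberately empty.
html = """
<div class="listGrid-price">
$99.99
</div>
<div class="listGrid-price">
$0.01
</div>
<div class="listGrid-price"></div>
"""

soup = BeautifulSoup(html, "html.parser")


def extract_prices(soup, css_class):
    """Hypothetical helper: stripped text of every <div> with the given class."""
    prices = []
    for div in soup.find_all("div", {"class": css_class}):
        # .string is None for an empty tag, so fall back to an empty
        # string before converting and stripping.
        text = str(div.string or "").strip()
        if text:
            prices.append(text)
    return prices


prices = extract_prices(soup, "listGrid-price")
for price in prices:
    print(price)
```

The same helper could be called with "gridPrice" or "gridMultiDevicePrice" to cover the other two loops from the question.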