When I read some (but not all) HTML files in Python using a urllib2 opener, I get text filled with backslashes and \u003c escape sequences. I’m passing this text to BeautifulSoup and am having trouble finding what I’m looking for with findAll(), and I now suspect these escape sequences are the cause.
What’s going on with this, and how do I get rid of it?
Approaches like soup.prettify() have no effect.
Here’s some example output (this comes from a Facebook profile):
\\u003cdiv class=\\"pas status fcg\\">Loading...\\u003c\\/div>
\\u003c\\/div>\\u003cdiv class=\\"uiTypeaheadView fbChatBuddyListTypeaheadView dark hidden_elem\\" id=\\"u971289_14\\">\\u003c\\/div>
\\u003c\\/div>\\u003c\\/div>\\u003cdiv class=\\"fbNubFlyoutFooter\\">
\\u003cdiv class=\\"uiTypeahead uiClearableTypeahead fbChatTypeahead\\" id=\\"u971289_15\\">
\\u003cdiv class=\\"wrap\\">\\u003clabel class=\\"clear uiCloseButton\\" for=\\"u971291_21\\">
This same HTML page looks fine and normal in a ‘view source’ window.
EDIT: Here’s the code that’s producing that text. What’s strange is that I don’t get this kind of output from other HTML pages. Note that I’ve replaced the username and password with USERNAME and PASSWORD here. You can try this on your own FB profile if you replace those two.
import os
import urllib
import urllib2
import cookielib
# BeautifulSoup 3 import assumed; with bs4 use "from bs4 import BeautifulSoup"
from BeautifulSoup import BeautifulSoup

fbusername = "USERNAME@gmail.com"
fbpassword = "PASSWORD"
cookiefile = "facebook.cookies"

# Reuse saved cookies if the cookie file already exists
cj = cookielib.MozillaCookieJar(cookiefile)
if os.access(cookiefile, os.F_OK):
    cj.load()

# Opener that follows redirects and keeps cookies across requests
opener = urllib2.build_opener(
    urllib2.HTTPRedirectHandler(),
    urllib2.HTTPHandler(debuglevel=0),
    urllib2.HTTPSHandler(debuglevel=0),
    urllib2.HTTPCookieProcessor(cj)
)
opener.addheaders = [
    ('User-agent', 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_7; en-us) AppleWebKit/533.21.1 (KHTML, like Gecko) Version/5.0.5 Safari/533.21.1'),
    ('Referer', 'http://www.facebook.com/'),
]

def facebooklogin():
    # POST the credentials to the login form and return the response body
    logindata = urllib.urlencode({
        'email': fbusername,
        'pass': fbpassword,
    })
    response = opener.open("https://login.facebook.com/login.php", logindata)
    return ''.join(response.readlines())

print "Logging in to Facebook...\n"
facebooklogin()
facebooklogin()
print "Successful.\n"

# Fetch the profile page and parse it
fetchURL = 'http://www.facebook.com/USERNAME?ref=profile&v=info'
f = opener.open(fetchURL)
fba = f.read()
f.close()
soup = BeautifulSoup(fba)
print soup
The u""" construct is for Python 2. You omit the u for Python 3.
I hope this helps. If not, please improve the information you give in your question.
EDIT: suggested answer now changes \/ to / too.
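For illustration, here is a minimal sketch of the kind of unescaping discussed here. It assumes the backslash sequences in the fetched text are JavaScript-style string escapes (\u003c for <, \/ for /, \" for a double quote) and that the backslashes may be doubled, as in the sample output from the question. The helper name unescape_js is hypothetical, and the sketch is written for Python 3; on Python 2 you would use unichr instead of chr and a u"..." literal for the input.

```python
import re

def unescape_js(text):
    """Turn JavaScript-style escapes back into plain markup (hypothetical helper)."""
    # Assumption: doubled backslashes are an extra escaping layer, so collapse them first.
    text = text.replace('\\\\', '\\')
    # Replace each \uXXXX escape with the character it encodes.
    text = re.sub(r'\\u([0-9a-fA-F]{4})',
                  lambda m: chr(int(m.group(1), 16)),
                  text)
    # Undo the escaped slash and the escaped double quote.
    return text.replace('\\/', '/').replace('\\"', '"')

# One line of the sample output from the question, kept as a raw string.
sample = r'\\u003cdiv class=\\"pas status fcg\\">Loading...\\u003c\\/div>'
print(unescape_js(sample))
# prints: <div class="pas status fcg">Loading...</div>
```

The cleaned-up string can then be handed to BeautifulSoup as usual. Note that collapsing doubled backslashes would be wrong for text that contains genuinely escaped backslashes, so treat that step as an assumption about this particular page.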