When I read some (but not all) HTML files in python using a urllib2

Question

Asked: May 23, 2026

When I read some (but not all) HTML files in Python using a urllib2 opener, the text I get back is full of backslashes and \u003c escape sequences. I’m passing this text to BeautifulSoup and am having trouble finding what I’m looking for with findAll(), and I now suspect these escape sequences are the cause.

What’s going on with this, and how do I get rid of it?

Approaches like soup.prettify() have no effect.

Here’s some example output (this comes from a Facebook profile):

\\u003cdiv class=\\"pas status fcg\\">Loading...\\u003c\\/div>
\\u003c\\/div>\\u003cdiv class=\\"uiTypeaheadView fbChatBuddyListTypeaheadView dark hidden_elem\\" id=\\"u971289_14\\">\\u003c\\/div>
\\u003c\\/div>\\u003c\\/div>\\u003cdiv class=\\"fbNubFlyoutFooter\\">
\\u003cdiv class=\\"uiTypeahead uiClearableTypeahead fbChatTypeahead\\" id=\\"u971289_15\\">
\\u003cdiv class=\\"wrap\\">\\u003clabel class=\\"clear uiCloseButton\\" for=\\"u971291_21\\">

This same HTML page looks fine and normal in a ‘view source’ window.

EDIT: Here’s the code that’s producing that text. What’s strange is that I don’t get this kind of output from other HTML pages. Note that I’ve replaced the username and password with USERNAME and PASSWORD for here. You could try this on your own FB profile if you replace those two.

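import os
import cookielib
import urllib
import urllib2
# Assumes BeautifulSoup 3; with bs4, use "from bs4 import BeautifulSoup" instead.
from BeautifulSoup import BeautifulSoup

fbusername = "USERNAME@gmail.com"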
fbpassword = "PASSWORD"
cookiefile = "facebook.cookies"

cj = cookielib.MozillaCookieJar(cookiefile)
if os.access(cookiefile, os.F_OK):
    cj.load()

opener = urllib2.build_opener(
    urllib2.HTTPRedirectHandler(),
    urllib2.HTTPHandler(debuglevel=0),
    urllib2.HTTPSHandler(debuglevel=0),
    urllib2.HTTPCookieProcessor(cj)
)

opener.addheaders = [('User-agent','Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_7; en-us) AppleWebKit/533.21.1 (KHTML, like Gecko) Version/5.0.5 Safari/533.21.1'),('Referer','http://www.facebook.com/')]

def facebooklogin():
    logindata = urllib.urlencode({
        'email' : fbusername,
        'pass' : fbpassword,
    })

    response = opener.open("https://login.facebook.com/login.php",logindata)
    return ''.join(response.readlines())


print "Logging in to Facebook...\n"
facebooklogin()
facebooklogin()
print "Successful.\n"

fetchURL = 'http://www.facebook.com/USERNAME?ref=profile&v=info'

f = opener.open(fetchURL)
fba = f.read()
f.close()
soup = BeautifulSoup(fba)
print soup

1 Answer

Editorial Team · Answer 1 · 2026-05-23T10:13:24+00:00

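That text looks like HTML that has been escaped for embedding in a JavaScript string: \u003c is the escape for <, \" for a double quote, and \/ for /. Decoding it with the unicode_escape codec turns it back into ordinary markup. The session below is Python 2, where the u""" prefix makes a unicode literal. In Python 3 you omit the u, and since str has no decode method there, you have to encode the text to bytes before decoding it.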

>>> a=u"""\\u003cdiv class=\\"pas status fcg\\">Loading...\\u003c\\/div>
... \\u003c\\/div>\\u003cdiv class=\\"uiTypeaheadView fbChatBuddyListTypeaheadView dark hidden_elem\\" id=\\"u971289_14\\">\\u003c\\/div>
... \\u003c\\/div>\\u003c\\/div>\\u003cdiv class=\\"fbNubFlyoutFooter\\">
... \\u003cdiv class=\\"uiTypeahead uiClearableTypeahead fbChatTypeahead\\" id=\\"u971289_15\\">
... \\u003cdiv class=\\"wrap\\">\\u003clabel class=\\"clear uiCloseButton\\" for=\\"u971291_21\\">
... """
>>> print a.decode('unicode_escape').replace('\\/', '/')
<div class="pas status fcg">Loading...</div>
</div><div class="uiTypeaheadView fbChatBuddyListTypeaheadView dark hidden_elem" id="u971289_14"></div>
</div></div><div class="fbNubFlyoutFooter">
<div class="uiTypeahead uiClearableTypeahead fbChatTypeahead" id="u971289_15">
<div class="wrap"><label class="clear uiCloseButton" for="u971291_21">
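The session above is Python 2. For Python 3, a minimal sketch of the same idea might look like the following. The helper name unescape_fragment and the one-line sample are made up for illustration. The latin-1/backslashreplace round trip is one common way to apply unicode_escape to a str without mangling any non-ASCII characters already in it; it is not something the original answer uses.

```python
# Hypothetical one-line sample in the same escaped form as the question's output.
raw = r'\u003cdiv class=\"pas status fcg\">Loading...\u003c\/div>'


def unescape_fragment(text):
    """Turn \\uXXXX, \\" and \\/ escapes back into plain markup (Python 3)."""
    # Undo the \/ escaping first, so the decoder never sees that
    # non-standard escape sequence.
    text = text.replace("\\/", "/")
    # str has no .decode() in Python 3: go through bytes. Characters outside
    # latin-1 are written out as \uXXXX escapes and restored by the decoder.
    as_bytes = text.encode("latin-1", "backslashreplace")
    return as_bytes.decode("unicode_escape")


print(unescape_fragment(raw))
# prints: <div class="pas status fcg">Loading...</div>
```

The decoded string can then be handed to BeautifulSoup in place of the raw response text.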

I hope this helps. If not, please improve the information you give in your question.

EDIT: suggested answer now changes \/ to / too.
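As an alternative sketch (Python 3, and not part of the answer above): if the escaped text really is the body of a JavaScript/JSON string literal, which is an assumption about how the page embeds it, the standard json module can undo \u003c, \" and \/ in a single step. The function name decode_js_string_body and the sample are hypothetical.

```python
import json


def decode_js_string_body(body):
    """Decode text that is assumed to be the inside of a JSON string literal."""
    # Wrap the body in quotes so it parses as a single JSON string.
    # strict=False lets raw newlines appear inside that string.
    return json.loads('"' + body + '"', strict=False)


# Hypothetical two-line sample in the same escaped form as the question's output.
sample = r'\u003cdiv class=\"wrap\">' + "\n" + r'\u003c\/div>'

print(decode_js_string_body(sample))
# prints:
# <div class="wrap">
# </div>
```

This only works when every double quote in the fragment is backslash-escaped. An unescaped quote ends the string early and json.loads raises a ValueError, in which case the unicode_escape approach is the safer fallback.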
