I’m using mechanize/cookiejar/lxml to read a page and it works for some but not

Asked: May 19, 2026

I’m using mechanize/cookiejar/lxml to read a page and it works for some but not others. The error I’m getting in them is the one in the title. I can’t post the pages here because they aren’t SFW, but is there a way to fix it? Basically, this is what I do:

import mechanize, cookielib
from lxml import etree    

br = mechanize.Browser()
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)

br.set_handle_equiv(True)
br.set_handle_gzip(True)
br.set_handle_redirect(True)
br.set_handle_referer(False)
br.set_handle_robots(False)
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.13) Gecko/20101206 Ubuntu/10.10 maverick Firefox/3.6.13')]

response = br.open('...')
tree = etree.parse(response) #error

After that I get the root and search the document for the values I want. Apparently iterparse doesn’t crash it, but at the moment I’m assuming it doesn’t just because I didn’t process anything with it. Plus, I haven’t figured out yet how to search for the stuff with it.

I’ve tried disabling gzip and enabling the referer header as well, but neither solves the problem. I also tried saving the source code to disk and creating the tree from that file, just for the sake of it, and I get the same error.

edit
The response I get seems to be fine: using print repr(response) as suggested, I get a <response_seek_wrapper at 0xa4a160c whose wrapped object = <stupid_gzip_wrapper at 0xa49acec whose fp = <socket._fileobject object at 0xa49c32c>>>. I can also save the response using the read() method and check that the saved .xml works in the browser.

Also, in one of the pages there is a &rsquo; entity that gives me the following error: “lxml.etree.XMLSyntaxError: Entity ‘rsquo’ not defined, line 17, column 7054”. So far I’ve replaced it with a regex, but is there a parser that can handle this? I’ve gotten this error even with the lxml.html.parse suggested below.
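For reference, the regex workaround mentioned above could look roughly like the sketch below. It is not the asker's actual code: it assumes Python 3 and only the standard library (html.entities, xml.etree.ElementTree), and the helper name replace_named_entities and the sample string are made up for illustration. The idea is to rewrite HTML-only named entities as numeric character references, which a strict XML parser accepts without a DTD.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# Entities that XML itself predefines; these must be left untouched.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}


def replace_named_entities(markup):
    """Rewrite HTML-only named entities (e.g. &rsquo;) as numeric references."""
    def _sub(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            # Leave XML-native and unknown entities exactly as they are.
            return match.group(0)
        return "&#%d;" % name2codepoint[name]

    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", _sub, markup)


# Hypothetical sample input; the real pages are not shown in the question.
sample = "<p>It&rsquo;s &amp; it works</p>"
fixed = replace_named_entities(sample)

# A strict XML parser now accepts the markup.
root = ET.fromstring(fixed)
```

This only helps when the page is otherwise well-formed XML; genuinely broken HTML still needs an HTML-aware parser.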

Regarding the file being highlighted, I meant that when I open it with gEdit it looks roughly like this: http://img34.imageshack.us/img34/9574/gedit.jpg

1 Answer

Editorial Team · Answer 1 · 2026-05-19T02:32:08+00:00

Use lxml.html.parse for HTML; it can handle even very broken HTML. Do you still get an error then?
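A minimal sketch of that suggestion, assuming Python 3 with lxml installed. The io.BytesIO object and the sample markup are stand-ins (not from the question) for the file-like response that br.open() returns. The sample contains both an &rsquo; entity and an unclosed tag, which the HTML parser is expected to tolerate where a strict XML parser raises XMLSyntaxError.

```python
import io

import lxml.html
from lxml import etree

# Hypothetical page content standing in for the mechanize response.
page = b"<html><body><p class='v'>It&rsquo;s <b>broken</p></body></html>"

# The strict XML parser rejects the undefined entity.
try:
    etree.parse(io.BytesIO(page))
    xml_failed = False
except etree.XMLSyntaxError:
    xml_failed = True

# lxml.html.parse accepts a file-like object and uses the lenient HTML parser.
tree = lxml.html.parse(io.BytesIO(page))
root = tree.getroot()

# Search the document as usual, here with XPath.
text = root.xpath("//p[@class='v']")[0].text_content()
```

If the code must keep calling etree.parse, passing etree.HTMLParser() as its second argument should behave similarly, since lxml.html builds on the same HTML parser.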
