I am trying to parse arbitrary documents downloaded from the wild web, and yes,


Asked: May 27, 2026 (2026-05-27T17:27:25+00:00)


I am trying to parse arbitrary documents downloaded from the wild web, and yes, I have no control over their content.

Since Beautiful Soup “won’t choke if you give it bad markup”, I wonder why it gives me these hiccups when part of the doc is malformed, and whether there is a way to make it skip the error and resume at the next readable portion of the doc.

The line where the error occurred is the 3rd one:

from BeautifulSoup import BeautifulSoup as doc_parser
reader = open(options.input_file, "rb")
doc = doc_parser(reader)

The full CLI output is:

Traceback (most recent call last):
  File "./grablinks", line 101, in <module>
    sys.exit(main())
  File "./grablinks", line 88, in main
    links = grab_links(options)
  File "./grablinks", line 36, in grab_links
    doc = doc_parser(reader)
  File "/usr/local/lib/python2.7/dist-packages/BeautifulSoup.py", line 1519, in __init__
    BeautifulStoneSoup.__init__(self, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/BeautifulSoup.py", line 1144, in __init__
    self._feed(isHTML=isHTML)
  File "/usr/local/lib/python2.7/dist-packages/BeautifulSoup.py", line 1186, in _feed
    SGMLParser.feed(self, markup)
  File "/usr/lib/python2.7/sgmllib.py", line 104, in feed
    self.goahead(0)
  File "/usr/lib/python2.7/sgmllib.py", line 143, in goahead
    k = self.parse_endtag(i)
  File "/usr/lib/python2.7/sgmllib.py", line 320, in parse_endtag
    self.finish_endtag(tag)
  File "/usr/lib/python2.7/sgmllib.py", line 358, in finish_endtag
    method = getattr(self, 'end_' + tag)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 15-16: ordinal not in range(128)


1 Answer

Editorial Team · Answer 1 · 2026-05-27T17:27:25+00:00

Yes, it will choke if you have elements with non-ASCII names (<café>). And that’s not even ‘bad markup’ for XML…

It’s a bug in sgmllib, which BeautifulSoup uses: it looks for custom handler methods named after each tag (here, 'end_' plus the tag name), but in Python 2 attribute names are byte strings. So even looking up a method whose name contains a non-ASCII character, which can never exist, fails with a UnicodeEncodeError instead of a plain AttributeError.

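You can hack a fix into sgmllib by changing lines 259 and 371 from except AttributeError: to except (AttributeError, UnicodeError): but that’s not really a good fix. The parentheses matter: in Python 2, except AttributeError, UnicodeError: catches only AttributeError and binds it to the name UnicodeError. It’s not trivial to override the rest of the method either.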
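To make the lookup failure concrete, here is a minimal sketch of the same pattern outside sgmllib. The Handler class and the find_end_handler helper are hypothetical names invented for illustration; they are not part of sgmllib or BeautifulSoup. The helper wraps the getattr call and treats a UnicodeError the same way as a missing method, which is the idea behind the hack described above.

```python
# -*- coding: utf-8 -*-


class Handler(object):
    """Hypothetical stand-in for a parser with per-tag callback methods."""

    def end_p(self):
        return "closed p"


def find_end_handler(parser, tag):
    """Return parser.end_<tag> if it exists, else None (hypothetical helper)."""
    try:
        return getattr(parser, "end_" + tag)
    except (AttributeError, UnicodeError):
        # AttributeError: the parser defines no such method.
        # UnicodeError: Python 2 could not encode a non-ASCII attribute
        # name to ASCII before doing the lookup.
        return None


handler = Handler()
known = find_end_handler(handler, u"p")          # bound method end_p
unknown = find_end_handler(handler, u"caf\xe9")  # None, no exception raised
```

On Python 3 the second lookup simply raises AttributeError, so the extra UnicodeError clause is harmless there; it only matters on Python 2.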

What is it you’re trying to parse? BeautifulStoneSoup was always of questionable usefulness, really: XML doesn’t have the wealth of ghastly parser hacks that HTML does, so in general broken XML isn’t XML. Consequently you should generally use a plain old XML parser (e.g. a standard DOM or etree). For parsing general HTML, html5lib is your better option these days.
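As a rough illustration of the "plain old XML parser" route, the sketch below uses the standard library's xml.etree.ElementTree. The element_names function and the sample documents are made up for this example. It shows that a non-ASCII element name parses fine, and that input which is not well-formed is rejected with a ParseError rather than recovered from.

```python
# -*- coding: utf-8 -*-
import xml.etree.ElementTree as ET


def element_names(xml_bytes):
    """Return tag names in document order, or None if the XML is not well-formed."""
    try:
        root = ET.fromstring(xml_bytes)
    except ET.ParseError:
        # A conforming XML parser does not guess at broken input.
        return None
    return [el.tag for el in root.iter()]


# Sample inputs invented for illustration.
good = u"<menu><caf\xe9>espresso</caf\xe9></menu>".encode("utf-8")
bad = b"<menu><item></menu>"  # <item> is never closed

names = element_names(good)   # tags of <menu> and its child
broken = element_names(bad)   # None
```

Whether returning None, logging, or skipping the document is the right reaction to a ParseError depends on the crawler; the point is only that the failure is explicit and easy to catch per document.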
