By default, lxml doesn’t understand the wbr tag, which is used to mark word-break opportunities in long words. It serializes it as <wbr></wbr> when it should be written simply as <wbr>, similar to the br tag.
How do I add this behavior to lxml?
Actually, it is not difficult to patch libxml2. (This walkthrough was done on Ubuntu 11.04 with Python 2.7.3.)
First, define a test program, wbr_test.py.

Make sure that it fails by running python wbr_test.py. It should insert a </wbr> before </body>, and print not ok at the end.

Download, extract and compile libxml2.

Install it, and install the Python libxml2 bindings.
Test your wbr_test.py once more, to make sure it fails with the latest libxml2 version.

First make a copy of HTMLparser.c, e.g. in /var/tmp.

Now edit the file HTMLparser.c at the top level of the libxml2 source. Search for the word forced (only one occurrence). You will be at the <br> tag definition. Copy the three lines starting with the line you just found. The most appropriate insert point is just before the end of the table (after the definition of <var>). To get the final comma right in the table, insert the three lines before the line with just '}', not the one with '};'.

In the newly inserted code, replace br with wbr and change DECL clear_attrs to NULL (assuming that a new tag does not have deprecated attributes).

The result should differ from the version in /var/tmp (diff -u HTMLparser.c /var/tmp) only by the three inserted lines.

Make and install.
Test your wbr_test.py once more. It should show OK.
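
The test program itself is not shown above, so here is a minimal sketch of what a wbr_test.py could look like. It is an assumption, not the original script: it uses lxml.html (since the question is about lxml), and the names SAMPLE, has_closing_wbr and verdict are made up for illustration. It parses a small document containing <wbr>, serializes it again, and reports not ok if a </wbr> end tag shows up in the output. Keep in mind that lxml only reflects the patch if it is actually linked against the rebuilt libxml2.

```python
"""wbr_test.py: hypothetical check of how <wbr> is parsed and serialized."""

# Assumed sample input; any document with a <wbr> inside text will do.
SAMPLE = "<html><body><p>super<wbr>califragilistic</p></body></html>"


def has_closing_wbr(serialized):
    """Return True if the serialized HTML contains a </wbr> end tag."""
    return "</wbr>" in serialized.lower()


def verdict(serialized):
    """Map serialized HTML to the status string printed by the script."""
    return "not ok" if has_closing_wbr(serialized) else "OK"


def main():
    # Imported here so the helpers above stay usable without lxml installed.
    try:
        import lxml.html
    except ImportError:
        print("lxml is not installed; nothing to test")
        return

    doc = lxml.html.document_fromstring(SAMPLE)
    out = lxml.html.tostring(doc, encoding="unicode")
    print(out)           # inspect where, if anywhere, </wbr> was inserted
    print(verdict(out))  # "not ok" while wbr is treated as a container


if __name__ == "__main__":
    main()
```

Checking the serialized string rather than the element tree keeps the test independent of how the parser nests the following text, which is the part that changes once wbr is declared as an empty element.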