Which one is better and more useful for malformed html?
I cannot find how to use libxml2.
Thanks.
On the libxml2 page you can see this note:
and on the lxml page this other one:
So essentially, with lxml you get exactly the same functionality, but with a Pythonic API compatible with the ElementTree library in the standard library (so the standard library documentation will be useful for learning how to use lxml). That's why lxml is preferred over libxml2 (even though the underlying implementation is the same one).
Edit: Having said that, as other answers explain, to parse malformed HTML your best option is to use BeautifulSoup. One interesting thing to note is that, if you have lxml installed, BeautifulSoup will use it, as explained in the documentation for the new version:
Anyway, even if BeautifulSoup uses lxml under the hood, you'll be able to parse broken HTML that you can't parse with xml directly. For example:
However:
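A minimal sketch of that contrast, assuming both lxml and the bs4 package are installed. The sample markup and the helper names parse_as_xml and parse_with_soup are made up for illustration:

```python
from bs4 import BeautifulSoup
from lxml import etree

BROKEN_HTML = "<p>Hello <b>world"  # hypothetical sample: both tags are left unclosed


def parse_as_xml(markup):
    """Try lxml's strict XML parser; return None if the markup is rejected."""
    try:
        return etree.fromstring(markup)
    except etree.XMLSyntaxError:
        return None


def parse_with_soup(markup):
    """Let BeautifulSoup repair the markup, with lxml as its backend parser."""
    return BeautifulSoup(markup, "lxml")


strict_result = parse_as_xml(BROKEN_HTML)
soup = parse_with_soup(BROKEN_HTML)

print(strict_result)      # the strict XML parser gives up on the unclosed tags
print(soup.b.get_text())  # the lenient parser still exposes the <b> element
```

The strict parser is only failing because it expects well-formed XML; the same library's HTML-oriented parsing is what BeautifulSoup relies on here.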
Finally, note that lxml also provides an interface to the old version of BeautifulSoup, as follows:
So at the end of the day, you'll probably be using lxml and BeautifulSoup anyway. The only thing you've got to choose is which API you like the most.
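As a rough illustration of the lxml interface to BeautifulSoup mentioned above, the sketch below uses lxml.html.soupparser and then walks the result with ElementTree-style calls. It assumes both lxml and BeautifulSoup are installed; which BeautifulSoup version soupparser picks up depends on the lxml release, and the sample markup is made up for the example:

```python
from lxml.html import soupparser

BROKEN_HTML = "<p>Hello <b>world"  # hypothetical sample with unclosed tags

# BeautifulSoup does the parsing; lxml converts the result into its own element tree.
root = soupparser.fromstring(BROKEN_HTML)

# From here on, the usual ElementTree-style API applies.
print(root.tag)
print(root.findtext(".//b"))
```

Either way the same tree ends up being built, so the choice is mostly about whether you prefer navigating it through BeautifulSoup objects or through ElementTree-style elements.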