I’m trying to parse a rather strange page. Here’s a simplified version:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" >
<html xmlns="http://www.w3.org/1999/xhtml">
<form id="x" method="post" action="x">
<input type="hidden" name="v1" value="v" />
<html xmlns="http://www.w3.org/1999/xhtml">
<input type="hidden" name="v2" value="v" />
</html>
</form>
</html>
Yes, there’s an html tag inside the form.
Is this valid (X)HTML at all? I know the page was (at least partially) generated with JavaServer Faces.
As to the actual problem:
>>> BeautifulSoup(html).find("form")
<form id="x" method="post" action="x">
<input type="hidden" name="v1" value="v" />
</form>
BeautifulSoup doesn’t like this at all: it drops the nested html element and everything inside it, as if it didn’t exist.
Has anyone gone through something similar?
I guess I could parse raw XML, but I’d like to use BeautifulSoup, if possible.
I’ve seen this happen when multiple server sources are combined without checking the output. I don’t think there’s a scenario in which an html tag is ever valid in the middle of a document (other than in an iframe tag). The snippet you posted certainly isn’t valid (see validator.w3.org).
If the rogue tag appears in a predictable location, a string replace is a quick solution so that you can subsequently parse it properly.
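A minimal sketch of that string-replace idea, assuming the only rogue tags are extra html open/close tags nested inside the outermost pair. The function name strip_nested_html_tags and the two regular expressions are illustrative and not part of BeautifulSoup or any other library; the approach is a plain regex cleanup, so it will not cope with html-like text inside comments or scripts.

```python
import re

# Hypothetical helper patterns: match an opening <html ...> tag and a closing </html> tag.
HTML_OPEN = re.compile(r"<html\b[^>]*>", re.IGNORECASE)
HTML_CLOSE = re.compile(r"</html\s*>", re.IGNORECASE)


def strip_nested_html_tags(markup):
    """Remove every html open/close tag except the outermost pair."""
    first_open = HTML_OPEN.search(markup)
    last_close = None
    for last_close in HTML_CLOSE.finditer(markup):
        pass  # keep iterating so last_close ends up as the final match

    # Nothing to clean if there is no sensible outer pair.
    if first_open is None or last_close is None:
        return markup
    if last_close.start() < first_open.end():
        return markup

    head = markup[:first_open.end()]                   # up to and including the outer <html ...>
    body = markup[first_open.end():last_close.start()]
    tail = markup[last_close.start():]                 # the outer </html> and anything after it

    # Drop the rogue tags but keep whatever they wrapped.
    body = HTML_OPEN.sub("", body)
    body = HTML_CLOSE.sub("", body)
    return head + body + tail
```

The cleaned string can then be handed to the parser in the same way as in the question, for example BeautifulSoup(strip_nested_html_tags(html)).find("form").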
Assuming the document conforms to its XHTML doctype as far as well-formedness goes (meaning it is well-formed XML even if it is not valid XHTML), you could:
div)
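If the page really is well-formed XML, one way to act on that assumption is to read it with an XML parser, which does not care that an html element is nested inside a form. The following is only a sketch using the standard library's xml.etree.ElementTree, tried against the simplified markup from the question; the function name hidden_input_names is made up for illustration, and a real page that uses named entities such as &nbsp; may fail to parse this way.

```python
import xml.etree.ElementTree as ET

# Namespace declared on the html elements in the question's markup.
XHTML_NS = "http://www.w3.org/1999/xhtml"


def hidden_input_names(markup):
    """Return the name attribute of every hidden input, however deeply nested."""
    root = ET.fromstring(markup)  # raises ET.ParseError if the markup is not well-formed
    input_tag = "{%s}input" % XHTML_NS  # ElementTree writes namespaced tags as {uri}local
    return [
        element.get("name")
        for element in root.iter(input_tag)
        if element.get("type") == "hidden"
    ]
```

Because the XML parser treats the inner html element as just another nested element, both v1 and v2 are reachable from the root.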