I’m trying to parse a quite strange page. Here’s a simplified version: <!DOCTYPE html

Question

Asked: June 18, 2026 (2026-06-18T03:53:49+00:00)

I’m trying to parse a quite strange page. Here’s a simplified version:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" >
<html xmlns="http://www.w3.org/1999/xhtml">
    <form id="x" method="post" action="x">
        <input type="hidden" name="v1" value="v" />
            <html xmlns="http://www.w3.org/1999/xhtml">
                <input type="hidden" name="v2" value="v" />
            </html>
    </form>
</html>

Yes, there’s an html tag inside the form.

Is this valid (X)HTML at all? I know the page was (at least partially) generated with JavaServer Faces.

As to the actual problem:

>>> BeautifulSoup(html).find("form")
<form id="x" method="post" action="x">
<input type="hidden" name="v1" value="v" />
</form>

BeautifulSoup doesn’t like this at all: it drops the inner html element, and the v2 input along with it, as if they didn’t exist.

Has anyone gone through something similar?
I guess I could parse raw XML, but I’d like to use BeautifulSoup, if possible.

1 Answer

Editorial Team · Answer 1 · 2026-06-18T03:53:50+00:00

I’ve seen this happen when output from multiple server-side sources is combined without checking the result. I don’t think there’s any scenario in which an html tag is valid in the middle of a document (the closest case is an iframe, which loads a separate document with its own html element). The snippet you posted certainly isn’t valid (validator.w3.org).

If the rogue tag appears in a predictable location, a string replace is a quick solution so that you can subsequently parse it properly.
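As a rough illustration of the string-replace idea, here is a minimal sketch that uses only the standard library. The helper name strip_nested_html and the two regular expressions are made up for this example, and it assumes the only html tags you want to keep are the first opening tag and the last closing tag.

```python
import re

# Hypothetical helper for illustration; the patterns are an assumption about
# how the rogue tags look, not something taken from the real page.
HTML_OPEN = re.compile(r"<html\b[^>]*>", re.IGNORECASE)
HTML_CLOSE = re.compile(r"</html\s*>", re.IGNORECASE)


def strip_nested_html(markup: str) -> str:
    """Remove every <html ...> tag except the first and every </html> except the last."""
    opens = list(HTML_OPEN.finditer(markup))
    closes = list(HTML_CLOSE.finditer(markup))
    # Spans to delete: all opening tags after the first, all closing tags before the last.
    spans = [m.span() for m in opens[1:]] + [m.span() for m in closes[:-1]]
    # Delete from the end of the string so earlier offsets stay valid.
    for start, end in sorted(spans, reverse=True):
        markup = markup[:start] + markup[end:]
    return markup


page = """<html xmlns="http://www.w3.org/1999/xhtml">
    <form id="x" method="post" action="x">
        <input type="hidden" name="v1" value="v" />
            <html xmlns="http://www.w3.org/1999/xhtml">
                <input type="hidden" name="v2" value="v" />
            </html>
    </form>
</html>"""

cleaned = strip_nested_html(page)
```

The cleaned string can then be passed to BeautifulSoup in the usual way. Because this is plain text matching, it would also strip html tags that appear inside comments or scripts, so it is only reasonable when the rogue tag really is predictable.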

“I guess I could parse raw XML”

Assuming the document is at least well-formed (meaning it is parseable as XML even if it is not valid XHTML), you could:

1. parse the document as XML
2. modify the markup into something valid (e.g. unwrap the children of the inner html element, or change that element to a div)
3. parse the result as HTML with BeautifulSoup.
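The first two steps could be sketched with the standard library's xml.etree.ElementTree, as below. This is only one possible approach under the well-formedness assumption above; the function name unwrap_nested_html is hypothetical, and the sample markup is the simplified page from the question.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
# Serialize the XHTML namespace as the default namespace instead of "ns0:".
ET.register_namespace("", XHTML_NS)


def unwrap_nested_html(root):
    """Replace every nested <html> element with its own children, in place."""
    html_tag = "{%s}html" % XHTML_NS
    # ElementTree has no parent pointers, so build a child -> parent map first.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    nested = [el for el in root.iter(html_tag) if el is not root]
    for el in nested:
        parent = parent_of[el]
        index = list(parent).index(el)
        # Note: text directly inside the nested element and its tail are
        # discarded, which is fine when they are only whitespace.
        parent.remove(el)
        for offset, child in enumerate(list(el)):
            parent.insert(index + offset, child)
            parent_of[child] = parent


page = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" >
<html xmlns="http://www.w3.org/1999/xhtml">
    <form id="x" method="post" action="x">
        <input type="hidden" name="v1" value="v" />
            <html xmlns="http://www.w3.org/1999/xhtml">
                <input type="hidden" name="v2" value="v" />
            </html>
    </form>
</html>"""

root = ET.fromstring(page)
unwrap_nested_html(root)
fixed = ET.tostring(root, encoding="unicode")
```

The fixed string is what would be handed to BeautifulSoup in the third step. To keep a wrapper instead of unwrapping, setting the nested element's tag to the namespaced div name would implement the "change it to a div" variant. If the real page uses named entities such as &nbsp;, the XML parse may fail, and the string-replace route is likely the simpler one.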
