I use hxt to parse some HTML. It has unescaped HTML inside <textarea>. hxt gives invalid results (it stumbles on a tag with content; in this case it's <a>). A minimal test case (for GHCi) is:
let doc = parseHtml "<textarea>before<a>link</a>after</textarea>"
runX . xshow $ doc //> hasName "textarea"
which gives [<textarea>before</textarea><textarea/>] as a result.
It looks like tags with no contents (e.g. <tag/>) do not break parsing.
Is there any way to parse such html with hxt?
The problem is that HandsomeSoup (which I'm assuming is where your parseHtml is from) is picky about things like the fact that a <textarea> can't contain an <a> in valid HTML, and will try to "fix" any such errors it sees.
Can you switch to hxt-tagsoup? It will still accept messy HTML (unclosed elements, etc.), but isn't so fussy about adherence to the HTML schema; specifically, it will let you have an <a> in a <textarea>. With it, the <a> stays inside the <textarea> in the parsed result, which I think is what you want.
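A minimal sketch of the hxt-tagsoup approach, assuming the hxt and hxt-tagsoup packages are installed. The helper name `textareas` and the exact option list are my own choices for illustration, not something taken from the question; `withTagSoup` comes from Text.XML.HXT.TagSoup and swaps in the TagSoup-based parser when the document is read:

```haskell
import Text.XML.HXT.Core
import Text.XML.HXT.TagSoup (withTagSoup)

-- Hypothetical helper: parse an HTML string with the TagSoup-based
-- parser and render every <textarea> element back to a string.
textareas :: String -> IO [String]
textareas html = runX . xshow $ doc //> hasName "textarea"
  where
    -- withParseHTML yes: treat the input as HTML rather than XML
    -- withTagSoup:       use hxt-tagsoup instead of the built-in HTML parser
    -- withWarnings no:   keep parser warnings off stderr
    doc = readString [withParseHTML yes, withTagSoup, withWarnings no] html

main :: IO ()
main = textareas "<textarea>before<a>link</a>after</textarea>" >>= mapM_ putStrLn
```

The query itself (`doc //> hasName "textarea"`) is unchanged from the question; only the way the document is read differs, so the rest of an existing hxt pipeline should not need to change.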