I’m using JSoup in an attempt to build valid XML from a couple of websites. Most of the time it has worked phenomenally well, but recently I’ve encountered some cases of bad HTML that JSoup can’t seem to fix.
<meta name="saploTags" content="Tag1,Tag2,Tag3," Tag4,Tag5,Tag6"/>
Results in
<meta name="saploTags" content="Tag1,Tag2,Tag3," tag4,tag5,tag6"="" />
This causes problems later on when I’m trying to index the resulting XML. Does anyone have any suggestions on what to do? Preferably I’d have everything between the leftmost and rightmost quotation marks escaped or removed in some way in order to prevent data loss (like content="Tag1,Tag2,Tag3,Tag4,Tag5,Tag6"). Otherwise it would be OK if JSoup cut off at the first closing quote, disregarding the last tags, like content="Tag1,Tag2,Tag3".
(I’ve found similar cases, e.g. <img src=".." alt="This text contains the quote "The quote" and here's some more text"/>, which cause the same kind of problem.)
Is it possible to get around this with jsoup, or have I reached a dead end?
/Regards, Magnus
That’s quite simply neither valid XML nor valid HTML. Those double quotes should be turned into character references (&quot;) if they’re to be considered part of the attribute value. Even if you could set a parser to be very lenient, it’s not gonna be able to solve this, because it is no longer clear where the attribute value ends.
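If you control the code that generates the markup, the character-reference approach looks roughly like this. It is a minimal sketch in plain Java with no library involved; the class and method names are made up for illustration.

```java
final class AttributeEscaper {
    private AttributeEscaper() {}

    /**
     * Escapes the characters that would otherwise end or corrupt a
     * double-quoted attribute value. Hypothetical helper, not a jsoup API.
     */
    static String escapeAttributeValue(String raw) {
        StringBuilder out = new StringBuilder(raw.length() + 16);
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;   // must be escaped first-class, or references get ambiguous
                case '"': out.append("&quot;"); break;  // the character causing the trouble in the question
                case '<': out.append("&lt;"); break;    // required for well-formed XML attribute values
                default:  out.append(c);
            }
        }
        return out.toString();
    }
}
```

With that, the alt text from the question would be written as alt="This text contains the quote &quot;The quote&quot; and here's some more text", which any HTML or XML parser reads back unambiguously.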
Trying to automatically fix this seems rather difficult. There’s all sorts of corner cases that’ll wreak havoc on any sort of solution. How’s this supposed to be interpreted, for example:
Look at how the SO code formatter struggles with it.
Even making sense of this yourself is difficult, let alone writing a tool that’s gonna make sense of what is or isn’t attribute content.
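If you decide to try anyway, the best you can do is a narrow heuristic applied to the raw text before it reaches the parser. The sketch below is one such heuristic under a strong assumption: it only looks at meta tags, and it assumes content is the last attribute in the tag, so that everything between the first quote after content= and the last quote before the end of the tag belongs to the value. The class name, method name, and regular expression are mine, not part of jsoup.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MetaContentRepair {
    // group 1: the tag up to and including content="
    // group 2: the raw value, possibly containing stray double quotes
    // group 3: the final quote plus the end of the tag
    // ASSUMPTION: content is the LAST attribute of the meta tag.
    private static final Pattern META_CONTENT = Pattern.compile(
            "(<meta\\b[^>]*?\\bcontent=\")([^>]*)(\"\\s*/?>)",
            Pattern.CASE_INSENSITIVE);

    private MetaContentRepair() {}

    /** Replaces stray double quotes inside the content value with &quot;. */
    static String escapeStrayQuotes(String html) {
        Matcher m = META_CONTENT.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = m.group(2).replace("\"", "&quot;");
            String fixedTag = m.group(1) + value + m.group(3);
            // quoteReplacement keeps '$' and '\' in the value from being
            // interpreted as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(fixedTag));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Applied to the tag from the question, this yields content="Tag1,Tag2,Tag3,&quot; Tag4,Tag5,Tag6", so no data is lost. It also shows the problem with this kind of fix: feed it a perfectly valid <meta content="a" name="b"> and it will swallow the name attribute into the value, because the assumption about attribute order no longer holds. A ">" inside the value breaks it as well.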
Simple approach? Just don’t accept invalid HTML. It’s lenient enough as it is, with most parsers allowing lower case and upper case element names, closing tags not always being mandatory etc. If people still manage to generate invalid HTML, then too bad for them.