For example, if I have this HTML:
<div>this is a test < text</div>
The < after "test" is an error, and the correct HTML should be:
<div>this is a test &lt; text</div>
But I have a lot of HTML files that, by mistake, were not encoded, and I need to fix this error so I can parse them later. The original source of the data is not available, so the only option is to fix the HTML I have.
The same applies to the > character and to text that has both < and > characters, like “<2000> – <2004>”. I would like to hear ideas for algorithms or libraries that can help me. Thanks.
Note: the HTML above is only a sample; the work has to be done on large HTML files.
I’d suggest this:
1. Identify and map the locations of all known tags, like <div> and </a>.
2. Replace < and > with &lt; and &gt; everywhere outside the map you built in step 1.
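
The two steps above could be sketched in Python roughly like this. It is only a minimal sketch under some assumptions: the whitelist `KNOWN_TAGS` and the function name `escape_stray_brackets` are made up for illustration, the tag pattern is a simple regex rather than a real HTML parser, and it assumes attribute values do not themselves contain < or >. Comments, doctype declarations and script blocks are not handled.

```python
import re

# Hypothetical whitelist of tag names; extend it to cover the tags in your files.
KNOWN_TAGS = ("div", "p", "a", "span", "b", "i", "br")

# Step 1: a pattern that matches an opening, closing, or self-closing tag
# whose name is in KNOWN_TAGS (assumes no < or > inside attribute values).
TAG_RE = re.compile(
    r"</?(?:%s)\b[^<>]*>" % "|".join(KNOWN_TAGS),
    re.IGNORECASE,
)

def _escape(text):
    # Encode only the angle brackets; existing entities are left untouched.
    return text.replace("<", "&lt;").replace(">", "&gt;")

def escape_stray_brackets(html):
    """Step 2: escape < and > everywhere outside the known tags."""
    out = []
    pos = 0
    for match in TAG_RE.finditer(html):
        out.append(_escape(html[pos:match.start()]))  # text before the tag
        out.append(match.group(0))                    # the tag itself, unchanged
        pos = match.end()
    out.append(_escape(html[pos:]))                   # text after the last tag
    return "".join(out)

if __name__ == "__main__":
    print(escape_stray_brackets("<div>this is a test < text</div>"))
    print(escape_stray_brackets("<p><2000> – <2004></p>"))
```

Because the input is scanned once and only the matched tags are kept verbatim, this should cope with large files. The weak spot is text that happens to look like a known tag, for example "a <b and c> d", which would be left unescaped, so the whitelist is worth keeping as narrow as your files allow.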