I have a file that contains XML-like tags along with a lot of invalid XML data, so I cannot run a normal XML validator such as xmllint on it. I want to ignore the invalid XML data and just check the file for well-formedness.
<?xml version="1.0" encoding="utf-8"?>
<HOST>
<VERSION>5</VERSION>
<OUTPUT>
bunch of text which also contains tags like <SYSTEM>
more tags like <-> <temp> & ;
some more text and numbers
</OUTPUT>
</HOST>
In the above example, can I ignore things like <SYSTEM>, <->, <temp>, & and ; and check only that the known tags are properly opened and closed: <HOST> </HOST>, <VERSION> </VERSION> and <OUTPUT> </OUTPUT>? The above file should be reported as well formed, since every valid tag has a matching opening and closing tag.
Can I create my own DTD/XSD that looks only for the tags I want and ignores the rest, using Perl?
My main problem is that I don't know the right keywords to describe this, which is why Google is not giving me the right results. Can someone please point me in the right direction? Thanks.
You’ll have to clean up the input first. Once you do that, then you can do DTD, schemas, proper parsing, and whatever.
If it’s just the OUTPUT tag, you can try this:

After that is done, your input should be ready for XML parsing, validation, etc. If your input might contain CDATA sections, you’ll have to do more, but that should be enough to get started.
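As a minimal sketch of that clean-up step (not necessarily the code originally posted), the Perl below escapes `&`, `<` and `>` inside each OUTPUT element so that the rest of the file can be handed to a real XML parser. The helper names `escape_text` and `escape_output_blocks` are made up for illustration. It assumes OUTPUT elements carry no attributes, are not nested, and that the free text never contains a literal `</OUTPUT>`.

```perl
use strict;
use warnings;

# Hypothetical helper: turn XML-special characters into entities.
# The ampersand is handled first so the entities added afterwards
# are not escaped a second time.
sub escape_text {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Hypothetical helper: escape the body of every <OUTPUT>...</OUTPUT>
# block and leave everything outside those blocks untouched.
# Assumption: OUTPUT has no attributes and is never nested.
sub escape_output_blocks {
    my ($xml) = @_;
    $xml =~ s{(<OUTPUT>)(.*?)(</OUTPUT>)}{
        my ($open, $body, $close) = ($1, $2, $3);
        $open . escape_text($body) . $close;
    }gse;
    return $xml;
}
```

Call `escape_output_blocks` on the slurped file contents and pass the result to whichever parser you prefer, for example `xmllint --noout` or XML::LibXML. If that parser accepts it, the HOST, VERSION and OUTPUT tags were balanced.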