I have a series of XML files produced by a data playback utility. The utility produces correctly formed XML tags. Unfortunately, the utility isn’t perfect. Some of the Java objects it attempts to serialize fail, and they are simply inserted as binary blobs in between the other, valid XML tags.
For example…
<track>
<cto>Valid_XML_HERE</cto>@Binary_Blob_of_Junk@<cto>(...)</cto>
</track>
Environment is RHEL-5, which means Python 2.4, Perl, or SED/AWK solutions are usable.
Any suggestions on how to remove the junk?
I built on Birei’s suggestion to inspect tree elements, but came up with a SED-only solution. As shown in the OP, the <cto> tags happen to be on one continuous line. The solution, then, was to split the lines so that each <cto> tag was on a new line (which also isolates the junk binary data on lines of its own) and then simply select the lines starting with a <cto> tag. The <tracks> and </tracks> tags can then be added to the new file via CAT.
Here are the SED commands that I’ve tested and confirmed to work…
Step 1. Isolate the <cto> tags to be on new lines.
Step 2. Select only the lines starting with a <cto> tag.
Step 3. Format the new XML document.
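The three steps above could be sketched roughly as follows. This is a reconstruction under assumptions, not necessarily the exact commands used: the file names (playback.xml, cto_lines.tmp, clean.xml) are hypothetical, the demo input is modeled on the example in the question, and the `\n` in the replacement relies on GNU sed, which is what RHEL ships.

```shell
#!/bin/sh
# Hypothetical file names; the post does not name its files.
in=playback.xml
body=cto_lines.tmp
out=clean.xml

# Demo input modeled on the example in the question, with two raw
# bytes (\001 and \377) standing in for a binary blob.
printf '<track>\n<cto>A</cto>@\001\377junk@<cto>B</cto>\n</track>\n' > "$in"

# Step 1: start a new line before every <cto> and after every </cto>,
# so anything between two elements lands on a line of its own.
# Step 2: print only the lines that begin with <cto>.
# LC_ALL=C makes sed treat the input as raw bytes.
# Assumes GNU sed (\n in the replacement means a newline).
LC_ALL=C sed -e 's|<cto>|\n<cto>|g' -e 's|</cto>|</cto>\n|g' "$in" |
    LC_ALL=C sed -n '/^<cto>/p' > "$body"

# Step 3: wrap the surviving lines in the root element. The tag name
# follows the answer text; adjust it to match the real document.
{ echo '<tracks>'; cat "$body"; echo '</tracks>'; } > "$out"
rm -f "$body"

cat "$out"
```

This only works while each <cto> element fits on a single line and the junk never contains a literal <cto>. If either assumption fails, lines belonging to valid elements would be dropped or junk would be kept, and a real parser would be the safer route.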
Thanks for all of your respective inputs.