I’m looking for an HTML or XML parser that lets one access the offset/position of the current element in the input string or file.
For example, when walking through this string:
<div>
<p>Lorem ipsum dolor sit amet, consectetur adipisicing elit</p>
<p>sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.</p>
</div>
I’m looking for a way to get the starting position (including whitespace) of each <p> tag, here: 7 and 72.
It’d be great if a PHP parser supported that natively (I’ve looked at DOM, XMLReader, and other libraries mentioned in this SO question but haven’t found a way to do it), but otherwise any language/framework would be fine.
Note: Related to this question, but less localized.
Maybe you could use the Generic XML parser class (also on GitHub).
According to the author’s description:
I’ve tested it with this code:
The test.xml file contains your sample HTML snippet. By running the script from the command line I get this output:
So, the Byte field is probably what you’re looking for. For a better understanding of how it works, also have a look at its source code.
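Since the question says any language would be fine, here is a minimal Python sketch of the same idea. It is not the code of the PHP class above. It assumes the input is well-formed XML and uses the standard-library `xml.parsers.expat` module, whose `CurrentByteIndex` attribute reports the byte position of the markup that triggered the current callback. The names `element_offsets` and `sample` are made up for this illustration.

```python
import xml.parsers.expat


def element_offsets(xml_bytes, tag):
    """Return the byte offset of every start tag named `tag` in `xml_bytes`."""
    offsets = []
    parser = xml.parsers.expat.ParserCreate()

    def on_start(name, attrs):
        # Inside a handler, CurrentByteIndex points at the first byte of the
        # markup that produced this event, i.e. the '<' of the start tag.
        if name == tag:
            offsets.append(parser.CurrentByteIndex)

    parser.StartElementHandler = on_start
    parser.Parse(xml_bytes, True)  # True: this is the final chunk of input
    return offsets


# The snippet from the question, without any indentation (an assumption).
sample = (
    b"<div>\n"
    b"<p>Lorem ipsum dolor sit amet, consectetur adipisicing elit</p>\n"
    b"<p>sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.</p>\n"
    b"</div>"
)

for offset in element_offsets(sample, "p"):
    # Slice the input at the reported offset to confirm it lands on "<p>".
    print(offset, sample[offset:offset + 3])
```

The offsets are zero-based and counted in bytes, so they can differ by one from positions counted from 1, and any indentation or multi-byte characters before a tag shift them accordingly.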