I am writing a Chrome Extension to convert HTML pages into a different format.
If I use document.getElementsByTagName("*") and iterate over that collection, I can see all the tags. However, it’s a flat representation. I need to detect the opening and closing “events”, like a SAX parser, so that my translated output maintains proper containment/nesting.
What is the right way to do this in JavaScript? It seems a little awkward to have to do this manually. Is there any other way to do this?
To illustrate what I mean…
<html>
<body>
<h1>Header</h1>
<div>
<p>some text and a missing closing tag
<p>some more text</p>
</div>
<p>some more dirty HTML
</body>
<html>
I need to get the events in this order:
html open
body open
h1 open
text
h1 close
div open
p open
text
p close
p open
text
p close
div close
p open
text
p close
body close
html close
I get the feeling it’s up to me to track the SAX-parser-like events as part of my iteration. Are there any other options available to me? If not, can you point me to any sample code?
Thanks!
Just traverse each node and all the children of each node. When a level of children is exhausted, the tag is closed.
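A minimal sketch of that traversal, assuming a standard DOM tree. The `log` parameter is a hypothetical hook added here for illustration (it defaults to `console.log`), and tag names are lowercased only to match the event list in the question:

```javascript
// Numeric nodeType values defined by the DOM standard
// (the same values as Node.ELEMENT_NODE and Node.TEXT_NODE in a browser).
const ELEMENT_NODE = 1;
const TEXT_NODE = 3;

// Report an "open" event for `node`, visit its children in document order,
// then report "close" once the children are exhausted.
// `log` is a hypothetical callback for this sketch, not a DOM API.
function parseChildren(node, log = console.log) {
  const name = node.nodeName.toLowerCase();
  log(name + " open");
  for (const child of Array.from(node.childNodes)) {
    if (child.nodeType === ELEMENT_NODE) {
      parseChildren(child, log); // recursion emits the nested open/close pair
    } else if (child.nodeType === TEXT_NODE) {
      log("text");
    }
    // Comments and other node types are ignored in this sketch.
  }
  log(name + " close");
}
```

Because the browser has already repaired the markup while building the DOM, a missing closing tag in the source still shows up as a close event here.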
To traverse the whole page, just do
parseChildren(document.documentElement). You can replace the console.log statements with whatever behavior you like.
Note that this code reports a lot of text nodes, because the whitespace between tags counts as a text node. To avoid this, just expand the text handling code:
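One way the expanded text handling might look, as a sketch: the same traversal, but a text node is reported only if it contains at least one non-whitespace character. As before, the `log` parameter is a hypothetical hook for illustration and not part of any DOM API.

```javascript
// Numeric nodeType values defined by the DOM standard.
const ELEMENT_NODE = 1;
const TEXT_NODE = 3;

// Same depth-first traversal, with whitespace-only text nodes skipped.
// `log` is a hypothetical callback for this sketch; it defaults to console.log.
function parseChildren(node, log = console.log) {
  const name = node.nodeName.toLowerCase();
  log(name + " open");
  for (const child of Array.from(node.childNodes)) {
    if (child.nodeType === ELEMENT_NODE) {
      parseChildren(child, log);
    } else if (child.nodeType === TEXT_NODE) {
      // Expanded text handling: /\S/ matches any non-whitespace character,
      // so indentation and line breaks between tags are not reported.
      if (/\S/.test(child.nodeValue)) {
        log("text");
      }
    }
  }
  log(name + " close");
}
```

Whitespace can be significant in some contexts, for example inside a pre element, so this filter may need an exception there depending on the output format.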