I am writing a Chrome Extension to convert HTML pages into a different format.

Question

Asked: June 9, 2026 (2026-06-09T23:27:20+00:00)

I am writing a Chrome Extension to convert HTML pages into a different format.

If I use document.getElementsByTagName("*") and iterate over that collection, I can see all the tags. However, it’s a flat representation. I need to detect the opening and closing “events”, like a SAX parser, so that my translated output maintains proper containment/nesting.

What is the right way to do this in JavaScript? It seems a little awkward to have to do this manually. Is there any other way to do this?

To illustrate what I mean…

   <html>
       <body>
           <h1>Header</h1>
           <div>
               <p>some text and a missing closing tag
               <p>some more text</p>
           </div>
           <p>some more dirty HTML
        </body>
    <html>

I need to get the events in this order:

    html open
    body open
    h1 open
    text
    h1 close
    div open
    p open
    text
    p close
    p open
    text
    p close
    div close
    p open
    text
    p close
    body close
    html close

I get the feeling it’s up to me to track the SAX-parser-like events as part of my iteration. Are there any other options available to me? If not, can you point me to any sample code?

Thanks!

1 Answer

Editorial Team · Answer 1 · 2026-06-09T23:27:22+00:00

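Traverse each node and, recursively, all of its children. When a node's children are exhausted, its tag is closed. Because the browser's parser has already repaired the markup while building the DOM (for example, by supplying the missing closing p tags), the traversal yields properly nested open and close events even for dirty HTML.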

function parseChildren(node) {

    // if this is a text node, it has no children or open/close tags
    if (node.nodeType === 3) {
        console.log("text");
        return;
    }

    // comments, doctypes, and other non-element nodes have no tagName, so skip them
    if (node.nodeType !== 1) {
        return;
    }

    console.log(node.tagName.toLowerCase() + " open");

    // parse the child nodes of this node
    for (var i = 0; i < node.childNodes.length; ++i) {
        parseChildren(node.childNodes[i]);
    }

    // all the children are used up, so this tag is done
    console.log(node.tagName.toLowerCase() + " close");
}

To traverse the whole page, call parseChildren(document.documentElement), which starts at the root html element. You can replace the console.log statements with whatever behavior you like.
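
As one way to swap out the console.log calls, here is a minimal sketch that passes SAX-style callbacks into the walker. The names walk, collectEvents, onOpen, onText, and onClose are hypothetical, chosen only for illustration; the code relies only on the standard nodeType, tagName, childNodes, and data properties.

```javascript
// Node type constants (same values as Node.ELEMENT_NODE and Node.TEXT_NODE in the DOM).
const ELEMENT_NODE = 1;
const TEXT_NODE = 3;

// Walk a subtree depth-first, calling the (hypothetical) handlers
// onOpen(name), onText(data), and onClose(name) in document order.
function walk(node, handlers) {
    if (node.nodeType === TEXT_NODE) {
        handlers.onText(node.data);
        return;
    }
    // Skip comments, doctypes, and anything else that is not an element.
    if (node.nodeType !== ELEMENT_NODE) {
        return;
    }
    const name = node.tagName.toLowerCase();
    handlers.onOpen(name);
    for (const child of node.childNodes) {
        walk(child, handlers);
    }
    handlers.onClose(name);
}

// Example handlers that record events in an array instead of logging them.
function collectEvents(root) {
    const events = [];
    walk(root, {
        onOpen: (name) => events.push(name + " open"),
        onText: () => events.push("text"),
        onClose: (name) => events.push(name + " close"),
    });
    return events;
}
```

In a content script, collectEvents(document.documentElement) should return the events for the whole page. On a real page the list will also include the head element and any whitespace-only text nodes between tags.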

Note that this code reports a lot of text nodes, because the whitespace between tags counts as a text node. To avoid this, just expand the text handling code:

    if(node.nodeType == 3) {
        // if this node is all whitespace, don't report it
        if(node.data.replace(/\s/g,'') == '') { return; }

        // otherwise, report it
        console.log("text");
        return;
    }
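
Putting the traversal and the whitespace check together, the following sketch shows how a depth counter can carry the nesting into translated output. The function name toOutline and the indented-outline format are assumptions made for illustration, not part of the original answer; a real converter would emit its own target format at the same three points.

```javascript
// Convert a subtree into indented outline lines, one per open/text/close event.
// The depth parameter is what preserves containment in the output.
function toOutline(node, depth = 0, lines = []) {
    const indent = "  ".repeat(depth);
    if (node.nodeType === 3) {
        // Text node: report it only when it is not all whitespace.
        const text = node.data.trim();
        if (text !== "") {
            lines.push(indent + "text: " + text);
        }
        return lines;
    }
    if (node.nodeType !== 1) {
        return lines; // ignore comments and other non-element nodes
    }
    const name = node.tagName.toLowerCase();
    lines.push(indent + name + " open");
    for (const child of node.childNodes) {
        toOutline(child, depth + 1, lines);
    }
    lines.push(indent + name + " close");
    return lines;
}
```

Calling toOutline(document.body).join("\n") in the page context would give one line per event, indented by nesting level.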
