I’ve just run into a pathological case with HTML parsing. I’ve always thought that a <script> tag would run until the first closing </script> tag. But it turns out this is not always the case.
This is valid:
<script><!--
alert('<script></script>');
--></script>
And even this is valid:
<script><!--
alert('<script></script>');
</script>
But this is not:
<script><!--
alert('</script>');
--></script>
And neither is this:
<script>
alert('<script></script>');
</script>
This behavior is consistent in Firefox and Chrome. So, as hard as it is to believe, browsers seem to accept an open+close script tag pair inside an HTML comment inside a script tag. So the question is: how do browsers really parse script tags?
This matters because the HTML parsing library I’m using, Nokogiri, assumed the obvious (but incorrect) until-the-first-closing-tag rule and did not handle this edge case. I imagine most other libraries would not handle it either.
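To make the failure concrete, here is a minimal Python sketch of the until-the-first-closing-tag rule. It is not Nokogiri's actual code: `naive_script_end` is a hypothetical helper written only for illustration, and the case-insensitive match is my assumption.

```python
def naive_script_end(html, start=0):
    """Index of the first '</script' at or after `start`, or -1 if none."""
    return html.lower().find("</script", start)

# The first sample from above, which browsers treat as one single script.
page = "<script><!--\nalert('<script></script>');\n--></script>"
body_start = len("<script>")

end = naive_script_end(page, body_start)
# The naive rule stops at the closing tag inside the string literal,
# so the extracted script body is cut off mid-statement.
print(repr(page[body_start:end]))
```

The extracted body ends right after `alert('<script>`, which is not where Firefox and Chrome end the script.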
After poring over the links given by Tim and Jukka I came to the following answer:
When a <script> tag is encountered, the parser goes to data1 state
If <!-- is encountered while in data1 state, switch to data2 state
If --> is encountered while in any state, switch to data1 state
If <script[\s/>] is encountered while in data2 state, switch to data3 state
If </script[\s/>] is encountered while in data3 state, switch to data2 state
If </script[\s/>] is encountered while in any other state, stop parsing
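
The rules above can be sketched as a small state machine. This is a rough Python illustration of the rules exactly as I listed them, not the full HTML5 tokenizer and not what any browser or library actually runs. The function name `find_script_end`, the state labels, and the case-insensitive matching are my own assumptions.

```python
import re

# Tokens that can change state, taken from the rules listed above.
# Case-insensitive matching is an assumption (HTML tag names ignore case).
TOKEN = re.compile(r"<!--|-->|</script[\s/>]|<script[\s/>]", re.IGNORECASE)

def find_script_end(html, start=0):
    """Return the index of the '</script' that ends the script, or None.

    `start` must point just past the opening <script ...> tag.
    """
    state = "data1"
    for m in TOKEN.finditer(html, start):
        tok = m.group().lower()
        if tok == "<!--":
            if state == "data1":
                state = "data2"
        elif tok == "-->":
            state = "data1"              # applies in any state
        elif tok.startswith("</script"):
            if state == "data3":
                state = "data2"          # closes the nested <script>
            else:
                return m.start()         # real end of the script
        else:                            # "<script" + whitespace, "/" or ">"
            if state == "data2":
                state = "data3"
    return None                          # no terminating </script> found

# The four samples from the question, in the same order.
samples = [
    "<script><!--\nalert('<script></script>');\n--></script>",
    "<script><!--\nalert('<script></script>');\n</script>",
    "<script><!--\nalert('</script>');\n--></script>",
    "<script>\nalert('<script></script>');\n</script>",
]
ends = [find_script_end(s, len("<script>")) for s in samples]
```

For the first two samples the returned index is the last `</script>` in the string, so the whole block is one script. For the last two it is the first `</script>`, inside the string literal, which matches what Firefox and Chrome do.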