I’m writing a really simple lexer for syntax highlighting of arbitrary text formats, one of which is HTML. The goal of the lexer is just to provide a flat stream of tokens.
I started with the XML tutorial on the ANTLR 3 website, but I am having some trouble with script tags.
An example of the HTML which causes this problem:
<head>
<script>alert(2 < 3);</script>
</head>
And the grammar:
@members {
boolean inTag = false;
}
TAG_START_OPEN : '<'
{ inTag = true; } ;
TAG_END_OPEN : '</'
{ inTag = true; } ;
TAG_CLOSE : { inTag }?=> '>' { inTag = false; } ;
TAG_SELF_CLOSE : { inTag }?=> '/>' { inTag = false; } ;
PCDATA : { !inTag }?=> (~'<')+ ;
// ...
The problem is that the lexer gets confused when it sees the '<' character within the JavaScript code and treats it as the start of a tag. I guess the goal would be for the lexer, if the open tag was a script tag, to use lookahead to determine whether a '<' is followed by '/script>'. However, I’m unsure of how to do this nicely with ANTLR.
Thanks in advance for any help.
Here’s a quick demo of how you could accomplish this:
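As a rough illustration of the lookahead idea from the question, here is a minimal sketch in plain Java rather than an ANTLR grammar. It keeps the same inTag flag and adds a second flag that is set once a <script> open tag has been closed; while that flag is set, everything up to the next '</script' is emitted as one token. The class name ScriptAwareLexer, the Token class, and the NAME and SCRIPT_BODY token kinds are all made up for this sketch and are not part of the original grammar. Attribute handling is deliberately crude (quoted values are not treated specially), so treat it as a starting point only.

```java
import java.util.ArrayList;
import java.util.List;

public class ScriptAwareLexer {
    // Kinds loosely mirror the grammar's rule names; NAME and SCRIPT_BODY are additions for this sketch.
    public enum Kind { TAG_START_OPEN, TAG_END_OPEN, TAG_CLOSE, TAG_SELF_CLOSE, NAME, PCDATA, SCRIPT_BODY }

    public static final class Token {
        public final Kind kind;
        public final String text;
        Token(Kind kind, String text) { this.kind = kind; this.text = text; }
        @Override public String toString() { return kind + "[" + text + "]"; }
    }

    // Case-insensitive search for needle in s, starting at index from; returns -1 if absent.
    static int indexOfIgnoreCase(String s, String needle, int from) {
        for (int k = from; k + needle.length() <= s.length(); k++) {
            if (s.regionMatches(true, k, needle, 0, needle.length())) return k;
        }
        return -1;
    }

    public static List<Token> tokenize(String src) {
        List<Token> out = new ArrayList<>();
        boolean inTag = false;    // same role as the grammar's inTag flag
        boolean endTag = false;   // true if the current tag began with "</"
        boolean inScript = false; // true right after a <script ...> open tag has been closed
        String tagName = null;    // first name seen inside the current tag
        int i = 0;
        while (i < src.length()) {
            if (inScript) {
                // Lookahead: a '<' only ends the script body if it begins "</script".
                int end = indexOfIgnoreCase(src, "</script", i);
                if (end < 0) end = src.length();
                if (end > i) out.add(new Token(Kind.SCRIPT_BODY, src.substring(i, end)));
                i = end;
                inScript = false;
            } else if (!inTag) {
                if (src.startsWith("</", i)) {
                    out.add(new Token(Kind.TAG_END_OPEN, "</"));
                    i += 2; inTag = true; endTag = true; tagName = null;
                } else if (src.charAt(i) == '<') {
                    out.add(new Token(Kind.TAG_START_OPEN, "<"));
                    i += 1; inTag = true; endTag = false; tagName = null;
                } else {
                    // Plain character data runs up to the next '<'.
                    int end = src.indexOf('<', i);
                    if (end < 0) end = src.length();
                    out.add(new Token(Kind.PCDATA, src.substring(i, end)));
                    i = end;
                }
            } else {
                char c = src.charAt(i);
                if (src.startsWith("/>", i)) {
                    out.add(new Token(Kind.TAG_SELF_CLOSE, "/>"));
                    i += 2; inTag = false;
                } else if (c == '>') {
                    out.add(new Token(Kind.TAG_CLOSE, ">"));
                    i += 1; inTag = false;
                    // Only an opening <script> tag switches to script-body mode.
                    inScript = !endTag && "script".equalsIgnoreCase(tagName);
                } else if (Character.isLetterOrDigit(c)) {
                    int end = i;
                    while (end < src.length() && Character.isLetterOrDigit(src.charAt(end))) end++;
                    String name = src.substring(i, end);
                    if (tagName == null) tagName = name;
                    out.add(new Token(Kind.NAME, name));
                    i = end;
                } else {
                    i++; // simplification: skip whitespace, '=', quotes, etc.
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "<head>\n<script>alert(2 < 3);</script>\n</head>";
        for (Token t : tokenize(html)) System.out.println(t);
    }
}
```

With the HTML from the question, the '<' in 2 < 3 stays inside a single SCRIPT_BODY token instead of opening a tag. In an ANTLR 3 lexer the same state could presumably be carried in @members fields and checked from gated predicates, in the same way inTag already is.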
If I now parse the input:
the following AST will be created by the generated parser: