I have a simple language which consists of patterns like
size(50*50)
start(10, 20, -x)
forward(15)
stop
This is an example of a turtle-drawing language, and I need to tokenize it properly. The above is a sample source file; statements and expressions are separated by newlines. I set up my Scanner to use delimiters like newlines. I expect next("start") to consume the string “start”, after which I call next("(") to consume the opening parenthesis. However, it does not behave the way I expect. Has the Scanner already broken the input into tokens based on the delimiter, or do I need to approach this differently? For me, “size”, “(”, “50”, “*”, “50” and “)” on the first line would be separate tokens, but that is evidently not what Scanner produces.
How can I tokenize the above with as little code as possible? I am writing an interpreter, not a tokenizer, so I don’t want to spend time on tokenizing right now; I would just like Scanner to work with me here.
My useDelimiter call is as follows:
Scanner s ///...
s.useDelimiter(Pattern.compile("[\\s]&&[^\\r\\n]"));
The first call to next() returns the entire file contents. Without the useDelimiter call above, it returns the entire first line.
To write a proper parser, you need to define your language in a formal grammar. Trust me, you want to do it properly or you will have problems downstream.
You can probably represent your tokens as regular expressions at the lowest level, but first you need to be clear about your grammar, which describes how tokens combine into larger syntactic structures. You can implement the grammar as a set of recursive functions (methods), one per production. Each production function can use the Scanner to test whether it is looking at a token it wants. But Scanner consumes the input as it goes, and you can’t back up.
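To make the idea of production methods concrete, here is a minimal sketch. The grammar in the comments is only my guess at the language from the four sample lines, and the class name TurtleParser, its methods, and the printed output format are all made up for illustration. Note that it keeps the tokens in a list with an index instead of reading from a Scanner, so peeking at a token does not consume it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical recursive-descent parser for one statement of the turtle language.
class TurtleParser {
    // Assumed token shapes: an integer, a name, or any single non-blank character.
    private static final Pattern TOKEN = Pattern.compile("\\d+|[A-Za-z]+|\\S");

    private final List<String> tokens = new ArrayList<>();
    private int pos = 0; // index of the next unread token; resetting it rewinds

    TurtleParser(String line) {
        Matcher m = TOKEN.matcher(line);
        while (m.find()) tokens.add(m.group());
    }

    // Looks at the next token without consuming it (null at end of input).
    private String peek() { return pos < tokens.size() ? tokens.get(pos) : null; }

    // Consumes the next token only if it equals the expected text.
    private boolean accept(String expected) {
        if (expected.equals(peek())) { pos++; return true; }
        return false;
    }

    private void expect(String expected) {
        if (!accept(expected))
            throw new IllegalStateException("expected '" + expected + "' but found " + peek());
    }

    // statement := NAME [ "(" expr { "," expr } ")" ]
    String statement() {
        String name = peek();
        if (name == null || !name.matches("[A-Za-z]+"))
            throw new IllegalStateException("expected a command name but found " + name);
        pos++;
        StringBuilder out = new StringBuilder(name);
        if (accept("(")) {
            out.append(' ').append(expr());
            while (accept(",")) out.append(' ').append(expr());
            expect(")");
        }
        if (peek() != null)
            throw new IllegalStateException("unexpected trailing token " + peek());
        return out.toString();
    }

    // expr := term { "*" term }
    private String expr() {
        String left = term();
        while (accept("*")) left = "(" + left + " * " + term() + ")";
        return left;
    }

    // term := [ "-" ] ( NUMBER | NAME )
    private String term() {
        String sign = accept("-") ? "-" : "";
        String t = peek();
        if (t == null || !t.matches("\\d+|[A-Za-z]+"))
            throw new IllegalStateException("expected a number or name but found " + t);
        pos++;
        return sign + t;
    }

    static String parse(String line) { return new TurtleParser(line).statement(); }

    public static void main(String[] args) {
        System.out.println(parse("size(50*50)"));       // size (50 * 50)
        System.out.println(parse("start(10, 20, -x)")); // start 10 20 -x
        System.out.println(parse("stop"));              // stop
    }
}
```

A real interpreter would build a syntax tree or execute the command instead of returning a string; the string is only there to show that each production recognised its part of the input.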
If you use Scanner, you will find the following things unsuitable:
1. It will always parse a token according to the regular expression,
1.1 so even if you do get a token you can use, you will have to write more code to decide exactly what token it was
1.2 and you may not be able to represent your language grammar as one big expression
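A small experiment shows why next("start") does not do what the question hopes. Scanner first cuts the input at the delimiter and only then checks whether the complete token matches the pattern; if it does not, next throws InputMismatchException. The class and method names below are made up for this demonstration, and it uses the default whitespace delimiter rather than the one from the question.

```java
import java.util.InputMismatchException;
import java.util.Scanner;

// Hypothetical demo class; only the Scanner calls are real API.
class ScannerTokenDemo {
    // Returns true if Scanner.next(pattern) accepts the first token of the input.
    static boolean firstTokenMatches(String input, String pattern) {
        try (Scanner s = new Scanner(input)) {
            s.next(pattern); // matches against the whole delimiter-separated token
            return true;
        } catch (InputMismatchException e) {
            return false;    // the token exists but does not match the pattern
        }
    }

    public static void main(String[] args) {
        // The first token is "start(10," so the pattern "start" does not match it.
        System.out.println(firstTokenMatches("start(10, 20, -x)", "start"));  // false
        // With a space after the name, the first token is exactly "start".
        System.out.println(firstTokenMatches("start (10, 20, -x)", "start")); // true
    }
}
```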
I suggest you write the character-level lexer yourself and iterate over a String or char array rather than a stream. Then you can rewind.
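A hand-written lexer for a language this small can be quite short. The sketch below is one possible shape, not the only one: the class name TurtleLexer and the token rules (names, unsigned integers, and single-character punctuation) are my assumptions based on the sample program. Because the result is an ordinary list, a parser can keep an index into it and step backwards whenever it needs to.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical character-level lexer that walks a String by index.
class TurtleLexer {
    // Splits source text into names, integer literals, and single-character symbols.
    static List<String> tokenize(String src) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (Character.isWhitespace(c)) {
                // Blanks and newlines are skipped here; emit a token for '\n'
                // instead if the parser needs to see statement boundaries.
                i++;
            } else if (Character.isLetter(c)) {
                int start = i;
                while (i < src.length() && Character.isLetterOrDigit(src.charAt(i))) i++;
                tokens.add(src.substring(start, i)); // a name such as "start" or "x"
            } else if (Character.isDigit(c)) {
                int start = i;
                while (i < src.length() && Character.isDigit(src.charAt(i))) i++;
                tokens.add(src.substring(start, i)); // an integer such as "50"
            } else {
                tokens.add(String.valueOf(c));       // punctuation: ( ) , * -
                i++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("size(50*50)"));       // [size, (, 50, *, 50, )]
        System.out.println(tokenize("start(10, 20, -x)")); // [start, (, 10, ,, 20, ,, -, x, )]
    }
}
```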
Otherwise, use a ready-built lexer/parser framework like yacc or Coco/R.