I am writing my first parser and have a few questions concerning the tokenizer.

Asked: May 15, 2026

I am writing my first parser and have a few questions concerning the tokenizer.

Basically, my tokenizer exposes a nextToken() function that is supposed to return the next token. These tokens are distinguished by a token-type. I think it would make sense to have the following token-types:

SYMBOL (such as <, :=, ( and the like)
WHITESPACE (tabs, newlines, spaces…)
REMARK (a comment between /* … */ or after // through the end of the line)
NUMBER
IDENT (such as the name of a function or a variable)
STRING (something enclosed in double quotes, “…”)

Now, do you think this makes sense?

Also, I am struggling with the NUMBER token-type. Do you think it makes more sense to split it up further into a NUMBER and a FLOAT token-type? Without a FLOAT token-type, I’d receive a NUMBER (e.g. 402), a SYMBOL (.) and then another NUMBER (e.g. 203) whenever the input contains a float.

Finally, what do you think makes more sense for the tokenizer to return when it encounters a -909? Should it return the SYMBOL - first, followed by the NUMBER 909 or should it return a NUMBER -909 right away?


1 Answer

Editorial Team · Answer 1 · 2026-05-15T10:22:56+00:00

It depends upon your target language.

The point of a lexer is to return tokens that make it easy to write a parser for your language. Suppose your lexer returns NUMBER whenever it sees input that matches “[0-9]+”. For a non-integer number such as “3.1415926”, it will then return NUMBER . NUMBER. You could reassemble that in your parser, but if your lexer is doing its job of skipping whitespace and comments (since they aren’t relevant to your parser), the parser can no longer tell what sat between the tokens, and you could end up incorrectly parsing things like “123 /* comment */ . \n /* other comment */ 456” as a floating-point number. Recognizing floats in the lexer, as their own token type, avoids this.
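
To make that concrete, here is a minimal sketch in Python (the question does not name a language, so that choice is an assumption). The token names follow the list in the question, FLOAT is the extra type under discussion, and the exact patterns, the `tokenize` function and the set of symbols are made up for illustration. The one thing that matters is that FLOAT is tried before NUMBER and that nothing may sit between the digits and the dot.

```python
import re

# Hypothetical token table. Order matters: REMARK comes before SYMBOL so
# that "/*" starts a comment, and FLOAT comes before NUMBER so that "3.14"
# is not split into NUMBER "." NUMBER.
TOKEN_SPEC = [
    ("REMARK",     r"/\*.*?\*/|//[^\n]*"),
    ("WHITESPACE", r"\s+"),
    ("FLOAT",      r"[0-9]+\.[0-9]+"),
    ("NUMBER",     r"[0-9]+"),
    ("IDENT",      r"[A-Za-z_][A-Za-z0-9_]*"),
    ("STRING",     r'"[^"\n]*"'),
    ("SYMBOL",     r":=|[-+*/<>=().;,]"),
]
MASTER = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC),
    re.DOTALL,  # lets /* ... */ comments span several lines
)

# Token types the parser never needs to see.
SKIPPED = {"WHITESPACE", "REMARK"}


def tokenize(text):
    """Return a list of (token_type, lexeme) pairs, skipping blanks and comments."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at offset {pos}")
        if match.lastgroup not in SKIPPED:
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


print(tokenize("3.1415926"))
print(tokenize("123 /* comment */ . \n /* other comment */ 456"))
```

With this table, “3.1415926” comes out as a single FLOAT token, while the comment-riddled input stays NUMBER, SYMBOL, NUMBER, so the parser has no chance to mistake it for a float.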

As for lexing “-[0-9]+” as NUMBER versus MINUS NUMBER: again, that depends upon your target language, but I would usually go with MINUS NUMBER. Otherwise you would end up lexing “A = 1-2-3-4” as IDENT = NUMBER NUMBER NUMBER NUMBER instead of IDENT = NUMBER MINUS NUMBER MINUS NUMBER MINUS NUMBER.
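
A rough sketch of the MINUS NUMBER approach, again in Python and under stated assumptions: the tiny token set, `tokenize` and `parse_expression` are hypothetical and only cover integers, identifiers, “=” and “-”. The lexer always emits the minus sign as its own token, and the parser decides from position whether it is a negation or a subtraction.

```python
import re

# Hypothetical, deliberately tiny token set: the minus sign is always
# its own MINUS token and never part of a NUMBER.
TOKEN_RE = re.compile(
    r"\s*(?:(?P<NUMBER>[0-9]+)|(?P<IDENT>[A-Za-z_]\w*)|(?P<MINUS>-)|(?P<SYMBOL>[=()+]))"
)


def tokenize(text):
    """Return (token_type, lexeme) pairs; whitespace is skipped."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected input at offset {pos}")
        tokens.append((m.lastgroup, m.group(m.lastgroup)))
        pos = m.end()
    return tokens


def parse_expression(tokens):
    """Evaluate: expression := unary (MINUS unary)* ; unary := MINUS unary | NUMBER."""
    pos = 0

    def unary():
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of input")
        kind, lexeme = tokens[pos]
        pos += 1
        if kind == "MINUS":      # a minus where an operand is expected: negation
            return -unary()
        if kind != "NUMBER":
            raise SyntaxError(f"expected a number, got {lexeme!r}")
        return int(lexeme)

    value = unary()
    while pos < len(tokens) and tokens[pos][0] == "MINUS":
        pos += 1                 # a minus after an operand: subtraction
        value -= unary()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos][1]!r}")
    return value


print(tokenize("A = 1-2-3-4"))
print(parse_expression(tokenize("1-2-3-4")))  # -8
print(parse_expression(tokenize("-909")))     # -909
```

The lexer stays context-free this way, and “-909” still evaluates to -909 because the grammar, not the tokenizer, attaches the sign.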

While we’re on the topic, I’d strongly recommend the book Language Implementation Patterns, by Terence Parr, the author of ANTLR.
