Just for my own purposes, I’m trying to build a tokenizer in Java where I can define a regular grammar and have it tokenize input based on that. The StringTokenizer class is a legacy class whose use is discouraged in new code, and I’ve found a couple of methods in Scanner that hint at what I want to do, but no luck yet. Does anyone know a good way of going about this?
The name ‘Scanner’ is a bit misleading, because the word is often used to mean a lexical analyzer, and that’s not what Scanner is for. All it is is a substitute for the `scanf()` function you find in C, Perl, et al. Like StringTokenizer and `split()`, it’s designed to scan ahead until it finds a match for a given pattern, and whatever it skipped over on the way is returned as a token.

A lexical analyzer, on the other hand, has to examine and classify every character, even if it’s only to decide whether it can safely ignore them. That means, after each match, it may apply several patterns until it finds one that matches starting at that point. Otherwise, it may find the sequence ‘//’ and think it’s found the beginning of a comment, when it’s really inside a string literal and it just failed to notice the opening quotation mark.
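To make the ‘//’ pitfall concrete, here is a small sketch of a blind forward search, which is roughly how `split()` or a Scanner delimiter hunts for its pattern. The class name `NaiveCommentSearch`, the helper `firstSlashSlash`, and the sample input are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NaiveCommentSearch {
    // Hypothetical helper: index of the first "//" found by scanning ahead,
    // with no knowledge of what kind of text surrounds it. Returns -1 if absent.
    static int firstSlashSlash(String source) {
        Matcher m = Pattern.compile("//").matcher(source);
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) {
        String line = "String url = \"http://example.com\"; // a real comment";
        int naive = firstSlashSlash(line);
        int real = line.lastIndexOf("//");
        // The blind search stops inside the string literal, well before the comment.
        System.out.println("naive search stops at index " + naive);
        System.out.println("the comment really starts at index " + real);
    }
}
```

The search has no idea it is inside a quoted string, because it never classified the characters it skipped over. A lexer avoids this by matching the string literal as a unit before it ever gets the chance to see the ‘//’ inside it.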
It’s actually much more complicated than that, of course, but I’m just illustrating why the built-in tools like StringTokenizer, `split()` and Scanner aren’t suitable for this kind of task. It is, however, possible to use Java’s regex classes for a limited form of lexical analysis. In fact, the addition of the Scanner class made it much easier, because of the new Matcher API that was added to support it, i.e., regions and the `usePattern()` method.

Here’s an example of a rudimentary scanner built on top of Java’s regex classes. This, by the way, is the only good use I’ve ever found for the `lookingAt()` method. 😀
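As a rough sketch of the kind of scanner described above (not the answer’s original listing), the following uses `Matcher.region()`, `usePattern()` and `lookingAt()` to try several patterns at the current position. The class `RegexLexer`, the `Kind` and `Token` types, and the token patterns themselves are assumptions chosen for illustration, not a complete grammar for any real language.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexLexer {
    // Hypothetical token kinds, tried in declaration order at each position.
    // STRING and COMMENT come before SYMBOL so "//" is never read as two slashes.
    enum Kind {
        WHITESPACE("\\s+"),
        COMMENT("//[^\\n]*"),
        STRING("\"(?:[^\"\\\\\\n]|\\\\.)*\""),
        NUMBER("\\d+"),
        IDENTIFIER("[A-Za-z_]\\w*"),
        SYMBOL("[-+*/=;(){}]");

        final Pattern pattern;

        Kind(String regex) {
            this.pattern = Pattern.compile(regex);
        }
    }

    static final class Token {
        final Kind kind;
        final String text;

        Token(Kind kind, String text) {
            this.kind = kind;
            this.text = text;
        }

        @Override
        public String toString() {
            return kind + "(" + text + ")";
        }
    }

    static List<Token> tokenize(String input) {
        List<Token> tokens = new ArrayList<>();
        Matcher m = Kind.WHITESPACE.pattern.matcher(input);
        int pos = 0;
        while (pos < input.length()) {
            // Restrict matching to the unread part of the input.
            m.region(pos, input.length());
            Kind matched = null;
            for (Kind kind : Kind.values()) {
                // usePattern() swaps the pattern but keeps the region;
                // lookingAt() only succeeds if the match starts at the region start.
                if (m.usePattern(kind.pattern).lookingAt()) {
                    matched = kind;
                    break;
                }
            }
            if (matched == null) {
                throw new IllegalArgumentException(
                        "Unexpected character '" + input.charAt(pos) + "' at index " + pos);
            }
            // Whitespace and comments are classified, then dropped.
            if (matched != Kind.WHITESPACE && matched != Kind.COMMENT) {
                tokens.add(new Token(matched, m.group()));
            }
            pos = m.end(); // every pattern consumes at least one character
        }
        return tokens;
    }

    public static void main(String[] args) {
        String source = "url = \"http://example.com\"; // trailing comment";
        for (Token t : tokenize(source)) {
            System.out.println(t);
        }
    }
}
```

Because every pattern is anchored to the current position by `lookingAt()`, no character is skipped without being classified: the ‘//’ inside the string literal is consumed as part of the STRING token, and only the trailing one is treated as a comment. Trying the patterns one by one is simple but slow for large grammars; a dedicated lexer generator is the usual choice once the token set grows.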