I want to split a C file into tokens, not for compiling but for analyzing. I feel like this should be pretty straightforward, and I tried looking online for a predefined tokens.l (or similar) file for flex with the whole C grammar already defined, but couldn’t find anything. Are there any such grammars floating around, or am I going about this all wrong?
Yes, there’s at least one around.
Edit:
Since there are a few issues that one doesn’t handle, perhaps it’s worth looking at some hand-written lexing code I wrote several years ago. This basically only handles phases 1, 2 and 3 of translation. If you define DIGRAPH, it also turns on some code to translate C++ digraphs. If memory serves, however, it does that earlier in translation than it really should, but you probably don’t want it in any case. Note that this does not even attempt to recognize anywhere close to all tokens; mostly it separates the source into comments, character literals, string literals, and pretty much everything else. OTOH, it does handle trigraphs, line splicing, etc.
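To make phases 1 and 2 concrete, here is a minimal sketch of trigraph replacement followed by line splicing on an in-memory string. It is an illustration only, not the code described above: the names `trigraph_replacement` and `translate_phases_1_2` are hypothetical, and it assumes the whole source fits in a NUL-terminated buffer with line endings already converted to `'\n'`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Map the third character of a trigraph to its replacement, or 0 if none. */
static char trigraph_replacement(char c)
{
    switch (c) {
    case '=':  return '#';
    case '(':  return '[';
    case '/':  return '\\';
    case ')':  return ']';
    case '\'': return '^';
    case '<':  return '{';
    case '!':  return '|';
    case '>':  return '}';
    case '-':  return '~';
    default:   return 0;
    }
}

/*
 * Apply translation phases 1 and 2 to src, writing the result to dst.
 * dst must have room for at least strlen(src) + 1 bytes, since the
 * output is never longer than the input. Returns the output length.
 */
size_t translate_phases_1_2(const char *src, char *dst)
{
    size_t out = 0;

    assert(src != NULL && dst != NULL);

    while (*src != '\0') {
        char c = *src;
        size_t used = 1;

        /* Phase 1: two question marks plus a valid tail become one char. */
        if (c == '?' && src[1] == '?') {
            char r = trigraph_replacement(src[2]);
            if (r != 0) {
                c = r;
                used = 3;
            }
        }
        src += used;

        /* Phase 2: a backslash immediately followed by a newline is
         * deleted, joining the two physical lines. This runs after
         * phase 1, so a backslash produced by a trigraph also splices. */
        if (c == '\\' && *src == '\n') {
            src++;
            continue;
        }
        dst[out++] = c;
    }
    dst[out] = '\0';
    return out;
}
```

Doing the trigraph check first matters: a trigraph that yields a backslash right before a newline has to splice the line, just like a literal backslash would. Phase 3 (splitting into comments, literals and everything else) would then run over the output buffer.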
I suppose I should also add that this leaves conversion of the platform’s line-ending character into a new-line to the underlying implementation by opening the file in translated (text) mode. Under most circumstances, that’s probably the right thing to do, but if you want to produce something like a cross-compiler where your source files have a different line-ending sequence than is normal for this host, you might have to change that.
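If you do need to handle foreign line endings yourself, one option is to read the file in binary mode and normalize before lexing. The following is a rough sketch under assumptions that are mine, not part of the code discussed here: the file was read with something like `fopen(path, "rb")` into a buffer, the source uses CR LF or lone CR line endings, and `normalize_newlines` is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Convert CR LF pairs and lone CR characters to a single LF, in place.
 * buf holds len bytes read in binary mode (no terminating NUL needed).
 * Returns the new length, which is never greater than len.
 */
size_t normalize_newlines(char *buf, size_t len)
{
    size_t in = 0, out = 0;

    assert(buf != NULL);

    while (in < len) {
        char c = buf[in++];
        if (c == '\r') {
            /* Swallow the LF of a CR LF pair so it yields one newline. */
            if (in < len && buf[in] == '\n')
                in++;
            c = '\n';
        }
        buf[out++] = c;
    }
    return out;
}
```

Because the write index never gets ahead of the read index, the conversion is safe to do in the same buffer.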
First the header that defines the external interface to all this stuff:
And then the implementation of all that:
I’m not sure how easy or difficult it would be to integrate that into a Flex-based lexer, though. I seem to recall Flex has some sort of hook to define what it uses to read a character, but I’ve never tried to use it, so I can’t say much more about it (and ultimately, can’t even say with anything approaching certainty that it even exists).
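For what it’s worth, the Flex manual does document such a hook: the `YY_INPUT(buf, result, max_size)` macro, which a scanner can redefine in the `%{ ... %}` part of its definitions section. Below is an untested-in-Flex sketch of how a character-at-a-time reader might be plugged in. `demo_source` and `demo_next_char` are hypothetical stand-ins for whatever the hand-written phase 1 and 2 reader exposes, and the `YY_NULL` guard is only there so the fragment compiles on its own.

```c
#include <stdio.h>   /* for EOF */

/* Flex defines YY_NULL itself; this guard only lets the sketch build alone. */
#ifndef YY_NULL
#define YY_NULL 0
#endif

/* Hypothetical stand-in for the reader that does trigraphs and splicing. */
static const char *demo_source = "int x;";

static int demo_next_char(void)
{
    return (*demo_source != '\0') ? (unsigned char)*demo_source++ : EOF;
}

/*
 * Hand Flex one already-translated character per refill: store it in
 * buf[0] and set result to 1, or set result to YY_NULL at end of input.
 * max_size is ignored because only a single character is ever supplied.
 */
#define YY_INPUT(buf, result, max_size) \
    do { \
        int ch_ = demo_next_char(); \
        (result) = (ch_ == EOF) ? YY_NULL : ((buf)[0] = (char)ch_, 1); \
    } while (0)
```

Feeding one character per call is the simplest thing that works, but it is slower than filling up to `max_size` characters at once, so treat this as a starting point rather than a finished integration.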