I’m looking at implementing a C preprocessor in two phases, where the first phase converts the source file into an array of preprocessing tokens. This would be good for simplicity and performance, as the work of tokenizing would not need to be redone when a header file is included by multiple files in a project.
The snag:
#include <stdio.h>
#define f(x) #x
int main(void) {
    puts(f(a+b));
    puts(f(a + b));
    return 0;
}
According to the standard, the output should be:
a+b
a + b
i.e. the information about whether constituent tokens were separated by whitespace is supposed to be preserved. This would require the two-phase design to be scrapped.
The uses of the # operator that I’ve seen so far don’t actually need this, e.g. assert would still work fine if the output were always a + b regardless of whether the constituent tokens were separated by whitespace in the source file.
Is there any existing code anywhere that does depend on the exact behavior prescribed by the standard for this operator?
You might want to look at the preprocessor of the LCC compiler, which was written as an example ANSI C compiler for compiler courses. Another preprocessor worth studying is MCPP.
C/C++ preprocessing is quite tricky. If you stick with it, make sure to get at least drafts of the relevant standards, and pilfer test suites from somewhere.
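As for the whitespace problem itself, it may not force you to drop the token array. The standard only asks that each run of whitespace between the argument's tokens become a single space in the resulting string, and that leading and trailing whitespace be deleted, so one flag per token could be enough. Below is a minimal sketch under that assumption. The names pp_token, space_before and pp_stringify are made up for illustration and are not taken from LCC or MCPP, and the escaping of " and \ inside string literals and character constants is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token record; the field names are illustrative only. */
struct pp_token {
    const char *text;   /* spelling of the token */
    bool space_before;  /* true if whitespace preceded it in the source */
};

/* Build the body of the string that # would produce for a macro argument.
   Writes at most cap-1 characters plus a terminator into out and returns
   the number of characters written. Escaping of " and \ is deliberately
   omitted from this sketch. */
size_t pp_stringify(const struct pp_token *toks, size_t n,
                    char *out, size_t cap)
{
    size_t len = 0;
    assert(out != NULL && cap > 0);
    for (size_t i = 0; i < n; i++) {
        /* Any run of whitespace between two tokens collapses to one space;
           whitespace before the first token is dropped. */
        if (i > 0 && toks[i].space_before && len + 1 < cap)
            out[len++] = ' ';
        size_t tlen = strlen(toks[i].text);
        if (tlen > cap - 1 - len)
            tlen = cap - 1 - len;   /* truncate rather than overflow */
        memcpy(out + len, toks[i].text, tlen);
        len += tlen;
    }
    out[len] = '\0';
    return len;
}
```

With the flag clear on every token, the tokens a, +, b come out as a+b. With the flag set on + and b they come out as a + b. The tokenizer still runs only once per file, and the cost is one extra bit per token.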