I’m trying to make a Bison parser to handle UTF-8 characters. I don’t want the parser to actually interpret the Unicode character values, but I want it to parse the UTF-8 string as a sequence of bytes.
Right now, Bison generates the following code which is problematic:
    if (yychar <= YYEOF)
      {
        yychar = yytoken = YYEOF;
        YYDPRINTF ((stderr, "Now at end of input.\n"));
      }
The problem is that many bytes of a UTF-8 string have a negative value when they are read as signed char. Bison interprets any negative value as EOF and stops.
Is there a way around this?
bison yes, flex no. The one time I needed a bison parser to work with UTF-8 encoded files, I ended up writing my own yylex function.
Edit: To help, I used a lot of the Unicode operations available in glib (there's a gunicode type and some file/string manipulation functions that I found useful).
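
For what it's worth, here is a minimal sketch of what a hand-written yylex for byte-wise parsing could look like. It is not the lexer I actually used, and it does not use glib. The names input_cursor and set_lexer_input are made up for illustration, and it assumes the whole input is a NUL-terminated string already in memory. The one important detail is the cast to unsigned char, so that bytes at or above 0x80 reach Bison as positive token codes (128 to 255) instead of negative values that get treated as end of input.

```c
/* Hypothetical input state: a real parser would more likely read from a FILE. */
static const char *input_cursor = "";

/* Hypothetical helper: point the lexer at a NUL-terminated UTF-8 string. */
static void set_lexer_input(const char *utf8_text)
{
    input_cursor = utf8_text;
}

/* Sketch of a byte-wise lexer: each byte becomes one token code in 1..255. */
int yylex(void)
{
    /* Cast first, so a byte such as 0xC3 is 195 and not a negative number. */
    unsigned char byte = (unsigned char)*input_cursor;

    if (byte == 0)
        return 0;   /* Bison treats 0 (or anything negative) as end of input. */

    input_cursor++;
    return byte;    /* Always positive, even where plain char is signed. */
}
```

With this, the input "\xC3\xA9" (the UTF-8 encoding of é) comes out as the two tokens 0xC3 and 0xA9 followed by 0. The grammar then has to accept those byte values as tokens wherever arbitrary UTF-8 text is allowed.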