Let’s say I want to write a parser for a programming language (the EBNF is already known) with as little fuss as possible. I also want to support identifiers made of any letters encoded in UTF-8. And I want it in C++.
As far as I can tell, flex/bison have no UTF-8 support, and ANTLR does not seem to have working C++ output.
I’ve considered boost::spirit, but its site states that it is not actually meant for a full parser.
What else is left? Writing it entirely by hand?
If you can’t find a tool with the support you want, remember that flex is mostly independent of the encoding. It lexes an octet stream, and I’ve used it to lex pure binary data. Something encoded in UTF-8 is an octet stream and can be handled by flex if you are willing to do some of the work manually.
For example, if you want to accept as a letter everything in the Latin-1 Supplement range except NBSP (in other words, U+00A1 to U+00FF), you cannot write a single class in a hypothetical notation such as [\u00A1-\u00FF]; you have to spell out the UTF-8 byte sequences, something like \xC2[\xA1-\xBF]|\xC3[\x80-\xBF] (check the byte values against the UTF-8 encoding rules, but you get the idea).
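One way to check the byte values for such a range is to encode its endpoints and look at the result. The following is a minimal C++ sketch, not part of flex; the helper names encode_utf8 and to_flex_escapes are made up for illustration.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Encode one Unicode code point as a UTF-8 byte sequence.
// Hypothetical helper: surrogates and values above U+10FFFF are rejected.
std::string encode_utf8(char32_t cp) {
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::string out;
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx followed by three 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Render raw bytes as \xHH escapes, the form used in a flex pattern.
std::string to_flex_escapes(const std::string& bytes) {
    std::string out;
    char buf[8];
    for (unsigned char b : bytes) {
        std::snprintf(buf, sizeof buf, "\\x%02X", static_cast<unsigned>(b));
        out += buf;
    }
    return out;
}
```

Encoding U+00A1, U+00BF, U+00C0 and U+00FF this way gives \xC2\xA1, \xC2\xBF, \xC3\x80 and \xC3\xBF, so the range splits at the lead byte into \xC2[\xA1-\xBF] and \xC3[\x80-\xBF].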
You could even write a preprocessor that does most of the work for you, i.e. one that replaces \u00A1 with \xC2\xA1 and [\u00A1-\u00FF] with \xC2[\xA1-\xBF]|\xC3[\x80-\xBF]. How much work the preprocessor takes depends on how generic you want your input to be; at some point you would probably be better off integrating the work into flex and contributing it upstream.
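The core step of such a preprocessor could look roughly like the sketch below. It is only an illustration under a stated assumption: it handles ranges that lie entirely in the two-byte region (U+0080 to U+07FF), and the function names two_byte_range_to_flex and hex_escape are invented here. A general version would also have to split ranges at the one-, three- and four-byte boundaries.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Format one byte as a \xHH escape.
static std::string hex_escape(unsigned byte) {
    char buf[8];
    std::snprintf(buf, sizeof buf, "\\x%02X", byte);
    return buf;
}

// Turn the code point range [lo, hi] into a flex pattern over UTF-8 bytes.
// Assumption: both ends lie in U+0080..U+07FF (two-byte sequences only).
std::string two_byte_range_to_flex(char32_t lo, char32_t hi) {
    assert(lo >= 0x80 && lo <= hi && hi <= 0x7FF);
    const unsigned first_lead = 0xC0 | (lo >> 6);
    const unsigned last_lead  = 0xC0 | (hi >> 6);
    std::string out;
    for (unsigned lead = first_lead; lead <= last_lead; ++lead) {
        // Only the first and last lead bytes restrict the trailing byte;
        // every lead byte in between accepts the full 0x80..0xBF range.
        const unsigned t_lo = (lead == first_lead) ? (0x80 | (lo & 0x3F)) : 0x80;
        const unsigned t_hi = (lead == last_lead)  ? (0x80 | (hi & 0x3F)) : 0xBF;
        if (!out.empty()) out += '|';
        out += hex_escape(lead);
        if (t_lo == t_hi) {
            out += hex_escape(t_lo);  // single trailing byte, no class needed
        } else {
            out += '[' + hex_escape(t_lo) + '-' + hex_escape(t_hi) + ']';
        }
    }
    return out;
}
```

Calling two_byte_range_to_flex(0xA1, 0xFF) yields \xC2[\xA1-\xBF]|\xC3[\x80-\xBF], which can be pasted into a flex definition. The sketch emits one alternative per lead byte and does not merge adjacent lead bytes, so wide ranges produce long but still valid patterns.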