It seems that flex doesn’t support UTF-8 input. Whenever the scanner encounters a non-ASCII character, it stops scanning as if it had reached EOF.
Is there a way to force flex to eat my UTF-8 chars? I don’t want it to actually match UTF-8 chars, just eat them when using the ‘.’ pattern.
Any suggestions?
EDIT
The simplest solution would be:
ANY [\x00-\xff]
and use ‘ANY’ instead of ‘.’ in my rules.
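For anyone wanting to try this, here is a minimal sketch of a complete scanner built around that definition. Only ANY comes from the line above; the identifier rule, the whitespace rule, and the `8bit` option are assumptions added so the file stands alone.

```lex
%option noyywrap 8bit
%{
#include <stdio.h>
%}

ANY     [\x00-\xff]

%%
[A-Za-z_][A-Za-z0-9_]*   { printf("IDENT %s\n", yytext); }
[ \t\n]+                 { /* skip whitespace */ }
{ANY}                    { /* swallow any other byte, including each byte of a UTF-8 sequence */ }
%%

int main(void)
{
    yylex();
    return 0;
}
```

Note that, unlike ‘.’, this class also matches a newline. If that matters, a class such as [\x00-\x09\x0b-\xff] leaves the newline out.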
I have been looking into this myself and reading the Flex mailing list to see if anyone has thought about it. Getting Flex to read Unicode is a complex affair …
UTF-8 can be handled, but most other encodings (the 16-bit ones) will lead to massive tables driving the automaton.
A common method so far is:
Taken from the mailing list.
I may look at creating a proper patch for UTF-8 support after looking into it further. The above solution seems unmaintainable for large .l files, and it is really ugly! You could use similar ranges to create a ‘.’ substitute rule that matches all ASCII and UTF-8 characters, but that is still rather ugly.
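To make the range idea concrete, here is a rough sketch of a ‘.’ substitute built from UTF-8 byte classes. This is not the mailing-list snippet, just one way to write it. All the names (ASCN, U, U2, U3, U4, UANYN) are made up for illustration, and the ranges are the loose structural ones: they do not reject overlong forms, surrogates, or code points above U+10FFFF.

```lex
%option noyywrap 8bit
%{
#include <stdio.h>
/* ASCN: ASCII except newline. U: continuation byte.
   U2, U3, U4: lead bytes of 2-, 3- and 4-byte sequences. */
%}

ASCN    [\x00-\x09\x0b-\x7f]
U       [\x80-\xbf]
U2      [\xc2-\xdf]
U3      [\xe0-\xef]
U4      [\xf0-\xf4]
UANYN   {ASCN}|{U2}{U}|{U3}{U}{U}|{U4}{U}{U}{U}

%%
{UANYN}       { printf("%d-byte character\n", (int)yyleng); }
\n            { /* newline is kept out of UANYN, as it is out of '.' */ }
[\x80-\xff]   { fprintf(stderr, "stray byte 0x%02x\n", (unsigned char)yytext[0]); }
%%

int main(void)
{
    yylex();
    return 0;
}
```

Something like `flex scanner.l && cc lex.yy.c -o scanner` should build it. The last rule only fires on bytes that do not start a well-formed sequence, since flex prefers the longest match.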
Hope this helps!