I’m trying to create a parser (with parsec) that parses tokens delimited by newlines, commas, semicolons and Unicode dashes (ndash and mdash):
    authorParser = do
        name <- many1 (noneOf [',', ':', '\r', '\n', '\8212', '\8213'])
        many (char ',' <|> char ':' <|> char '-' <|> char '\8212' <|> char '\8213')
But the ndash-mdash (\8212, \8213) part never ‘succeeds’ and I’m getting invalid parse results.
How do I specify Unicode dashes with the char parser?
P.S. I’ve tried (chr 8212) and (chr 8213) too. It doesn’t help.
ADDITION: It is better to use Data.Text. Switching from the ByteString madness to Data.Text saved me a lot of time and ‘source space’ 🙂
Works for me:
How did you try?
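A minimal sketch of such a check on String input, not necessarily the exact code that was run: the type signature, the explicit return and the sample input are assumptions added for illustration. Note that '\8212' is U+2014 (em dash) and '\8213' is U+2015 (horizontal bar); an en dash would be '\8211'.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Parse one name, then skip any run of trailing separators.
-- '\8212' is U+2014 (em dash), '\8213' is U+2015 (horizontal bar).
authorParser :: Parser String
authorParser = do
    name <- many1 (noneOf [',', ':', '\r', '\n', '\8212', '\8213'])
    _ <- many (char ',' <|> char ':' <|> char '-' <|> char '\8212' <|> char '\8213')
    return name

main :: IO ()
main = print (parse authorParser "" "Jane Doe\8212John Smith")
```

With a String input each Char is a whole code point, so the parser stops at the dash and this should yield the name "Jane Doe".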
The above was using plain String, which works without problems because a Char is a full Unicode code point. It’s not as nice with other types of input stream. Text will probably also work well for this example, since it also hands whole Chars to the parser.

For ByteString, however, things are more complicated. If you’re using plain Data.ByteString.Char8 (strict or lazy, doesn’t matter), the Chars get truncated on packing: only the least significant 8 bits are retained, so '\8212' becomes 20 and '\8213' becomes 21. If the input stream is constructed the same way, that still kind of works, except that all Chars congruent to 20 or 21 modulo 256 are mapped to the same byte as one of the dashes.

However, it is likely that the input stream is UTF-8 encoded. Then the dashes are encoded as three bytes each, "\226\128\148" and "\226\128\149" respectively, which doesn’t match what you get by truncating. Parsing UTF-8 encoded text with ByteString and parsec is a bit more involved, because the units of which the parse result is composed are not single bytes but sequences of bytes, 1-4 in length each.

To use noneOf, you need a Stream instance for ByteString which does the right thing. The instance provided in Text.Parsec.ByteString[.Lazy] doesn’t: it uses the Data.ByteString[.Lazy].Char8 interface. Depending on how the input was encoded, the dash '\8212' would therefore either become a single '\20', which does not match '\8212', or produce three Chars, '\226', '\128' and '\148', in three successive calls to uncons, none of which matches '\8212' either.
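The byte values mentioned above can be checked directly, and one way to sidestep the problem is to decode the bytes to Text before parsing. The following is a rough sketch under that assumption: it presumes the input really is UTF-8, and the names truncatedBytes, utf8Bytes and parseAuthor are hypothetical helpers introduced only for this illustration.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as B8
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import Data.Word (Word8)
import Text.Parsec
import Text.Parsec.Text (Parser)

-- Packing through Char8 keeps only the low 8 bits of each Char.
truncatedBytes :: [Word8]
truncatedBytes = B.unpack (B8.pack "\8212\8213")

-- UTF-8 needs three bytes for the dash '\8212'.
utf8Bytes :: [Word8]
utf8Bytes = B.unpack (TE.encodeUtf8 (T.pack "\8212"))

-- Same parser as in the question, but running over Text.
authorParser :: Parser String
authorParser = do
    name <- many1 (noneOf [',', ':', '\r', '\n', '\8212', '\8213'])
    _ <- many (oneOf [',', ':', '-', '\8212', '\8213'])
    return name

-- Decode the raw bytes as UTF-8 first, then parse the resulting Text.
-- Both decoding errors and parse errors are reported as a String.
parseAuthor :: B.ByteString -> Either String String
parseAuthor raw = case TE.decodeUtf8' raw of
    Left err  -> Left (show err)
    Right txt -> either (Left . show) Right (parse authorParser "" txt)

main :: IO ()
main = do
    print truncatedBytes
    print utf8Bytes
    print (parseAuthor (TE.encodeUtf8 (T.pack "Jane Doe\8212John Smith")))
```

Decoding up front means the parser only ever sees whole Chars, so noneOf and char compare against '\8212' as intended. Writing a custom Stream instance over raw UTF-8 bytes is the alternative, but it is more work.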