I’m trying to write a parser using Parsec that will parse literate Haskell files, such as the following:
The classic 'Hello, world' program.
\begin{code}
main = putStrLn "Hello, world"
\end{code}
More text.
I’ve written the following, sort-of-inspired by the examples in RWH:

import Text.ParserCombinators.Parsec

main
    = do contents <- readFile "hello.lhs"
         let results = parseLiterate contents
         print results

data Element
    = Text String
    | Haskell String
    deriving (Show)

parseLiterate :: String -> Either ParseError [Element]
parseLiterate input
    = parse literateFile "(unknown)" input

literateFile
    = many codeOrProse

codeOrProse
    = code <|> prose

code
    = do eol
         string "\\begin{code}"
         eol
         content <- many anyChar
         eol
         string "\\end{code}"
         eol
         return $ Haskell content

prose
    = do content <- many anyChar
         return $ Text content

eol
    =   try (string "\n\r")
    <|> try (string "\r\n")
    <|> string "\n"
    <|> string "\r"
    <?> "end of line"

Which I hoped would result in something along the lines of:
[Text "The classic 'Hello, world' program.", Haskell "main = putStrLn \"Hello, world\"", Text "More text."]
(allowing for whitespace etc).
This compiles fine, but when run, I get the error:
*** Exception: Text.ParserCombinators.Parsec.Prim.many: combinator 'many' is applied to a parser that accepts an empty string
Can anyone shed any light on this, and possibly help with a solution please?
As sth pointed out, `many anyChar` is the problem, and not just in `prose` but also in `code`. The problem with `code` is that `content <- many anyChar` will consume everything: the newlines and the `\end{code}` tag.
So you need some way to tell the prose and the code apart. An easy (but maybe too naive) way to do so is to look for backslashes, i.e. to read the content with `noneOf "\\"` instead of `anyChar`.
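
The backslash approach could be sketched roughly as follows. This is a minimal reconstruction rather than the answer's original code: dropping the `eol` calls, using `many1` in `prose`, the `try` around the opening tag and the final `eof` are my assumptions. `many1` matters because a `prose` that can succeed without consuming input is what makes the outer `many` raise the exception from the question.

```haskell
import Text.ParserCombinators.Parsec

data Element
    = Text String
    | Haskell String
    deriving (Show, Eq)

-- A code block: everything between the two tags. Naive assumption:
-- the Haskell code itself never contains a backslash.
code :: Parser Element
code = do
    _ <- try (string "\\begin{code}")
    content <- many (noneOf "\\")
    _ <- string "\\end{code}"
    return (Haskell content)

-- Prose: one or more non-backslash characters. Using many1 instead of
-- many means prose cannot succeed on empty input, so the outer 'many'
-- is no longer applied to a parser that accepts the empty string.
prose :: Parser Element
prose = Text <$> many1 (noneOf "\\")

-- eof is an addition here so that unparsed leftovers become an error.
literateFile :: Parser [Element]
literateFile = many (code <|> prose) <* eof

parseLiterate :: String -> Either ParseError [Element]
parseLiterate = parse literateFile "(unknown)"

main :: IO ()
main = do
    contents <- readFile "hello.lhs"
    print (parseLiterate contents)
```

Because the tags are matched without their surrounding line breaks, the newlines around each tag end up inside the neighbouring `Text` and `Haskell` values.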
Now, you don’t completely have the desired result, because the `Haskell` part will also contain newlines, but you can filter these out quite easily (given a function `filterNewlines`, you could say `content <- filterNewlines <$> (many $ noneOf "\\")`).
Edit
Okay, I think I found a solution (requires the newest Parsec version, because of `lookAhead`). It is built around a helper, `untilP`.
`untilP p` parses a line, then checks whether the beginning of the next line can be successfully parsed by `p`. If so, it returns the empty string; otherwise it goes on. The `lookAhead` is needed because otherwise the `\begin`/`\end` tags would be consumed and `code` couldn’t recognize them.
I guess it could still be made more concise (i.e. not having to repeat `string "\\end{code}\n"` inside `code`).
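
The `untilP` idea described above might look something like the sketch below. It is a reconstruction under assumptions, not the original listing: it assumes every line (including the last) ends in `"\n"`, that each code block holds at least one line, and it keeps a newline between the lines it joins, which is my choice rather than something the description specifies.

```haskell
import Text.ParserCombinators.Parsec

data Element
    = Text String
    | Haskell String
    deriving (Show, Eq)

-- Read one line (without its newline), then peek at the start of the
-- next line. If p matches there, stop and leave that input untouched;
-- otherwise keep reading lines, joining them with '\n'.
untilP :: Parser a -> Parser String
untilP p = do
    line <- many (noneOf "\n")
    _ <- newline
    rest <- try (lookAhead p >> return "") <|> (('\n' :) <$> untilP p)
    return (line ++ rest)

-- Assumes the block contains at least one line of code.
code :: Parser Element
code = do
    _ <- try (string "\\begin{code}\n")
    body <- untilP (string "\\end{code}\n")
    _ <- string "\\end{code}\n"
    return (Haskell body)

-- Prose runs until the next opening tag or the end of the input.
prose :: Parser Element
prose = Text <$> untilP (string "\\begin{code}\n" <|> (eof >> return ""))

literateFile :: Parser [Element]
literateFile = many (code <|> prose) <* eof

parseLiterate :: String -> Either ParseError [Element]
parseLiterate = parse literateFile "(unknown)"

main :: IO ()
main = do
    contents <- readFile "hello.lhs"
    print (parseLiterate contents)
```

On the sample file from the question (with a trailing newline), this should give the three elements the question was hoping for, without stray newlines around them.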