I am building a lexer/parser pair using parser combinators, which has left me with some interesting problems. I have actually solved the specific problem this question is about, but I am not completely happy with my solution.

    module Program =

        type Token = { value:string; line:int; column:int; }

        let lineTerminators = set ['\u000A'; '\u000D'; '\u2028'; '\u2029']

        let main () =
            let token = { value = "/*\r\n *\r\n *\r\n \n */"; line = 1; column = 1; }
            let chars = token.value.ToCharArray()
            let totalLines =
                chars
                |> Array.mapi(
                    fun i c ->
                        if not (lineTerminators.Contains c) then
                            0
                        else
                            if c <> '\n' || i = 0 || chars.[i - 1] <> '\r' then
                                1
                            else
                                0
                )
                |> Array.sum
            let nextLine = token.line + totalLines
            let nextColumn =
                if totalLines = 0 then
                    token.column + token.value.Length
                else
                    1 + (chars
                         |> Array.rev
                         |> Array.findIndex lineTerminators.Contains)
            System.Console.ReadKey true |> ignore

        main()

One problem with your implementation is that it seems to assume that every line terminator is a single character, which is not the case. If you treat "\r\n" as a single line terminator (composed of two characters), the situation becomes clearer. For example, I would declare terminators like this:
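A minimal sketch of such a declaration, assuming each terminator is represented as a list of characters. The name terminators and this representation are assumptions for illustration, not necessarily the original code:

```fsharp
// Assumption: every terminator is a list of characters, and the
// two-character "\r\n" is listed before the single-character ones
// so that it is tried first.
let terminators =
    [ [ '\r'; '\n' ]
      [ '\r' ]
      [ '\n' ]
      [ '\u2028' ]
      [ '\u2029' ] ]
```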
The order is significant: if we find "\r\n" first, then we want to skip 2 characters (so that we don't count the following '\n' as another terminator). Unfortunately, "skip 2 characters" is a bit tricky. It cannot be done using the mapi function, which calls the given function once for every element.
A direct implementation using a recursive function could look like this:
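As a rough sketch of such a recursive function: the names stripPrefix and countLines are hypothetical, and the terminators are assumed to be non-empty lists of characters with "\r\n" listed first.

```fsharp
// Assumed representation: terminators as character lists, longest first.
let terminators =
    [ [ '\r'; '\n' ]; [ '\r' ]; [ '\n' ]; [ '\u2028' ]; [ '\u2029' ] ]

/// If `input` starts with `prefix`, returns the rest of `input` after it.
let rec stripPrefix prefix input =
    match prefix, input with
    | [], rest -> Some rest
    | p :: ps, c :: cs when p = c -> stripPrefix ps cs
    | _ -> None

/// Counts the line terminators in a list of characters.
let countLines terminators (chars: char list) =
    let rec loop count chars =
        match chars with
        | [] -> count
        | _ :: tail ->
            // Try the terminators in order; the first one that matches wins.
            match List.tryPick (fun t -> stripPrefix t chars) terminators with
            | Some rest -> loop (count + 1) rest // skip the whole terminator
            | None -> loop count tail            // ordinary character

// The token value from the question contains three "\r\n" and one "\n".
let total = countLines terminators (List.ofSeq "/*\r\n *\r\n *\r\n \n */") // 4
```

Because a matched terminator is removed as a whole, the '\n' of a "\r\n" pair is never looked at on its own, so no check of the previous character is needed.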
If you were using a parser combinator library (such as FParsec), then you could use built-in parsers for most of this. I didn't actually try it, but here is a rough sketch: you could store the terminators as a list of strings and generate a parser for each of the strings:
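To keep the example self-contained, the sketch below does not use FParsec. It defines a tiny hand-rolled Parser type instead, so Parser, pliteral, skipChar and countLineBreaks are all hypothetical names, and the <|> defined here is a simple ordered choice written for this example. It builds one parser per terminator string and also shows the composition and skipping steps described in the text that follows:

```fsharp
// A parser takes the input and a position. On success it returns a value
// together with the position just after the consumed text.
type Parser<'a> = string -> int -> ('a * int) option

/// Succeeds with `value` if `text` occurs at the current position.
let pliteral (text: string) (value: 'a) : Parser<'a> =
    fun input pos ->
        let fits = pos + text.Length <= input.Length
        if fits && System.String.CompareOrdinal(input, pos, text, 0, text.Length) = 0 then
            Some (value, pos + text.Length)
        else
            None

/// Ordered choice: try `p`, and only if it fails try `q`.
let (<|>) (p: Parser<'a>) (q: Parser<'a>) : Parser<'a> =
    fun input pos ->
        match p input pos with
        | Some result -> Some result
        | None -> q input pos

// Terminators as strings, with "\r\n" first so that it wins over "\r".
let terminators = [ "\r\n"; "\r"; "\n"; "\u2028"; "\u2029" ]

// One parser per terminator, each returning 1.
let terminatorParsers : Parser<int> list =
    terminators |> List.map (fun t -> pliteral t 1)

// All of them aggregated into a single parser.
let lineBreak : Parser<int> = List.reduce (<|>) terminatorParsers

/// Skips exactly one character and contributes 0 to the count.
let skipChar : Parser<int> =
    fun input pos -> if pos < input.Length then Some (0, pos + 1) else None

/// Repeatedly runs "terminator or skip one character" and sums the results.
let countLineBreaks (input: string) =
    let step = lineBreak <|> skipChar
    let rec loop total pos =
        match step input pos with
        | Some (n, next) -> loop (total + n) next
        | None -> total // end of input
    loop 0 0

let total = countLineBreaks "/*\r\n *\r\n *\r\n \n */" // 4
```

Since this choice operator keeps only the first successful alternative, "\r\n" is counted once here. A library whose choice operator behaves differently may need extra care.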
This gives you a list of parsers that return 1 when they match a terminator. You could now aggregate all of them using the <|> (or) combinator and then run the composed parser. If that fails, you can skip the first character (by combining this with another parser) and continue recursively. One possible problem is that some parser combinator implementations return all possible derivations ("\r\n" can be interpreted as two line breaks), so you would need to take just the first result.
(From your question, it wasn't clear whether you actually want to use a parser combinator library or not, so I didn't elaborate on that topic. If you're interested, you can ask for more details.)