I’m writing a parser for a language that is simple enough for Genlex + Camlp4 stream parsers to handle. However, I’d still like a reasonably precise location (at least a line number) when a parse error occurs.
My idea is to insert an intermediate stream between the original char Stream and Genlex’s token Stream that keeps track of line counts, as in the code below. Is there a simpler solution?
let parse_file s =
  let num_lines = ref 1 in
  let bol = ref 0 in
  let print_pos fmt i =
    (* Emacs-friendly location *)
    Printf.fprintf fmt "File %S, line %d, characters %d-%d:"
      s !num_lines (i - !bol) (i - !bol)
  in
  (* Normal stream *)
  let chan =
    try open_in s
    with
      Sys_error e -> Printf.eprintf "Cannot open %s: %s\n%!" s e; exit 1
  in
  let chrs = Stream.of_channel chan in
  (* Capture newlines and move num_lines and bol accordingly.
     i is the index of the character being produced, so the line
     following a newline starts at i + 1. *)
  let next i =
    try
      match Stream.next chrs with
      | '\n' -> bol := i + 1; incr num_lines; Some '\n'
      | c -> Some c
    with Stream.Failure -> None
  in
  let chrs = Stream.from next in
  (* Pass that to the Genlex's lexer *)
  let toks = lexer chrs in
  let error s =
    Printf.eprintf "%a\n%s %a\n%!"
      print_pos (Stream.count chrs) s print_top toks;
    exit 1
  in
  try
    parse toks
  with
  | Stream.Failure -> error "Failure"
  | Stream.Error e -> error ("Error " ^ e)
  | Parsing.Parse_error -> error "Unexpected symbol"
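The line bookkeeping in the wrapper above does not really depend on Stream. As a rough sketch only, assuming the character source is modelled as a plain `unit -> char option` function, the same idea can be isolated like this. The names `track`, `of_string` and `location` are made up for this illustration and are not part of Genlex or Camlp4.

```ocaml
(* Position state updated as characters are read. *)
type pos = {
  mutable line : int;   (* current line number, starting at 1 *)
  mutable bol : int;    (* offset of the first character of the current line *)
  mutable count : int;  (* number of characters read so far *)
}

(* Wrap a character source so that every character read updates the
   position state. [read] must return [None] at end of input. *)
let track (read : unit -> char option) : pos * (unit -> char option) =
  let pos = { line = 1; bol = 0; count = 0 } in
  let next () =
    match read () with
    | None -> None
    | Some c ->
      pos.count <- pos.count + 1;
      if c = '\n' then begin
        pos.line <- pos.line + 1;
        pos.bol <- pos.count  (* the next line starts right after the newline *)
      end;
      Some c
  in
  (pos, next)

(* Hypothetical helper: a character source that reads from a string. *)
let of_string s =
  let i = ref 0 in
  fun () ->
    if !i < String.length s then begin
      let c = s.[!i] in
      incr i;
      Some c
    end else None

(* Emacs-friendly location, in the same format as print_pos above. *)
let location file pos =
  let col = pos.count - pos.bol in
  Printf.sprintf "File %S, line %d, characters %d-%d:" file pos.line col col

(* Usage: read four characters of "ab\ncd", then report the position. *)
let () =
  let pos, next = track (of_string "ab\ncd") in
  for _i = 1 to 4 do ignore (next ()) done;
  print_endline (location "input.txt" pos)
  (* prints: File "input.txt", line 2, characters 1-1: *)
```

Note that a lexer usually reads a little ahead of the token it returns, so a position obtained this way is only approximate.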
A much simpler solution is to use Camlp4 grammars.
Parsers built this way give you decent error messages “for free”, whereas stream parsers are a low-level tool that leaves error reporting to you.
It could be that there is no need to define your own lexer, because OCaml’s lexer suits your needs already. But if you really need your own lexer, then you can easily plug in a custom one:
If you are new to OCaml, all this module system trickery might at first seem like black voodoo magic 🙂 The fact that Camlp4 is a severely underdocumented beast might also contribute to the surreality of the experience.
So never hesitate to ask a question (even a stupid one) on the mailing list.