I need to parse files generated by a third-party application. Using ANTLR, I have constructed a parser that seemed to work fine, until I hit the following snag.
The file type is line-based and uses several keywords to define a hierarchical structure; so-called ‘blocks’, which themselves can have sub-blocks, and so on. Depending on the type of the current block, various lines have a special meaning, e.g. in one particular block, line #5 (relative to the block’s start) holds the author of the file, in another, line #3 is a file name, etc. All of these are essentially strings, i.e. the user can input anything they want for data when creating the file; but the fact that they are strings is known only implicitly, through the line number.
Because there are no quotation marks or anything to identify these strings by, my lexer occasionally tokenizes part of these texts (like numbers, or words that are identical to keywords), with the result that I can’t reliably reconstruct the original strings from the tokens in the parser’s rules.
Is it possible to handle this kind of file with a parser generator, as I am trying to do? Since I am not very well-versed in parser construction, I hope there is a simple workaround or feature of ANTLR that will help overcome this small issue.
Do not use ANTLR, Yacc, or any similar tool to parse a format like this, one that has no distinct, context-independent set of predefined tokens.
A lexerless approach (like Packrat, or any other way of interpreting PEGs) would be better.
There are many Packrat implementations around, and it is not that difficult to hand-code an ad hoc recursive descent PEG parser in any language without third-party tools, especially for a trivial grammar with no specific performance requirements.
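To give an idea of what such a hand-written parser could look like, here is a minimal Python sketch. The real file format is not shown in the question, so everything concrete here is an assumption made up for illustration: blocks open with a `BEGIN <type>` line and close with `END`, and the hypothetical `RAW_FIELDS` table says which lines right after the opening line are free text for each block type. The point is only that the parser decides from the position, not from the content, when a line is to be taken verbatim.

```python
from dataclasses import dataclass, field

# Hypothetical layout (not taken from the real format): for each block type,
# the names of the lines that directly follow "BEGIN <type>" and must be
# read verbatim as free text.
RAW_FIELDS = {"header": ["title", "author"], "file": ["name"]}


@dataclass
class Block:
    kind: str
    fields: dict = field(default_factory=dict)    # positional free-text lines
    lines: list = field(default_factory=list)     # other uninterpreted lines
    children: list = field(default_factory=list)  # nested sub-blocks


def parse_block(lines, pos):
    """Parse one block starting at lines[pos]; return (Block, next position)."""
    keyword, _, kind = lines[pos].partition(" ")
    if keyword != "BEGIN" or not kind:
        raise ValueError(f"line {pos + 1}: expected 'BEGIN <type>'")
    block = Block(kind)
    pos += 1
    # Positional lines are consumed first, without looking at their content,
    # so text that resembles a keyword or a number is preserved unchanged.
    for name in RAW_FIELDS.get(kind, []):
        if pos >= len(lines):
            raise ValueError(f"unexpected end of input in block '{kind}'")
        block.fields[name] = lines[pos]
        pos += 1
    # Only the remaining lines are interpreted structurally.
    while pos < len(lines):
        line = lines[pos]
        if line == "END":
            return block, pos + 1
        if line.startswith("BEGIN "):
            child, pos = parse_block(lines, pos)
            block.children.append(child)
        else:
            block.lines.append(line)
            pos += 1
    raise ValueError(f"missing END for block '{kind}'")


def parse(text):
    """Parse a whole file consisting of a single top-level block."""
    lines = text.splitlines()
    block, pos = parse_block(lines, 0)
    if pos != len(lines):
        raise ValueError(f"line {pos + 1}: unexpected content after top-level block")
    return block
```

With this layout, a title that is literally `END` or an author line such as `42 BEGIN` comes back as an ordinary string, because those lines are never compared against the keywords. The same idea carries over to a PEG or Packrat tool: the rule for a block names its free-text lines explicitly as "anything up to the end of the line".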