I am writing some bottom-up parsers for PHP, JavaScript, and CSS. Preferably, I would like to write one parser that can handle all three languages. I heard somewhere that JavaScript can be parsed with an LALR(1) parser (correct me if I’m wrong). Would an LALR(1) parser be sufficient for PHP and CSS, or will I need to write something different?
I doubt you can implement one parser to parse all 3 of these languages. I think you’ll need 3 parsers. They may share the parsing engine, if that’s what you mean.
You can make pretty much any parsing technology parse any language, by accepting “too much” (because the parsing machinery isn’t strong enough to discriminate) and adding post-parsing processing of the captured structure (typically ASTs) to inspect/handle/eliminate the excess accepted.
The question is just how much excess you have to accept, and how painful it is to eliminate that excess afterward.
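A minimal sketch of the accept-too-much-then-filter idea, written in Python only for brevity. Everything in it is a made-up illustration: the toy grammar, the tuple-shaped AST, and the names `tokenize`, `Parser`, `parse`, and `check_targets` are all assumptions, and the tiny hand-written parser merely stands in for whatever generated parser you would really use. The grammar deliberately accepts any expression to the left of `=`, and a separate pass over the tree rejects the targets that are not plain names.

```python
import re

# Toy lexer: numbers, names, and single-character operators.
TOKEN = re.compile(r"\s*(?:(\d+)|([A-Za-z_]\w*)|(.))")

def tokenize(src):
    tokens = []
    for number, name, op in TOKEN.findall(src):
        if number:
            tokens.append(("num", number))
        elif name:
            tokens.append(("name", name))
        elif op.strip():  # skip stray trailing whitespace
            tokens.append((op, op))
    return tokens

class Parser:
    """Stand-in parser for: stmt -> expr ['=' expr]; expr -> term {'+' term}."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos][0] if self.pos < len(self.tokens) else None

    def take(self, kind):
        if self.peek() != kind:
            raise SyntaxError(f"expected {kind!r}, got {self.peek()!r}")
        tok = self.tokens[self.pos]
        self.pos += 1
        return tok

    def statement(self):
        left = self.expr()  # over-accepts: any expression may be a target
        if self.peek() == "=":
            self.take("=")
            return ("assign", left, self.expr())
        return left

    def expr(self):
        node = self.term()
        while self.peek() == "+":
            self.take("+")
            node = ("add", node, self.term())
        return node

    def term(self):
        if self.peek() == "num":
            return ("num", self.take("num")[1])
        if self.peek() == "name":
            return ("name", self.take("name")[1])
        self.take("(")
        node = self.expr()
        self.take(")")
        return node

def parse(src):
    parser = Parser(tokenize(src))
    tree = parser.statement()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected {parser.peek()!r}")
    return tree

def check_targets(tree):
    """Post-parse pass: reject what the grammar was too weak to exclude."""
    if tree[0] == "assign" and tree[1][0] != "name":
        return [f"cannot assign to a {tree[1][0]!r} expression"]
    return []
```

Here `parse("a + 1 = 2")` succeeds even though the input is not a valid assignment, and `check_targets` is what reports the problem. The cost the answer is talking about is the number and complexity of such checks.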
So, LALR(1) will do it. There are existence proofs, too; the PHP interpreter is implemented using Bison (LALR(1)); you can discover this for yourself by downloading the PHP tarball and digging around in it.
I don’t think CSS is a tough grammar. I think there’s a lot of it, though.
JavaScript will give you a bad time with the missing-semicolon problem (automatic semicolon insertion), because the rule is essentially “if the parser would give you an error without it, and it is not present, pretend it is present”. So in essence you have to abuse the parser’s error-handling machinery to recover.
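A rough sketch of that recovery step, again in Python and again purely illustrative. The toy statement form (`name = operand + operand ...`), the class name `AsiParser`, and the `inserted` counter are assumptions for the example, and the hand-written parser is only a stand-in for an LALR error action. It covers only two cases of the real rule: an offending token preceded by a line break, and end of input. The actual ECMAScript rules also handle `}` and the restricted productions such as `return`.

```python
import re

# Each token remembers whether a line break preceded it.
TOKEN = re.compile(r"(\s*)([A-Za-z_]\w*|\d+|[=+;])")

def tokenize(src):
    tokens, pos = [], 0
    src = src.rstrip()
    while pos < len(src):
        m = TOKEN.match(src, pos)
        if not m:
            raise SyntaxError(f"bad character at offset {pos}")
        text = m.group(2)
        kind = "operand" if text[0].isalnum() or text[0] == "_" else text
        tokens.append((kind, text, "\n" in m.group(1)))
        pos = m.end()
    tokens.append(("eof", "", False))
    return tokens

class AsiParser:
    def __init__(self, src):
        self.tokens = tokenize(src)
        self.pos = 0
        self.inserted = 0  # semicolons we pretended were present

    def peek(self):
        return self.tokens[self.pos]

    def take(self, kind):
        tok = self.peek()
        if tok[0] != kind:
            raise SyntaxError(f"expected {kind!r}, got {tok[1]!r}")
        self.pos += 1
        return tok[1]

    def program(self):
        statements = []
        while self.peek()[0] != "eof":
            statements.append(self.statement())
        return statements

    def statement(self):
        target = self.take("operand")
        self.take("=")
        parts = [self.take("operand")]
        # Keep consuming while the next token is legal, even across a
        # line break: no semicolon is inserted if parsing can continue.
        while self.peek()[0] == "+":
            self.take("+")
            parts.append(self.take("operand"))
        self.end_of_statement()
        return (target, parts)

    def end_of_statement(self):
        kind, text, newline_before = self.peek()
        if kind == ";":
            self.pos += 1
        elif newline_before or kind == "eof":
            # Error recovery: the token is illegal here, so act as if a
            # ';' had been read and leave the token for the next statement.
            self.inserted += 1
        else:
            raise SyntaxError(f"unexpected {text!r}")
```

With this sketch, `"a = b\nc = d"` parses as two statements with two pretended semicolons, `"a = b\n+ c;"` parses as a single statement with none, and `"a = b c = d"` is a genuine syntax error because no line break separates the offending token.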
You’re looking at a lot of work. Wouldn’t it be easier to get existing parsers? Or do you want one unified set of machinery for a reason?