I am writing a simple parser for C. Just for fun, I ran it on source files in some other languages, to see how C-like they are (and out of laziness: I would rather not write a separate parser for each language if I can avoid it).
However, the parser seems to break down for JavaScript if the code being parsed contains regular expressions…
Case 1:
For example, while parsing the JavaScript code snippet,
var phone="(304)434-5454"
phone=phone.replace(/[\(\)-]/g, "")
//Returns "3044345454" (removes "(", ")", and "-")
The ‘(’, ‘[’ and similar characters inside the regular expression get matched as the start of new scopes, which may never be closed.
Case 2:
And, for the Perl code snippet,
# Replace backslashes with two forward slashes
# Any character can be used to delimit the regex
$FILE_PATH =~ s@\\@//@g;
The // gets matched as a comment…
How can I detect a regular expression within the content text of a “C-like” program-file?
It is impossible.
Take this, for example:
This could be either C or Perl.
A minute’s thought reveals that the number of Perl-style regular expressions that are also syntactically valid C expressions is infinite.
Another example:
The best you can get is some extremely vague guesswork. The difficulty stems from the fact that a regular expression is a sequence of characters that can be virtually anything.
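To give an idea of what such guesswork looks like for JavaScript-style literals, here is a minimal sketch in Python. It assumes that a ‘/’ starts a regular expression only when the previous significant character is one after which an expression could begin. The names find_regex_literals, _scan_regex_body and REGEX_PRECEDERS are made up for this illustration, and the character set is a guess, not a grammar. It does nothing for Perl's arbitrary delimiters.

```python
# Characters after which an expression (and hence a regex literal) may begin.
# This set is an assumption chosen for illustration; it is not a grammar.
REGEX_PRECEDERS = set("(,=:[!&|?{};+-*%<>~^")


def _scan_regex_body(src, start):
    """Return the index just past a regex literal starting at src[start],
    or None if no closing '/' is found on the same line."""
    j, n = start + 1, len(src)
    in_class = False  # inside [...] an unescaped '/' does not end the literal
    while j < n and src[j] != "\n":
        ch = src[j]
        if ch == "\\":
            j += 2  # skip the escaped character
            continue
        if ch == "[":
            in_class = True
        elif ch == "]":
            in_class = False
        elif ch == "/" and not in_class:
            j += 1
            while j < n and src[j].isalpha():  # trailing flags such as g, i, m
                j += 1
            return j
        j += 1
    return None


def find_regex_literals(src):
    """Return (start, end) index pairs of *likely* regex literals in src."""
    spans = []
    i, n = 0, len(src)
    prev = ""  # last significant (non-space, non-comment) character seen
    while i < n:
        c = src[i]
        if c in "\"'":
            # Skip a string literal, honouring backslash escapes.
            j = i + 1
            while j < n and src[j] != c and src[j] != "\n":
                j += 2 if src[j] == "\\" else 1
            i = j + 1
            prev = c
            continue
        if c == "/" and src.startswith("//", i):
            # Line comment: skip to the end of the line.
            j = src.find("\n", i)
            i = n if j == -1 else j
            continue
        if c == "/" and src.startswith("/*", i):
            # Block comment: skip to the closing "*/".
            j = src.find("*/", i + 2)
            i = n if j == -1 else j + 2
            continue
        if c == "/" and (prev == "" or prev in REGEX_PRECEDERS):
            end = _scan_regex_body(src, i)
            if end is not None:
                spans.append((i, end))
                i = end
                prev = "/"
                continue
        if not c.isspace():
            prev = c
        i += 1
    return spans
```

On the JavaScript snippet from the question this reports the single span /[\(\)-]/g, and it leaves a = b / c / d alone. It also misreads i++ / 2 / n as containing the "regex" / 2 /, which shows how fragile this kind of guessing is.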
You had better clean up your error handling. A parser should not “break down” if some parentheses are missing or superfluous ones are seen.
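As a rough illustration of that last point, a scope tracker can record problems and carry on instead of failing. The following is only a sketch in Python: check_brackets and PAIRS are hypothetical names, and the recovery strategy (ignore an unexpected closer, report leftover openers at the end) is one simple choice among many.

```python
# Map each closing bracket to the opener it must match.
PAIRS = {")": "(", "]": "[", "}": "{"}


def check_brackets(text):
    """Return a list of (index, message) diagnostics instead of raising."""
    stack = []     # (opening character, index) for every scope still open
    problems = []
    for i, ch in enumerate(text):
        if ch in "([{":
            stack.append((ch, i))
        elif ch in PAIRS:
            if stack and stack[-1][0] == PAIRS[ch]:
                stack.pop()  # properly closed scope
            else:
                # Superfluous or mismatched closer: report it and keep going.
                problems.append((i, f"unexpected '{ch}'"))
    # Anything left on the stack was opened but never closed.
    for ch, i in stack:
        problems.append((i, f"unclosed '{ch}'"))
    return problems
```

With this approach, input such as "(a" yields a diagnostic about the unclosed ‘(’ and the rest of the file can still be processed.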