I’m attempting to write an application to extract properties and code from proprietary IDE design files. The file format looks something like this:
HEADING
{
SUBHEADING1
{
PropName1 = PropVal1;
PropName2 = PropVal2;
}
SUBHEADING2
{
{ 1 ; PropVal1 ; PropValue2 }
{ 2 ; PropVal1 ; PropValue2 ; OnEvent1=BEGIN
MESSAGE('Hello, World!');
{ block comments are between braces }
//inline comments are after double-slashes
END;
PropVal3 }
{ 1 ; PropVal1 ; PropVal2; PropVal3 }
}
}
What I am trying to do is extract the contents under the subheading blocks. In the case of SUBHEADING2, I would also separate each token as delimited by the semicolons. I had reasonably good success with just counting the brackets and keeping track of what subheading I’m currently under. The main issue I encountered involves dealing with the code comments.
This language happens to use {} for block comments, which interferes with the brackets in the file format. To make it even more interesting, it also needs to take into account double-slash inline comments and ignore everything up to the end of the line.
What is the best approach to tackling this? I looked at some of the compiler libraries discussed in another article (ANTLR, Doxygen, etc.) but they seem like overkill for solving this specific parsing issue.
You should be able to put something together in a few hours, using regular expressions in combination with some code that uses the results.
Something like this should work:
– Initialize the process by loading the file into a string.
Following this, you would process HEADING, then the first SUBHEADING, then the second SUBHEADING, then each sub-block. For the sub-block containing the block comment, you would presumably know based on the block’s lack of a keyword that any sub-block is a comment, so there is no need to process the sub-blocks.