I have to parse a document containing groups of variable-value pairs, which is serialized to a string, e.g. like this:
4^26^VAR1^6^VALUE1^VAR2^4^VAL2^^1^14^VAR1^6^VALUE1^^
Here are the different elements:
- Group IDs: 4 and 1
- Length of the string representation of each group: 26 and 14
- One of the groups (the second one, 14 characters long): VAR1^6^VALUE1^
- Variables: VAR1 and VAR2 in the first group, VAR1 in the second
- Length of the string representation of the values: 6, 4 and 6
- The values themselves: VALUE1, VAL2 and VALUE1
Variables consist only of alphanumeric characters.
No assumption is made about the values, i.e. they may contain any character, including ^.
Is there a name for this kind of grammar? Is there a parsing library that can handle this mess?
So far I am using my own parser, but because I need to detect and handle corrupt serializations, the code looks rather messy. Hence my question: is there a parser library that could lift this burden?
The simplest way to approach it is to note that there are two nested levels that work the same way. The pattern is extremely simple:
id^length^content^
At the outer level, this produces a set of groups. Within each group, the content follows exactly the same pattern, only here the id is the variable name, and the content is the variable value.
So you only need to write that logic once and you can use it to parse both levels. Just write a function that breaks a string up into a list of id/content pairs. Call it once to get the groups, and then loop through them calling it again for each content to get the variables in that group.
Breaking it down into these steps, first we need a way to get “tokens” from the string. This function returns an object with three methods, to find out if we’re at “end of file”, and to grab the next delimited or counted substring:
Now we can conveniently write the reusable parse function:
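As a rough sketch of the two steps so far, here is how the token reader and the reusable parse function might look in Python (3.9 or later). The names TokenReader, at_end, next_delimited, next_counted and parse_pairs are made up for illustration, and raising ValueError on a corrupt string is just one possible way to report errors:

```python
class TokenReader:
    """Cursor over a serialized string (all names here are illustrative)."""

    def __init__(self, text: str, sep: str = "^") -> None:
        self.text = text
        self.sep = sep
        self.pos = 0

    def at_end(self) -> bool:
        """True once the whole string has been consumed."""
        return self.pos >= len(self.text)

    def next_delimited(self) -> str:
        """Return the text up to the next separator and step past it."""
        end = self.text.find(self.sep, self.pos)
        if end == -1:
            raise ValueError(f"missing {self.sep!r} after position {self.pos}")
        token = self.text[self.pos:end]
        self.pos = end + len(self.sep)
        return token

    def next_counted(self, length: int) -> str:
        """Return exactly `length` characters, which must be followed by a separator."""
        end = self.pos + length
        # Also fails when the string is too short, since there is no separator at `end`.
        if not self.text.startswith(self.sep, end):
            raise ValueError(f"expected {self.sep!r} at position {end}")
        token = self.text[self.pos:end]
        self.pos = end + len(self.sep)
        return token


def parse_pairs(text: str) -> dict[str, str]:
    """Split one level of id^length^content^ records into a dict."""
    reader = TokenReader(text)
    pairs: dict[str, str] = {}
    while not reader.at_end():
        key = reader.next_delimited()
        length_text = reader.next_delimited()
        if not (length_text.isascii() and length_text.isdigit()):
            raise ValueError(f"bad length {length_text!r} for {key!r}")
        # The content is read by count, so it may itself contain the separator.
        pairs[key] = reader.next_counted(int(length_text))
    return pairs
```

With this sketch, parse_pairs("VAR1^6^VALUE1^VAR2^4^VAL2^") gives {"VAR1": "VALUE1", "VAR2": "VAL2"}, and a value such as x^y in "A^3^x^y^" survives intact because it is read by length rather than by delimiter. A wrong length or a missing separator raises ValueError, which is where the handling of corrupt serializations would hook in.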
It builds an object where the keys are the IDs (or variable names). I’m assuming that, since they have names, the order isn’t significant.
Then we can use that at both levels to create the function to do the whole job:
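A minimal Python sketch of that last step could look like the following. To keep the snippet standing on its own, it takes the single-level parse function as an argument; the names parse_document and parse_level are hypothetical, and parse_level is assumed to map a string to a dict of id/content pairs as described above:

```python
from typing import Callable

# Assumed shape of the single-level parser: string in, {id: content} out.
LevelParser = Callable[[str], dict[str, str]]


def parse_document(text: str, parse_level: LevelParser) -> dict[str, dict[str, str]]:
    """Apply the same level parser twice: once for groups, once per group."""
    groups = parse_level(text)  # outer level: group id -> raw group content
    return {
        group_id: parse_level(content)  # inner level: variable name -> value
        for group_id, content in groups.items()
    }
```

Passing the parser in is only a convenience for this sketch; calling the parse function directly inside parse_document would work just as well.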
For your example, it produces this object:
{"4": {"VAR1": "VALUE1", "VAR2": "VAL2"}, "1": {"VAR1": "VALUE1"}}