I have a set of strings with numbers embedded in them. They look something like /cal/long/3/4/145:999 or /pa/metrics/CosmicRay/24:4:bgp:EnergyKurtosis. I’d like to have an expression parser that is
- Easy to use. Given a few examples someone should be able to form a new expression. I want end users to be able to form new expressions to query this set of strings. Some of the potential users are software engineers, others are testers and some are scientists.
- Allows for constraints on numbers. Something like ‘/cal/long/3/4/143:#>100&<1110’ to specify that the prefix ‘/cal/long/3/4/143:’ is expected, followed by a number in the open interval (100, 1110).
- Supports ‘|’ and ‘*’. So the expression ‘/cal/(long|short)/3/4/*’ would match ‘/cal/long/3/4/1:2’ as well as ‘/cal/short/3/4/1:2’.
- Has a Java implementation available or would be easy to implement in Java.
Interesting alternative ideas would be useful. I’m also entertaining the idea of just implementing the subset of regular expressions that I need plus the numerical constraints.
Thanks!
I’m inclined to agree with Rex M, although your second requirement for numerical constraints complicates things. Unless you only allowed very basic constraints, I’m not aware of a way to succinctly express that in a regular expression. If there is such a way, please disregard the rest of my answer and follow the other suggestions here. 🙂
You might want to consider a parser generator – things like the classic lex and yacc. I’m not really familiar with the Java choices, but here’s a list:
http://java-source.net/open-source/parser-generators
If you’re not familiar with them, the standard approach is to first create a lexer that turns your strings into tokens. You then pass those tokens on to a parser that applies your grammar to them and spits out some kind of result.
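To make the lexer step a bit more concrete, here is a rough hand-written sketch in Java rather than output from a parser generator. Everything in it is an assumption for illustration: the class name ExprLexer, the token kinds, and the rule that digits count as a NUMBER only when they directly follow ‘>’ or ‘<’ (so the digits inside the path stay part of a LITERAL). The operator characters are taken from the examples in the question.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical lexer for the query syntax sketched in the question.
final class ExprLexer {
    enum Kind { LITERAL, HASH, GT, LT, AND, OR, LPAREN, RPAREN, STAR, NUMBER }

    static final class Token {
        public final Kind kind;
        public final String text;

        Token(Kind kind, String text) {
            this.kind = kind;
            this.text = text;
        }

        @Override
        public String toString() {
            return kind + "(" + text + ")";
        }
    }

    // Maps a single operator character to its token kind, or null for ordinary text.
    private static Kind operatorKind(char c) {
        switch (c) {
            case '#': return Kind.HASH;
            case '>': return Kind.GT;
            case '<': return Kind.LT;
            case '&': return Kind.AND;
            case '|': return Kind.OR;
            case '(': return Kind.LPAREN;
            case ')': return Kind.RPAREN;
            case '*': return Kind.STAR;
            default:  return null;
        }
    }

    static List<Token> tokenize(String expr) {
        List<Token> out = new ArrayList<>();
        int i = 0;
        while (i < expr.length()) {
            char c = expr.charAt(i);
            Kind op = operatorKind(c);
            if (op != null) {
                out.add(new Token(op, String.valueOf(c)));
                i++;
                // Assumption: a comparison operator must be followed by digits.
                if (op == Kind.GT || op == Kind.LT) {
                    int start = i;
                    while (i < expr.length() && Character.isDigit(expr.charAt(i))) {
                        i++;
                    }
                    if (start == i) {
                        throw new IllegalArgumentException("expected a number at index " + start);
                    }
                    out.add(new Token(Kind.NUMBER, expr.substring(start, i)));
                }
            } else {
                // Everything up to the next operator character is literal text.
                int start = i;
                while (i < expr.length() && operatorKind(expr.charAt(i)) == null) {
                    i++;
                }
                out.add(new Token(Kind.LITERAL, expr.substring(start, i)));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("/cal/long/3/4/143:#>100&<1110"));
    }
}
```

A parser would then walk this token list and decide what each sequence means, for example that HASH followed by GT and a NUMBER is a lower bound on a captured number.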
In your case, I envision the parser resulting in a combination of a regular expression and additional conditions. For your numerical constraint example, it might give you the regular expression
/cal/long/3/4/143:(\d+)
and a constraint to apply to the first grouping (the \d+ portion) that requires that the number lie between 100 and 1110. You’d then apply the RE to your strings to find candidates, and apply the constraint to those candidates to find your matches.
It’s a pretty complicated approach, so hopefully there’s a simpler way. I hope that gives you some ideas, at least.
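For what it’s worth, the “regular expression plus constraint” result could look roughly like this in Java, using only java.util.regex from the standard library. The class name ConstrainedMatcher is made up for this sketch, and I’m assuming the bounds are exclusive and taken from the question’s example (#>100&<1110). The regular expression and the predicate are written by hand here; in the full approach they would be produced by the parser.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.LongPredicate;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical pairing of a regular expression with a numeric constraint
// on its first capturing group.
final class ConstrainedMatcher {
    private final Pattern pattern;
    private final LongPredicate constraint;

    ConstrainedMatcher(String regex, LongPredicate constraint) {
        this.pattern = Pattern.compile(regex);
        this.constraint = constraint;
    }

    // Step 1: the whole string must match the RE.
    // Step 2: the number captured by group 1 must satisfy the constraint.
    boolean matches(String candidate) {
        Matcher m = pattern.matcher(candidate);
        if (!m.matches()) {
            return false;
        }
        try {
            return constraint.test(Long.parseLong(m.group(1)));
        } catch (NumberFormatException e) {
            // The digit run does not fit in a long; treated as a non-match here.
            return false;
        }
    }

    List<String> filter(List<String> inputs) {
        List<String> out = new ArrayList<>();
        for (String s : inputs) {
            if (matches(s)) {
                out.add(s);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed translation of '/cal/long/3/4/143:#>100&<1110'.
        ConstrainedMatcher m = new ConstrainedMatcher(
                "/cal/long/3/4/143:(\\d+)",
                n -> n > 100 && n < 1110);
        List<String> inputs = Arrays.asList(
                "/cal/long/3/4/143:999",
                "/cal/long/3/4/143:50",
                "/cal/long/3/4/145:999");
        System.out.println(m.filter(inputs));
    }
}
```

Keeping the numeric check outside the regular expression means the RE stays simple, and the constraint can be any predicate, not only a range.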