Note: This is a follow up to this question.
I have a “legacy” program which does hundreds of string matches against big chunks of HTML. For example if the HTML matches 1 of 20+ strings, do something. If it matches 1 of 4 other strings, do something else. There are 50-100 groups of these strings to match against these chunks of HTML (usually whole pages).
I’m taking a whack at refactoring this mess of code and trying to come up with a good approach to do all these matches.
The performance requirements of this code are rather strict. It needs to not wait on I/O when doing these matches so they need to be in memory. Also there can be 100+ copies of this process running at the same time so large I/O on startup could cause slow I/O for other copies.
With these requirements in mind it would be most efficient if only one copy of these strings are stored in RAM (see my previous question linked above).
This program currently runs on Windows with Microsoft compiler but I’d like to keep the solution as cross-platform as possible so I don’t think I want to use PE resource files or something.
Mmapping an external file might work but then I have the issue of keeping program version and data version in sync, one does not normally change without the other. Also this requires some file “format” which adds a layer of complexity I’d rather not have.
So after all of this pre-amble it seems like the best solution is to have a bunch arrays of strings which I can then iterate over. This seems kind of messy as I’m mixing code and data heavily, but with the above requirements is there any better way to handle this sort of situation?
I’m not sure just how slow the current implementation is. So it’s hard to recommend optimizations without knowing what level of optimization is needed.
Given that, however, I might suggest a two-stage approach. Take your string list and compile it into a radix tree, and then save this tree to some custom format (XML might be good enough for your purposes).
Then your process startup should consist of reading in the radix tree, and matching. If you want/need to optimize the memory storage of the tree, that can be done as a separate project, but it sounds to me like improving the matching algorithm would be a more efficient use of time. In some ways this is a ‘roll your own regex system’ idea. Rather similar to the suggestion to use a parser generator.
Edit: I’ve used something similar to this where, as a precompile step, a custom script generates a somewhat optimized structure and saves it to a large char* array. (obviously it can’t be too big, but it’s another option)
The idea is to keep the list there (making maintenance reasonably easy), but having the pre-compilation step speed up the access during runtime.