Let’s say I have a paragraph of text like so:
Snails can be found in a very wide
range of environments including
ditches, deserts, and the abyssal
depths of the sea. Numerous kinds of snail can
also be found in fresh waters. (source)
I have 10,000 regex rules to match text, which can overlap. For example, the regex /Snails? can/i will find two matches (italicized in the text). The regex /can( also)? be/i has two matches (bolded).
After iterating through my regexes and finding matches, what is the best data structure to use, that given some place in the text, it returns all regexes that mached it? For example, if I want the matches for line 1, character 8 (0-based, which is the a in can), I would get a match for both regexes previously described.
I can create a hashmap: (key: character location, value: set of all matching regexes). Is this optimal? Is there a better way to parse the text with thousands of regexes (to not loop through each one)?
Thanks!
Storing all of the matches in a dictionary will work, but will it means you’ll have to store all of the matches in memory at the same time. If your data is small enough to easily fit into memory, don’t worry about it. Just do what works and move on.
If you do need to reduce memory usage of increase speed it really depends on how you are using the data. For example, if you process positions starting at the beginning and going to the end, you could use re.finditer to iteratively process all of the regexes and not maintain extra matches in memory longer then needed.