I’m splitting a potentially large string (let’s say 20MB, though this is entirely arbitrary) into tokens defined by a list of regular expressions.
My current algorithm takes the following approach:
- All regexes are optimized to have the zero-width assertion `^` at the start of them
- For each regex in the list, I attempt to `#slice!` the input string
- If we `#slice!` anything, we got a match AND the input string has been advanced ready to find the next token (since `#slice!` modifies the string)
Unfortunately this is slow, which is due to the repeated `#slice!` on the long string… it seems like modifying large strings in Ruby isn’t fast.
So I wonder if there’s a way to match my regexes against the new substring (i.e. the remainder of the string) without modifying it?
Current algorithm in (tested, runnable) pseudo-code:
rules = {
  :foo => /^foo/,
  :bar => /^bar/,
  :int => /^[0-9]+/
}
input = "1foofoo23456bar1foo"
# or if you want your computer to cry
# input = "1foofoo23456bar1foo" * 1_000_000
tokens = []
until input.length == 0
  matched = rules.detect do |(name, re)|
    if match = input.slice!(re)
      tokens << { :rule => name, :value => match }
    end
  end
  raise "Unconsumed input: #{input}" unless matched
end
pp tokens
# =>
[{:rule=>:int, :value=>"1"},
{:rule=>:foo, :value=>"foo"},
{:rule=>:foo, :value=>"foo"},
{:rule=>:int, :value=>"23456"},
{:rule=>:bar, :value=>"bar"},
{:rule=>:int, :value=>"1"},
{:rule=>:foo, :value=>"foo"}]
Note that simply matching the regexes against the string the same number of times, without modifying it, is not fast by any means, but it is not so slow that you’d have time to cook a pizza while you wait (a few seconds vs. many, many minutes).
The `String#match()` method has a two-parameter version, which will match a regular expression starting at a specific character position in the string. You just need to get one-past-the-last-matching-character from the previous match as the starting position for the new match. In untested, not-run pseudo-code:
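A minimal sketch of that idea, under a few assumptions of mine rather than anything taken from the question: the rules are rewritten to start with `\G` instead of `^`, because `^` means "start of a line" and so would not anchor at an arbitrary offset, whereas `\G` anchors at the position where the search begins. The `tokenize` helper and the `RULES` constant are hypothetical names used only for illustration.

```ruby
# Hypothetical rule table: same patterns as in the question, but anchored
# with \G so each one can only match exactly at the given start position.
RULES = {
  :foo => /\Gfoo/,
  :bar => /\Gbar/,
  :int => /\G[0-9]+/
}

# Walks the input with an integer offset instead of slicing the string,
# so the input itself is never modified.
def tokenize(input, rules = RULES)
  tokens = []
  pos = 0
  while pos < input.length
    name = nil
    match = nil
    rules.each do |rule_name, re|
      m = input.match(re, pos)          # two-argument form: start at pos
      next unless m && m.end(0) > pos   # ignore failures and empty matches
      name = rule_name
      match = m
      break
    end
    raise "Unconsumed input at position #{pos}" unless match
    tokens << { :rule => name, :value => match[0] }
    pos = match.end(0)                  # one past the last matched character
  end
  tokens
end

tokens = tokenize("1foofoo23456bar1foo")
```

For the sample input this should produce the same seven tokens as the `#slice!` version. If it still turns out to be too slow, the standard library's `StringScanner` (from `strscan`) is built around the same "keep a position, match at it" approach and may be worth trying instead.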