I’m splitting a potentially large string (let’s say 20MB, though this is entirely arbitrary)

Asked: May 26, 2026

I’m splitting a potentially large string (let’s say 20MB, though this is entirely arbitrary) into tokens defined by a list of regular expressions.

My current algorithm takes the following approach:

1. All regexes are anchored with the zero-width assertion ^ at the start.
2. For each regex in the list, I attempt to #slice! the input string.
3. If #slice! returns anything, we got a match AND the input string has been advanced, ready to find the next token (since #slice! modifies the string).

Unfortunately this is slow because of the repeated #slice! calls on the long string; modifying large strings in Ruby does not seem to be fast.

So I wonder if there’s a way to match my regexes against the remainder of the string without modifying it?

Current algorithm in (tested, runnable) pseudo-code:

  rules = {
    :foo => /^foo/,
    :bar => /^bar/,
    :int => /^[0-9]+/
  }

  input = "1foofoo23456bar1foo"
  # or if you want your computer to cry
  # input = "1foofoo23456bar1foo" * 1_000_000

  tokens = []

  until input.length == 0
    matched = rules.detect do |(name, re)|
      if match = input.slice!(re)
        tokens << { :rule => name, :value => match }
      end
    end

    raise "Unconsumed input: #{input}" unless matched
  end

  pp tokens
  # =>
  [{:rule=>:int, :value=>"1"},
   {:rule=>:foo, :value=>"foo"},
   {:rule=>:foo, :value=>"foo"},
   {:rule=>:int, :value=>"23456"},
   {:rule=>:bar, :value=>"bar"},
   {:rule=>:int, :value=>"1"},
   {:rule=>:foo, :value=>"foo"}]

Note that simply matching the regexes against the string an equivalent number of times, without modifying it, is not fast by any means, but it is not so slow that you’d have time to cook a pizza while you wait (a few seconds versus many, many minutes).

1 Answer

Editorial Team · Answer 1 · 2026-05-26T21:46:50+00:00

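String#match has a two-argument form that starts searching at a given character position in the string. Use the end offset of the previous match (one past its last character, available as MatchData#end(0)) as the starting position for the next match. Note that ^ anchors to the start of a line, not to the search position, so with this approach the rules should use \G instead, which matches only where the search begins.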
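To make the positional matching concrete, here is a minimal sketch. The helper name token_at is hypothetical (it is not part of Ruby or of the question's code); it only wraps String#match with an offset and returns the matched text together with the position where the next search should begin. It also illustrates why the anchor matters: ^ does not match in the middle of a line, and an unanchored pattern silently skips ahead.

```ruby
# Hypothetical helper: try `re` against `input` starting at character
# offset `pos`. Returns [matched_text, next_position] or nil.
def token_at(input, re, pos)
  m = input.match(re, pos)
  m && [m[0], m.end(0)]
end

input = "1foofoo"

p token_at(input, /^foo/, 1)   # => nil, because ^ only matches at a line start
p token_at(input, /\Gfoo/, 1)  # => ["foo", 4], \G matches where the search begins
p token_at(input, /\Gfoo/, 4)  # => ["foo", 7]
p token_at(input, /foo/, 0)    # => ["foo", 4], unanchored: skips over the "1"
```

The second element of each result is what would be fed back in as the next starting position.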

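A sketch of that approach (the rules use \G in place of ^ so that each pattern must match exactly at input_pos):

rules = {
  :foo => /\Gfoo/,
  :bar => /\Gbar/,
  :int => /\G[0-9]+/
}

input = "1foofoo23456bar1foo"
input_pos = 0
tokens = []

until input_pos == input.length
  matched = rules.detect do |(name, re)|
    if match = input.match(re, input_pos)
      tokens << { :rule => name, :value => match[0] }
      input_pos = match.end(0)   # one past the last matched character
    end
  end

  raise "Unconsumed input at offset #{input_pos}" unless matched
end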
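As an alternative to tracking the position by hand, Ruby's standard library ships StringScanner (require "strscan"), which keeps its own scan pointer and advances it on each successful scan instead of modifying the string. The following is only a sketch under assumptions: the method name tokenize and the constant RULES are hypothetical, and the patterns are written without a leading anchor because StringScanner#scan only matches at the current scan pointer. It also assumes no rule can match the empty string, otherwise the loop would not advance.

```ruby
require "strscan"

# Hypothetical rule table: unanchored patterns, since StringScanner#scan
# only attempts a match at the current scan pointer.
RULES = {
  :foo => /foo/,
  :bar => /bar/,
  :int => /[0-9]+/
}

# Hypothetical helper: split `input` into tokens using `rules`.
def tokenize(input, rules)
  scanner = StringScanner.new(input)
  tokens = []

  until scanner.eos?
    matched = rules.detect do |(name, re)|
      # scan returns the matched text and advances the pointer, or nil.
      if value = scanner.scan(re)
        tokens << { :rule => name, :value => value }
      end
    end

    raise "Unconsumed input at offset #{scanner.pos}" unless matched
  end

  tokens
end

p tokenize("1foofoo23456bar1foo", RULES)
```

For the sample input this should produce the same seven tokens as the question's original loop, without ever shortening the input string.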
