I am trying to come up with a GNU extended regular expression that detects

Question

0

Asked: May 16, 20262026-05-16T12:01:09+00:00 2026-05-16T12:01:09+00:00

I am trying to come up with a GNU extended regular expression that detects

0

I am trying to come up with a GNU extended regular expression that detects repeated substrings in a string of ascii-encoded bits. I have an expression that works — sort of. The problem is that it executes really slowly when given a string that could have many solutions

The expression

([01]+)(\1)+

compiles quickly, but takes about a minute to execute against the string

1010101010101010101010101010101010101010101010101010101010

I am using the regex implementation from glibc 2.5-49 ( comes with CentOS 5.5.)

FWIW, the pcre library executes quickly, as in gregexp or perl directly. So the obvious, but wrong, answer is “use libpcre”. I cannot easily introduce a new dependency in my project. I need to work within the std C library that comes with CentOS/RHEL.

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-05-16T12:01:09+00:00

If the input string can be of any considerable length, or if performance is at all a concern, then one of the better ways to solve this problem is not with regex, but with a more sophisticated string data structure that facilitates these kinds of queries much more efficiently.

Such a data structure is a suffix tree. Given a string S, its suffix tree is essentially the Patricia trie of all of its suffixes. Despite its seeming complexity, it can be built in linear time.

Suffix tree for "BANANA"(courtesy of Wikipedia)

You can do many kinds of queries really efficiently with a suffix tree, e.g. finding all occurences of a substring, the longest substring that occurs at least twice, etc. The kind of strings that you’re after is called tandem repeats. To facilitate this query you’d have to preprocess the suffix tree in linear time so you can do lowest common ancestor queries in constant time.

This problem is very common in computational biology, where the DNA can be viewed as a VERY long string consisting of letters in ACGT. Thus, performance and efficiency is of utmost importance, and these very sophisticated algorithms and techniques were devised.

You should look into either implementing these techniques from scratch for your binary sequence, or perhaps it’s easier to map your binary sequence to a “fake” DNA string and then using one of the many tools available for gene research.

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

I am trying to come up with a GNU extended regular expression that detects

Leave an answerCancel reply

1 Answer

See also

Leave an answer
Cancel reply