I am looking for a fast algorithm for search purpose in a huge string

Asked: May 28, 2026

I am looking for a fast algorithm for searching a huge string (an organism's genome sequence, composed of hundreds of millions to billions of characters).

The string contains only four characters, {A, C, G, T}; “A” can pair only with “T”, and “C” only with “G”.

I am searching for two substrings that can pair with one another in antiparallel orientation. Both substrings must have a length between minLen and maxLen, and the interval between them must have a length between intervalMinLen and intervalMaxLen.

For example,
The string is: ATCAG GACCA TACGC CTGAT

Constraints: minLen = 4, maxLen = 5, intervalMinLen = 9, intervalMaxLen = 10

The result should be

“ATCAG” pair with “CTGAT”
“TCAG” pair with “CTGA”
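
For readers unfamiliar with the term, antiparallel pairing can be read as: the first substring, taken left to right, is complementary to the second substring taken right to left. The following is a minimal sketch of that check under this reading; the names `PairingCheck`, `Complement`, and `PairsAntiparallel` are hypothetical and not part of the original question.

```csharp
using System;

public static class PairingCheck
{
    // Complement of a single base; any character other than A, C, G, T is rejected.
    public static char Complement(char b)
    {
        switch (b)
        {
            case 'A': return 'T';
            case 'T': return 'A';
            case 'C': return 'G';
            case 'G': return 'C';
            default: throw new ArgumentException("Unexpected base: " + b);
        }
    }

    // True when left[i] is complementary to right[right.Length - 1 - i] for every i,
    // i.e. the two strings pair when read in opposite directions.
    public static bool PairsAntiparallel(string left, string right)
    {
        if (left.Length != right.Length) return false;
        for (int i = 0; i < left.Length; i++)
        {
            if (Complement(left[i]) != right[right.Length - 1 - i]) return false;
        }
        return true;
    }
}
```

Under this reading, `PairingCheck.PairsAntiparallel("ATCAG", "CTGAT")` and `PairingCheck.PairsAntiparallel("TCAG", "CTGA")` both return true, matching the expected result above.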

Thanks in advance.

Update: I already have a method to determine whether two strings can pair with one another. My only concern is that an exhaustive search is very time-consuming.

1 Answer

Editorial Team · Answer 1 · 2026-05-28T04:19:21+00:00

I thought this was an interesting problem, so I put together a program based on ‘foldings’: for each possible ‘fold point’ in the genome, it scans outward looking for symmetrical matches on either side. If N is the number of nucleotides and M is intervalMaxLen - intervalMinLen, the running time should be O(N*M). I may have missed some boundary cases, so use the code with care, but it does work for the example provided. Note that I store the genome in a padded intermediate buffer, which reduces the number of boundary comparisons required in the inner loops; this trades additional memory allocation for better speed. Feel free to edit the post if you make any corrections or improvements.
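
To illustrate what a single fold does on its own, here is a minimal sketch that scans outward from one fold point and counts how many consecutive bases pair. It is a simplified illustration rather than the search itself: it handles only a fold placed between two positions with an even interval, it uses explicit bounds checks instead of a padded buffer, and the names `FoldDemo` and `RunLength` are hypothetical.

```csharp
public static class FoldDemo
{
    static bool IsPaired(char a, char b)
    {
        return (a == 'A' && b == 'T') || (a == 'T' && b == 'A')
            || (a == 'C' && b == 'G') || (a == 'G' && b == 'C');
    }

    // Fold the string between positions center-1 and center, leave 'gap'
    // characters (assumed even) unpaired around the fold, then step outward
    // and count how many consecutive base pairs are found.
    public static int RunLength(string s, int center, int gap)
    {
        int left = center - 1 - gap / 2;   // innermost base of the left arm
        int right = center + gap / 2;      // innermost base of the right arm
        int run = 0;
        while (left >= 0 && right < s.Length && IsPaired(s[left], s[right]))
        {
            run++;
            left--;
            right++;
        }
        return run;
    }
}
```

For the example genome, `FoldDemo.RunLength("ATCAGGACCATACGCCTGAT", 10, 10)` returns 5: the fold between positions 9 and 10 with an interval of 10 pairs "ATCAG" with "CTGAT". The full search repeats this kind of scan for every fold point and every allowed interval.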

using System;
using System.Collections.Generic;

class Program
{
    public sealed class Pairing
    {
        public int Index { get; private set; }

        public int Length { get; private set; }

        public int Offset { get; private set; }

        public Pairing(int index, int length, int offset)
        {
            Index = index;
            Length = length;
            Offset = offset;
        }
    }

    public static IEnumerable<Pairing> FindPairings(string genome, int minLen, int maxLen, int intervalMinLen, int intervalMaxLen)
    {
        int n = genome.Length;

        //Largest half-width k whose interval stays within intervalMaxLen:
        //an odd fold leaves 2k-1 characters between the two arms, an even fold leaves 2k
        int maxKOdd = (intervalMaxLen + 1)/2;
        int maxKEven = intervalMaxLen/2;

        //Pad both ends with '\0', which never pairs, so the inner loops need no
        //genome boundary checks; the pad covers the farthest index they can probe
        int pad = maxKOdd + maxLen + 1;
        var padding = new string((char)0, pad);
        var padded = string.Concat(padding, genome, padding);

        int start = (intervalMinLen + minLen)/2 + pad;
        int end = n - (intervalMinLen + minLen)/2 + pad;

        //Consider 'fold locations' along the genome
        for (int i=start; i<end; i++)
        {
            //Consider 'odd' folding (centered on index) about index i
            int k = (intervalMinLen+2)/2;
            while (k<=maxKOdd)
            {
                int matchLength = 0;
                //Test the bound on k first so that padded[] is never read out of range
                while ((k <= (maxKOdd+maxLen)) && IsPaired(padded[i - k], padded[i + k]))
                {
                    matchLength++;

                    if (matchLength >= minLen && matchLength <= maxLen)
                    {
                        //Index is relative to the unpadded genome; Offset is the distance between the two substring starts
                        yield return new Pairing(i-k - pad, matchLength, 2*k - (matchLength-1));
                    }
                    k++;
                }
                k++;
            }

            //Consider 'even' folding (centered before index) about index i
            k = (intervalMinLen+1)/2;
            while (k <= maxKEven)
            {
                int matchLength = 0;
                while ((k <= (maxKEven+maxLen)) && IsPaired(padded[i - (k+1)], padded[i + k]))
                {
                    matchLength++;

                    if (matchLength >= minLen && matchLength <= maxLen)
                    {
                        yield return new Pairing(i - (k+1) - pad, matchLength, 2*k + 1 - (matchLength-1));
                    }
                    k++;
                }
                k++;
            }
        }
    }

    private const int SumAT = 'A' + 'T';
    private const int SumGC = 'G' + 'C';
    private static bool IsPaired(char a, char b)
    {
        return (a + b) == SumAT || (a + b) == SumGC;
    }


    static void Main(string[] args)
    {
        string genome = "ATCAGGACCATACGCCTGAT";
        foreach (var pairing in FindPairings(genome, 4, 5, 9, 10))
        {
            Console.WriteLine("'{0}' pair with '{1}'",
                              genome.Substring(pairing.Index, pairing.Length),
                              genome.Substring(pairing.Index + pairing.Offset, pairing.Length));
        }
        Console.ReadKey();
    }


}
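
Since boundary cases may have been missed, it can help to cross-check the fold-based search against a straightforward exhaustive search on small inputs. The following is a minimal sketch of such a baseline, assuming the interval is the number of characters strictly between the two substrings; the class `BruteForcePairings` and its `Find` method are hypothetical names, and the approach is far too slow for a real genome.

```csharp
using System;
using System.Collections.Generic;

public static class BruteForcePairings
{
    static bool IsPaired(char a, char b)
    {
        return (a == 'A' && b == 'T') || (a == 'T' && b == 'A')
            || (a == 'C' && b == 'G') || (a == 'G' && b == 'C');
    }

    // Returns (leftStart, rightStart, length) for every pair of equal-length
    // substrings whose interval lies in [intervalMinLen, intervalMaxLen] and
    // whose bases pair one-to-one when read in opposite directions.
    public static List<Tuple<int, int, int>> Find(string genome, int minLen, int maxLen,
                                                  int intervalMinLen, int intervalMaxLen)
    {
        var results = new List<Tuple<int, int, int>>();
        for (int len = minLen; len <= maxLen; len++)
        {
            for (int left = 0; left + len <= genome.Length; left++)
            {
                for (int gap = intervalMinLen; gap <= intervalMaxLen; gap++)
                {
                    int right = left + len + gap;
                    if (right + len > genome.Length) break; // larger gaps cannot fit either

                    bool ok = true;
                    for (int j = 0; j < len && ok; j++)
                    {
                        ok = IsPaired(genome[left + j], genome[right + len - 1 - j]);
                    }
                    if (ok) results.Add(Tuple.Create(left, right, len));
                }
            }
        }
        return results;
    }
}
```

For the example genome with constraints (4, 5, 9, 10), `Find` returns two entries: (1, 15, 4) for "TCAG" with "CTGA" and (0, 15, 5) for "ATCAG" with "CTGAT". On inputs with long paired runs and a wide interval range the two searches may not agree exactly, because the fold-based program appears to report only pairings that begin at the innermost base pair of each run.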
