I am looking for a fast algorithm for searching a huge string (an organism's genome sequence, hundreds of millions to billions of characters long).
There are only 4 chars {A,C,G,T} present in this string, and “A” can only pair with “T” while “C” pairs with “G”.
Now I am searching for two substrings that can pair with one another in antiparallel orientation, i.e. one reads as the reverse complement of the other. Each substring's length must lie in [minLen, maxLen], and the length of the interval between them must lie in [intervalMinLen, intervalMaxLen].
For example,
The string is: ATCAG GACCA TACGC CTGAT
Constraints: minLen = 4, maxLen = 5, intervalMinLen = 9, intervalMaxLen = 10
The result should be:
- “ATCAG” pairs with “CTGAT”
- “TCAG” pairs with “CTGA”
Thanks in advance.
Update: I already have a method to determine whether two strings can pair with one another. My only concern is that an exhaustive search is very time-consuming.
I thought this was an interesting problem, so I put together a program based on ‘foldings’: it scans outward from each ‘fold point’ looking for symmetrical matches. If N is the number of nucleotides and M is ‘maxInterval-minInterval’, the running time should be O(N*M). I may have missed some boundary cases, so use the code with care, but it does work for the example provided. Note that I’ve used a padded intermediate buffer to store the genome, as this reduces the number of boundary checks needed in the inner loops; it trades some extra memory for better speed. Feel free to edit the post if you make any corrections or improvements.