I am trying to parse large strings that I have loaded into memory from

Question

0

Asked: June 10, 20262026-06-10T09:58:38+00:00 2026-06-10T09:58:38+00:00

I am trying to parse large strings that I have loaded into memory from

0

I am trying to parse large strings that I have loaded into memory from a file. I am parsing DNA sequences (stored as string) with a sliding window of variable length. The problem is that the strings are so large that it takes a very long time to iterate through them. I don’t know if it is even possible but is it possible to speed this up somehow?

I mean I expected I/O to dominate my application so I shift my line by line reading to reading the whole file into memory at once, but after testing my code I found it spends most of the time in a loop like this:

size_t currentCharNumber = 0;
int16_t windowSize = 50;
//seq is a string of length 249250621
while(seq.length() - currentLinePos < windowSize)
{
   string temp = seq.substr(currentLinePos, windowSize);
   //do stuff to temp
   ++currentLinePos;
}

It only take seconds for loading the sequence into memory from a file but takes ~30 mins to parse the sequence (even after commenting out the processing below the substr() call). Is there something I’m missing that is adding a lot of overhead or is it probably due to the size of my data?

Would it be helpful to mention that I can ignore substrings with characters other that ATCG? I mean I do this filtering in my code but only after I get the strings from substr.

This is my first time posting, and my C++ is a little rusty. Any feedback would be greatly appreciated.

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-06-10T09:58:39+00:00

You might want to consider switching from using a string to hold your sliding window to using a std::deque<char>. The deque type is optimized for inserting and removing values at either end, and is therefore a good candidate here. You can start by loading the first 50 characters into the deque, and then can adjust your loop like this:

/* Initialize the window to the first windowSize characters. */
std::deque<char> window(seq.begin(), seq.begin() + windowSize);

/* Repeatedly process each window. */
for (size_t i = windowSize; i < seq.length(); ++i) {
    /* Do something to window */

    /* Drop the first character from the window, then add the next character
     * of the sequence.
     */
    window.pop_front();
    window.push_back(seq[i]);     
}

This makes the time to construct each window O(1) instead of O(k), where k is the number of characters in the window. This could dramatically decrease the runtime, since the number of characters in your window is pretty large.

Hope this helps!

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

I am trying to parse large strings that I have loaded into memory from

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply