I am trying to parse large strings that I have loaded into memory from a file. I am parsing DNA sequences (stored as string) with a sliding window of variable length. The problem is that the strings are so large that it takes a very long time to iterate through them. I don’t know if it is even possible but is it possible to speed this up somehow?
I mean I expected I/O to dominate my application so I shift my line by line reading to reading the whole file into memory at once, but after testing my code I found it spends most of the time in a loop like this:
size_t currentCharNumber = 0;
int16_t windowSize = 50;
//seq is a string of length 249250621
while(seq.length() - currentLinePos < windowSize)
{
string temp = seq.substr(currentLinePos, windowSize);
//do stuff to temp
++currentLinePos;
}
It only take seconds for loading the sequence into memory from a file but takes ~30 mins to parse the sequence (even after commenting out the processing below the substr() call). Is there something I’m missing that is adding a lot of overhead or is it probably due to the size of my data?
Would it be helpful to mention that I can ignore substrings with characters other that ATCG? I mean I do this filtering in my code but only after I get the strings from substr.
This is my first time posting, and my C++ is a little rusty. Any feedback would be greatly appreciated.
You might want to consider switching from using a
stringto hold your sliding window to using astd::deque<char>. Thedequetype is optimized for inserting and removing values at either end, and is therefore a good candidate here. You can start by loading the first 50 characters into thedeque, and then can adjust your loop like this:This makes the time to construct each window O(1) instead of O(k), where
kis the number of characters in the window. This could dramatically decrease the runtime, since the number of characters in your window is pretty large.Hope this helps!