I’m writing an algorithm in C++ that scans a file with a “sliding window,” meaning it will scan bytes 0 to n, do something, then scan bytes 1 to n+1, do something, and so forth, until the end is reached.
My first algorithm was to read the first n bytes, do something, dump one byte, read a new byte, and repeat. This was very slow because to “ReadFile” from HDD one byte at a time was inefficient. (About 100kB/s)
My second algorithm involves reading a chunk of the file (perhaps n*1000 bytes, meaning the whole file if it’s not too large) into a buffer and reading individual bytes off the buffer. Now I get about 10MB/s (decent SSD + Core i5, 1.6GHz laptop).
My question: Do you have suggestions for even faster models?
edit: My big buffer (relative to the window size) is implemented as follows:
– for a rolling window of 5kB, the buffer is initialized to 5MB
– read the first 5MB of the file into the buffer
– the window pointer starts at the beginning of the buffer
– upon shifting, the window pointer is incremented
– when the window pointer nears the end of the 5MB buffer, (say at 4.99MB), copy the remaining 0.01MB to the beginning of the buffer, reset the window pointer to the beginning, and read an additional 4.99MB into the buffer.
– repeat
edit 2 – the actual implementation (removed)
Thank you all for many insightful response. It was hard to select a “best answer”; they were all excellent and helped with my coding.
I use a sliding window in one of my apps (actually, several layers of sliding windows working on top of each other, but that is outside the scope of this discussion). The window uses a memory-mapped file view via
CreateFileMapping()andMapViewOfFile(), then I have an an abstraction layer on top of that. I ask the abstraction layer for any range of bytes I need, and it ensures that the file mapping and file view are adjusted accordingly so those bytes are in memory. Every time a new range of bytes is requested, the file view is adjusted only if needed.The file view is positioned and sized on page boundaries that are even multiples of the system granularity as reported by
GetSystemInfo(). Just because a scan reaches the end of a given byte range does not necessarily mean it has reached the end of a page boundary yet, so the next scan may not need to alter the file view at all, the next bytes are already in memory. If the first requested byte of a range exceeds the right-hand boundary of a mapped page, the left edge of the file view is adjusted to the left-hand boundary of the requested page and any pages to the left are unmapped. If the last requested byte in the range exceeds the right-hand boundary of the right-most mapped page, a new page is mapped and added to the file view.It sounds more complex than it really is to implement once you get into the coding of it:
Creating a View Within a File
It sounds like you are scanning bytes in fixed-sized blocks, so this approach is very fast and very efficient for that. Based on this technique, I can sequentially scan multi-GIGBYTE files from start to end fairly quickly, usually a minute or less on my slowest machine. If your files are smaller then the system granularity, or even just a few megabytes, you will hardly notice any time elapsed at all (unless your scans themselves are slow).
Update: here is a simplified variation of what I use:
.
This code is designed to allow random jumping anywhere in the file at any data size. Since you are reading bytes sequentially, some of the logic can be simplified as needed.