Considering a really huge file(maybe more than 4GB) on disk,I want to scan through this file and calculate the times of a specific binary pattern occurs.
My thought is:
-
Use memory-mapped file(CreateFileMap
or boost mapped_file) to load the
file to the virtual memory. -
For each 100MB mapped-memory,create one thread to scan and calculate the result.
Is this feasible?Are there any better method to do so?
Update:
Memory-mapped file would be a good choice,for scaning through a 1.6GB file could be handled within 11s.
thanks.
Multithreading is only going to make this go slower unless you want to scan multiple files with each on a different hard drive. Otherwise you are just going to seek.
I wrote a simple test function using memory mapped files, with a single thread a 1.4 Gb file took about 20 seconds to scan. With two threads, each taking half the file (even 1MB chunks to one thread, odd to the other), it took more than 80 seconds.
That’s right, 2 threads was Four times slower than 1 thread!
Here’s the code I used, this is the single threaded version, I used a 1 byte scan pattern, so the code to locate matches that straddle map boundaries is untested.