I have a list of files, which need to be read, in chunks, into a byte[], which is then passed to a hashing function. The tricky part is this: if I reach the end of a file, I need to continue reading the next file untill I fill the buffer, like so:
read 16 bits as an example:
File 1: 00101010
File 2: 01101010111111111
would need to be read as 0010101001101010
The point is: these files can be as large as several gigabytes, and I don’t want to completely load them into memory. Loading pieces into a buffer of, like, 30 MB would be perfectly fine.
I want to use threading, but would it be efficient to thread reading a file? I don’t know if Disc I/O is such a large bottleneck that this would be worth it. Would the hashing be sped up sufficently if I only thread that part, and lock on the read of each chunk? It is important the hashes get saved in the correct order.
The second thing I need to do, is to generate the MD5sum from each file as well. Is there anyway to do this more efficiently than doing this as a separate step?
(This question has some overlap with Is there a built-in way to handle multiple files as one stream?, but I thought this differed enough)
I am really stumped what approach to take, as I am fairly new to C#, as well as to threading. I already tried the approaches listed below, but they do not suffice for me.
As I am new to C# I value every kind of input on any aspect of my code.
This piece of code was threaded, but does not ‘append’ the streams, and as such generates invalid hashes:
public void DoHashing()
{
ParallelOptions options = new ParallelOptions();
options.MaxDegreeOfParallelism = numThreads;
options.CancellationToken = cancelToken.Token;
Parallel.ForEach(files, options, (string f, ParallelLoopState loopState) =>
{
options.CancellationToken.ThrowIfCancellationRequested();
using (BufferedStream fileStream = new BufferedStream(File.OpenRead(f), bufferSize))
{
// Get the MD5sum first:
using (MD5CryptoServiceProvider md5 = new MD5CryptoServiceProvider())
{
md5.Initialize();
md5Sums[f] = BitConverter.ToString(md5.ComputeHash(fileStream)).Replace("-", "");
}
//setup for reading:
byte[] buffer = new byte[(int)pieceLength];
//I don't know if the buffer will f*ck up the filelenghth
long remaining = (new FileInfo(f)).Length;
int done = 0;
while (remaining > 0)
{
while (done < pieceLength)
{
options.CancellationToken.ThrowIfCancellationRequested();
//either try to read the piecelength, or the remaining length of the file.
int toRead = (int)Math.Min(pieceLength - done, remaining);
int read = fileStream.Read(buffer, done, toRead);
//if read == 0, EOF reached
if (read == 0)
{
remaining = 0;
break;
}
//offsets
done += read;
remaining -= read;
}
// Hash the piece
using (SHA1CryptoServiceProvider sha1 = new SHA1CryptoServiceProvider())
{
sha1.Initialize();
byte[] hash = sha1.ComputeHash(buffer);
hashes[f].AddRange(hash);
}
done = 0;
buffer = new byte[(int)pieceLength];
}
}
}
);
}
This other piece of code isn’t threaded (and doesn’t calculate MD5):
void Hash()
{
//examples, these got handled by other methods
List<string> files = new List<string>();
files.Add("a.txt");
files.Add("b.blob");
//....
long totalFileLength;
int pieceLength = Math.Pow(2,20);
foreach (string file in files)
{
totalFileLength += (new FileInfo(file)).Length;
}
//Reading the file:
long remaining = totalFileLength;
byte[] buffer = new byte[Math.min(remaining, pieceSize)];
int index = 0;
FileStream fin = File.OpenRead(files[index]);
int done = 0;
int offset = 0;
while (remaining > 0)
{
while (done < pieceLength)
{
int toRead = (int)Math.Min(pieceLength - offset, remaining);
int read = fin.Read(buffer, done, toRead);
//if read == 0, EOF reached
if (read == 0)
{
index++;
//if last file:
if (index > files.Count)
{
remaining = 0;
break;
}
//get ready for next round:
offset = 0;
fin.OpenRead(files[index]);
}
done += read;
offset += read;
remaining -= read;
}
//Doing the piece hash:
HashPiece(buffer);
//reset for next piece:
done = 0;
byte[] buffer = new byte[Math.min(remaining, pieceSize)];
}
}
void HashPiece(byte[] piece)
{
using (SHA1CryptoServiceProvider sha1 = new SHA1CryptoServiceProvider())
{
sha1.Initialize();
//hashes is a List
hashes.Add(sha1.ComputeHash(piece));
}
}
Thank you very much for your time and effort.
I’m not looking for completely coded solutions, any pointer and idea where to go with this would be excellent.
Questions & remarks to yodaj007’s answer:
Why if (currentChunk.Length >= Constants.CHUNK_SIZE_IN_BYTES)? Why not ==? If the chunk is larger than the chunk size, my SHA1 hash gets a different value.
currentChunk.Sources.Add(new ChunkSource()
{
Filename = fi.FullName,
StartPosition = 0,
Length = (int)Math.Min(fi.Length, (long)needed)
});
Is a really interesting idea. Postpone reading untill you need it. Nice!
chunks.Add(currentChunk = new Chunk());
Why do this in the if (currentChunk != null) block and in the for (int i = 0; i < (fi.Length - offset) / Constants.CHUNK_SIZE_IN_BYTES; i++) block? Isn’t the first a bit redundant?
Here is my complete answer. I tested it on one of my anime folders. It processes 14 files totaling 3.64GiB in roughly 16 seconds. In my opinion, using any sort of parallelism is more trouble than it is worth here. You’re being limited by disc I/O, so multithreading will only get you so far. My solution can be easily parallelized though.
It starts by reading "chunk" source information: source file, offset, and length. All of this is gathered very quickly. From here, you can process the "chunks" using threading however you wish. Code follows:
I don’t guarantee everything is spotlessly free of bugs. But I did have fun working on it. Look at the output window in Visual Studio for the debug information. It looks like this:
Here is the parallel version. It’s basically the same really. Using parallelism = 3 cut the processing time down to 9 seconds.
EDIT
I found a bug, or what I think is a bug. Need to set the read offset to 0 if we’re starting a new file.
EDIT 2 based on feedback
This processes the hashes in a separate thread. It’s necessary to throttle the I/O. I was running into
OutOfMemoryExceptionwithout doing so. It doesn’t really perform that much better, though. Beyond this… I’m not sure how it can be improved any further. Perhaps by reusing the buffers, maybe.