I’m trying to think how should I utilize threads in my program.
Right now I have a single threaded program that reads a single huge file. Very simple program, just reads line by line and collects some statistics about the words.
Now, I would like to use multi threads to make it faster. I’m not sure how to approach this.
One solution is to separate the data into X pieces in advance, then have X threads, each runs on one piece simultaneously, with one sync method to write the stats to memory. Is there a better approach? specifically, I would like to avoid separating the data in advance.
Thanks!
First of all, do some profiling to make sure that your process is actually compute-bound and not I/O bound. That is, that your statistics collection is slower than accessing the file. Otherwise, multi-threading will slow your program, not speed it, particularly if you are running on a single-core CPU (or an ancient JVM).
Also consider: if your file resides on a hard drive: how will you schedule reads? You risk adding hard drive seek delays otherwise, stalling all threads that have managed to finish their chunk of work, while one thread is asking the hard drive to seek to position 0x03457000…