I need to write a program in Java which will read a relatively large number (~50,000) of files in a directory tree, process the data, and output the processed data to a separate (flat) directory.
Currently I have something like this:
    private void crawlDirectoyAndProcessFiles(File directory) {
        for (File file : directory.listFiles()) {
            if (file.isDirectory()) {
                crawlDirectoyAndProcessFiles(file);
            } else {
                Data d = readFile(file);
                ProcessedData p = d.process();
                writeFile(p, file.getAbsolutePath(), outputDir);
            }
        }
    }
Suffice it to say that the bodies of those methods have been removed or trimmed down for ease of reading, but they all work fine. The whole process works, except that it is slow: the data is processed by a remote service, and each call takes between 5 and 15 seconds. Multiply that by 50,000…
I’ve never done anything multi-threaded before, but I figure I can get some pretty good speed increases if I do. Can anyone give me some pointers on how I can effectively parallelise this method?
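I would use a ThreadPoolExecutor to manage the threads. Instead of reading, processing, and writing each file inline, the crawl method wraps that work in a task and hands it to the executor, so the directory walk itself stays on the main thread.
You would obtain the executor from a fixed-size thread pool (for example via Executors.newFixedThreadPool(poolSize)), where poolSize is the maximum number of threads you want going at once. (It’s important to have a reasonable number here; 50,000 threads isn’t exactly a good idea. A reasonable number might be 8.) Note that after you’ve queued all the files, your main thread can wait until things are done by calling executor.shutdown() and then executor.awaitTermination.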
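To make that concrete, here is a minimal sketch of how the pieces might fit together. It is not your actual code: ParallelCrawler, processOne, run, and the processor function are hypothetical names, the UnaryOperator stands in for your remote call, and files are treated as UTF-8 text. It also assumes file names are unique across the tree, since the output directory is flat; if they are not, you would need to derive the output name from the full path instead.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.UnaryOperator;

// Hypothetical class: crawls a tree and processes each file on a thread pool.
class ParallelCrawler {
    private final ExecutorService executor;
    private final File outputDir;
    private final UnaryOperator<String> processor; // stand-in for the slow remote call

    ParallelCrawler(int poolSize, File outputDir, UnaryOperator<String> processor) {
        // At most poolSize tasks run at once; the rest wait in the pool's queue.
        this.executor = Executors.newFixedThreadPool(poolSize);
        this.outputDir = outputDir;
        this.processor = processor;
    }

    // The walk runs on the calling thread; only the per-file work is queued.
    private void crawl(File directory) {
        File[] children = directory.listFiles();
        if (children == null) {
            return; // not a directory, or it could not be read
        }
        for (File file : children) {
            if (file.isDirectory()) {
                crawl(file);
            } else {
                executor.execute(() -> processOne(file));
            }
        }
    }

    // Read, process, and write a single file; runs on a pool thread.
    private void processOne(File file) {
        try {
            String data = new String(Files.readAllBytes(file.toPath()), StandardCharsets.UTF_8);
            String processed = processor.apply(data);
            // Assumption: file names are unique, so the flat output cannot collide.
            File target = new File(outputDir, file.getName());
            Files.write(target.toPath(), processed.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            // Report the failure and carry on with the remaining files.
            System.err.println("Failed to process " + file + ": " + e);
        }
    }

    // Queues every file, then blocks until all tasks finish or the timeout expires.
    // Returns true if everything completed in time.
    boolean run(File root, long timeout, TimeUnit unit) throws InterruptedException {
        crawl(root);
        executor.shutdown(); // accept no new tasks; already queued tasks still run
        return executor.awaitTermination(timeout, unit);
    }
}
```

You would call it with something like new ParallelCrawler(8, outputDir, text -> callRemoteService(text)).run(inputDir, 1, TimeUnit.DAYS), where callRemoteService is a placeholder for whatever your process() step does. Since the work is mostly waiting on a remote service, it may be worth experimenting with the pool size, as long as the service can cope with that many concurrent requests.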