I am working on a large scale dataset and after building a model, I use multithreading (whole project in Java) as follows:
OutputStream out = new BufferedOutputStream(new FileOutputStream(outFile));
int i=0;
Collection<Track1Callable> callables = new ArrayList<Track1Callable>();
// For each entry in the test file, do watever needs to be done.
// Track1Callable actually processes that entry and returns a double value.
for (Pair<PreferenceArray, long[]> tests : new DataFileIterable(
KDDCupDataModel.getTestFile(dataFileDirectory))) {
PreferenceArray userTest = tests.getFirst();
callables.add(new Track1Callable(recommender, userTest));
i++;
}
ExecutorService executor = Executors.newFixedThreadPool(cores); //24 cores
List<Future<byte[]>> results = executor.invokeAll(callables);
executor.shutdown();
for (Future<byte[]> result : results) {
for (byte estimate : result.get()) {
out.write(estimate);
}
}
out.flush();
out.close();
When I receive the result from each callable, output it to a file. Does this output in the exact order as the list of initial Callables was made? In spite of some completing before others? Seems it should but not sure.
Also, I expect a total of 6.2 million bytes to be written to the outfile. But I get an additional 2000 bytes (Yeah for free). That messes up my submission and I think it is because of some concurrency issues. I tested this on small dataset and it seems to work fine there (264 bytes expected and received).
Anyhing wrong I am doing with the Executor framework or Futures?
Q: Does the order is the same as the one specified for the tasks? Yes.
From the API:
As for the “extra” bytes: have you tried doing all of this in sequential order (i.e., without using an executor) and checking if you obtain different results? It seems that your problem is outside the code provided (and probably is not due to concurrency).