I am seeing behavior with Amazon EC2 and Java that I am having trouble understanding. I have code that uses iText to split a single multi-page PDF file into many files (one file per page). I have about 1 million pages to extract (around 2500 source files), so I am running tests on EC2 to determine which setup will work best for this job.
I wrote a small application (link below) that can either process each source file sequentially, without starting any worker threads, or perform the same task with Java threads via Executors.
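To make the comparison concrete, here is a minimal sketch of the two modes. It is not the code from the gist: the class name `SplitBenchmark`, the method names, and the `splitFile` placeholder (which only counts calls instead of invoking iText) are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class SplitBenchmark {
    // Counts processed files so both modes can be checked for the same amount of work.
    static final AtomicInteger processed = new AtomicInteger();

    // Hypothetical stand-in for splitting one source PDF into pages.
    // The real job would call iText here; this placeholder only counts calls.
    static void splitFile(String sourceFile) {
        processed.incrementAndGet();
    }

    // Sequential mode: one file after another on the calling thread.
    static long runSequential(List<String> files) {
        long start = System.nanoTime();
        for (String file : files) {
            splitFile(file);
        }
        return System.nanoTime() - start;
    }

    // Threaded mode: one task per source file on a fixed-size pool.
    static long runThreaded(List<String> files, int threads) throws InterruptedException {
        long start = System.nanoTime();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (String file : files) {
            pool.submit(() -> splitFile(file));
        }
        pool.shutdown(); // accept no new tasks, let queued ones finish
        if (!pool.awaitTermination(1, TimeUnit.HOURS)) {
            throw new IllegalStateException("tasks did not finish in time");
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> files = Arrays.asList("a.pdf", "b.pdf", "c.pdf", "d.pdf");
        System.out.println("sequential ns: " + runSequential(files));
        System.out.println("threaded ns:   " + runThreaded(files, 4));
        System.out.println("files handled: " + processed.get());
    }
}
```

Timing both calls over the same list of files, on the same machine, is the kind of measurement being compared here; with the placeholder body the numbers themselves are meaningless.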
On my local MacBook Pro the threaded version runs around 30-40% faster than the sequential one, but on every EC2 instance I tried, the threaded version performed much worse than the sequential run.
I tried a small instance, a large one, and a high-CPU extra large. What I am trying to understand is what could cause such bad results for the threaded version: is it something in my code, is it I/O on EC2, or are threads simply a bad choice for this specific task? I would welcome any clue.
The relevant code is here: https://gist.github.com/1641643 (sorry for the “flag oriented programming”, it was just easier to switch between the tests). I tried different values for Executors.newFixedThreadPool (2, 4, 8, etc.) without any significant change in the results.
Wild guess, but if all the threads read from and write to a single hard disk, the disk is forced to constantly jump between the locations of the different reads and writes. In the single-threaded approach, by contrast, the thread can read the whole input file at once and write the result at once.
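If that guess is right, one way to test it is to keep the worker threads but let only one of them touch the disk at a time: each worker reads its whole input into memory under a shared lock, does the CPU work without the lock, then writes the result under the lock again. The following is only a sketch under that assumption. `SerializedDiskIo`, `DISK_LOCK`, `handle`, `runAll`, and the byte-reversing `process` method (a stand-in for the iText work) are hypothetical names, and nothing here shows that this actually helps on EC2.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class SerializedDiskIo {
    // Single lock shared by all workers: only one thread does disk I/O at a time.
    private static final Object DISK_LOCK = new Object();

    // Hypothetical stand-in for the CPU-bound work (reverses the bytes).
    static byte[] process(byte[] input) {
        byte[] out = input.clone();
        for (int i = 0, j = out.length - 1; i < j; i++, j--) {
            byte tmp = out[i];
            out[i] = out[j];
            out[j] = tmp;
        }
        return out;
    }

    static void handle(Path in, Path out) throws IOException {
        byte[] data;
        synchronized (DISK_LOCK) {
            data = Files.readAllBytes(in); // whole input in one read
        }
        byte[] result = process(data);     // in memory, runs in parallel
        synchronized (DISK_LOCK) {
            Files.write(out, result);      // whole output in one write
        }
    }

    static void runAll(List<Path> inputs, Path outDir, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Void>> futures = new ArrayList<>();
            for (Path in : inputs) {
                Path out = outDir.resolve(in.getFileName().toString() + ".out");
                futures.add(pool.submit((Callable<Void>) () -> {
                    handle(in, out);
                    return null;
                }));
            }
            for (Future<Void> f : futures) {
                f.get(); // rethrows any worker failure
            }
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        Path dir = Files.createTempDirectory("split-demo");
        List<Path> inputs = new ArrayList<>();
        for (int i = 0; i < 4; i++) {
            inputs.add(Files.write(dir.resolve("in" + i + ".bin"), new byte[] {1, 2, 3}));
        }
        runAll(inputs, dir, 4);
        System.out.println("wrote outputs to " + dir);
    }
}
```

If the threaded run is still slower than the sequential one with disk access serialized like this, disk contention is probably not the main cause and the search should move elsewhere.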