I’m quicksorting a very large amount of data, and for fun am trying to parallelize it to speed up the sort. However, in its current form, the multithreaded version is slower than the single-threaded version due to synchronization chokepoints.
Every time I spawn a thread, I get a lock on an int and increment it, and every time the thread finishes I again get a lock and decrement, in addition to checking if there are any threads still running (int > 0). If not, I wake up my main thread and do work with the sorted data.
I’m sure there’s a better way to do this. Not sure what it is though. Help is greatly appreciated.
EDIT:
I guess I didn’t provide enough info.
This is Java code on an octo-core Opteron. I can’t switch languages.
The amount I’m sorting fits into memory, and it already exists in memory at the point when quicksort is called, so there’s no reason to write it to disk only to read it back into memory.
By “get a lock” I mean having a synchronized block on the integer.
Without knowing more about the implementation, here are my suggestions and/or comments:
Limit the number of threads that can run at any given time. Perhaps 8 or 10 (the extra couple might give the scheduler more leeway, although it would be best to put one per core/hw thread). There isn’t really a point running more threads for “throughput” on a CPU-bound problem if the affinity doesn’t support it.
Don’t thread near the leaves!!! Only thread on the larger branches. There is no point spawning a thread to sort a relatively small number of items, and at this level there are many, many little branches! Threading would add much more relative overhead here. (This is similar to switching to a “simple sort” for the leaves.)
Make sure each thread can work in isolation: it should not stomp on another thread’s range while it works, so no locks are needed, just a wait for join. Divide and conquer.
Possibly look at performing a “breadth-first” approach to spawn threads.
Consider a mergesort over a quicksort (I am biased towards mergesort 🙂). Remember there are numerous different kinds of mergesorts, including bottom-up.
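To make the first three suggestions concrete, here is a minimal sketch of a quicksort that only spawns a thread for large partitions, caps the fan-out, and synchronizes with nothing but join. It is not the asker’s code. The class name JoinQuicksort, the depth cap MAX_DEPTH, and the threshold value are assumptions chosen for illustration, and the partition step is a plain Lomuto partition with a middle pivot.

```java
import java.util.Arrays;
import java.util.Random;

class JoinQuicksort {
    // Hypothetical tuning values; measure on the target machine.
    static final int MIN_PARALLEL = 10_000; // smaller ranges never get their own thread
    static final int MAX_DEPTH = 3;         // 2^3 = 8 branches sorted concurrently at most

    static void sort(int[] a) {
        sort(a, 0, a.length - 1, 0);
    }

    static void sort(int[] a, int lo, int hi, int depth) {
        if (lo >= hi) return;
        int p = partition(a, lo, hi);
        if (depth < MAX_DEPTH && hi - lo >= MIN_PARALLEL) {
            // The two halves are disjoint, so neither side needs a lock.
            Thread left = new Thread(() -> sort(a, lo, p - 1, depth + 1));
            left.start();
            sort(a, p + 1, hi, depth + 1); // the current thread keeps the right half
            try {
                left.join(); // the only synchronization point
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for a branch", e);
            }
        } else {
            // Near the leaves: plain recursion, no further threads.
            sort(a, lo, p - 1, MAX_DEPTH);
            sort(a, p + 1, hi, MAX_DEPTH);
        }
    }

    // Lomuto partition using the middle element as pivot.
    static int partition(int[] a, int lo, int hi) {
        swap(a, lo + (hi - lo) / 2, hi);
        int pivot = a[hi];
        int store = lo;
        for (int i = lo; i < hi; i++) {
            if (a[i] < pivot) {
                swap(a, i, store);
                store++;
            }
        }
        swap(a, store, hi);
        return store;
    }

    static void swap(int[] a, int i, int j) {
        int t = a[i];
        a[i] = a[j];
        a[j] = t;
    }

    public static void main(String[] args) {
        int[] data = new Random().ints(3_000_000).toArray();
        int[] expected = data.clone();
        Arrays.sort(expected);
        sort(data);
        System.out.println("sorted correctly: " + Arrays.equals(data, expected));
    }
}
```

Because each thread is joined by the thread that started it, there is no shared counter to lock. The cost is that a parent can sit idle in join while its child finishes, which a pool-based design can avoid.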
Edit (proof-of-concept):
I threw together this simple demonstration. On my Intel Core2 Duo @ 2 GHz I could get it to run in about 2/3 to 3/4 of the single-threaded time, which is definitely some improvement 🙂 (Settings: DATA_SIZE = 3000000, MAX_THREADS = 4, MIN_PARALLEL = 1000). This is with the basic in-place quicksort code ripped from Wikipedia that does not take advantage of any other basic optimizations.
The method in which it determines if a new thread can/should start is also very primitive — if no new thread is available it just chugs right along (because, you know, why wait?)
This code should also (hopefully) fan out breadth-wise with the threads. This may be less efficient for data locality than keeping it depth-wise, but the model seemed simple enough in my head.
An executor service is also used to simplify the design and to be able to reuse the same threads (vs. spawning new threads). MIN_PARALLEL can become quite small (say, about 20) before the executor overhead starts to show — the maximum number of threads and only-using-a-new-thread-if-possible likely keeps this in check as well.
I make absolutely no guarantees about the usefulness or correctness of this code but it “seems to work” here. Pay heed to the warning next to the ThreadPoolExecutor as it clearly shows I’m not entirely certain what is happening 🙂 I am fairly certain the design is somewhat flawed in under-utilizing the threads.
YMMV. Happy coding.
(Also, I’d like to know how this, or similar, code fares on your 8-core box. Wikipedia claims a linear speedup by number of CPUs is possible 🙂)
Edit (better numbers)
Removed use of futures which caused a minor “jam” and switched to a single final-wait-semaphore: less useless waiting. Now runs in just 55% of non-threaded time 🙂
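The following is a rough reconstruction of that idea, not the code the numbers above were measured on: a fixed-size executor, an atomic count of outstanding tasks instead of a synchronized int, and one semaphore that the main thread waits on. MAX_THREADS and MIN_PARALLEL reuse the settings quoted earlier. The class name PooledQuicksort and its helper methods are assumptions made for illustration. Unlike the approach described above, this sketch queues a large branch when no thread is free rather than sorting it inline.

```java
import java.util.Arrays;
import java.util.Random;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

class PooledQuicksort {
    static final int MAX_THREADS = 4;
    static final int MIN_PARALLEL = 1000;

    private final int[] a;
    private final ExecutorService pool = Executors.newFixedThreadPool(MAX_THREADS);
    private final AtomicInteger pending = new AtomicInteger(0); // tasks submitted but not finished
    private final Semaphore done = new Semaphore(0);            // released once, by the last task

    private PooledQuicksort(int[] a) {
        this.a = a;
    }

    static void sort(int[] a) {
        PooledQuicksort s = new PooledQuicksort(a);
        s.submit(0, a.length - 1);
        s.done.acquireUninterruptibly(); // the single final wait
        s.pool.shutdown();
    }

    private void submit(int lo, int hi) {
        pending.incrementAndGet(); // count the task before it can possibly finish
        pool.execute(() -> {
            try {
                sortRange(lo, hi);
            } finally {
                if (pending.decrementAndGet() == 0) {
                    done.release(); // no task is left that could submit more work
                }
            }
        });
    }

    private void sortRange(int lo, int hi) {
        while (lo < hi) {
            int p = partition(lo, hi);
            if (p - lo >= MIN_PARALLEL) {
                submit(lo, p - 1);    // large left half: hand it to the pool
            } else {
                sortRange(lo, p - 1); // small left half: sort inline
            }
            lo = p + 1;               // keep the right half on this thread
        }
    }

    // Lomuto partition using the middle element as pivot.
    private int partition(int lo, int hi) {
        swap(lo + (hi - lo) / 2, hi);
        int pivot = a[hi];
        int store = lo;
        for (int i = lo; i < hi; i++) {
            if (a[i] < pivot) {
                swap(i, store);
                store++;
            }
        }
        swap(store, hi);
        return store;
    }

    private void swap(int i, int j) {
        int t = a[i];
        a[i] = a[j];
        a[j] = t;
    }

    public static void main(String[] args) {
        int[] data = new Random().ints(3_000_000).toArray();
        int[] expected = data.clone();
        Arrays.sort(expected);
        sort(data);
        System.out.println("sorted correctly: " + Arrays.equals(data, expected));
    }
}
```

A task always increments the counter for its children before it decrements its own entry, so the count can only reach zero when every range has been sorted. No task ever blocks waiting on another, so the fixed pool cannot deadlock.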