How efficient are open-source distributed computation frameworks like Hadoop? By efficiency, I mean the share of CPU cycles that can be used for the "actual job" in tasks that are mostly pure computation. In other words, how many CPU cycles go to overhead, or are wasted because cores sit idle? I'm not looking for specific numbers, just a rough picture. E.g. can I expect to use 90% of the cluster's CPU power? 99%? 99.9%?
To be more specific, let's say I want to calculate pi, and I have an algorithm X. When I run this on a single core in a tight loop, let's say I get some performance Y. If I do this calculation in a distributed fashion using e.g. Hadoop, how much performance degradation can I expect?
I understand this would depend on many factors, but what would be the rough magnitude? I’m thinking of a cluster with maybe 10 – 100 servers (80 – 800 CPU cores total), if that matters.
Thanks!
Technically, Hadoop has considerable overhead in several dimensions:
a) Per-task overhead, which can be estimated at 1 to 3 seconds.
b) HDFS data-reading overhead, due to passing data through sockets and CRC calculation. This is harder to estimate.
These overheads can be very significant if you have a lot of small tasks, and/or if your data processing is light.
At the same time, if you have big files (fewer tasks) and your data processing is heavy (say, a few MB/sec per core), then the Hadoop overhead can be neglected.
Bottom line: Hadoop overhead is variable and depends highly on the nature of the processing you are doing.
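To put rough numbers on the per-task overhead, here is a back-of-the-envelope sketch in Python. It is a simplification, not a measurement: it assumes a fixed 3-second overhead per task (the upper end of the estimate above), ignores HDFS read cost and idle task slots, and the function name `cpu_efficiency` is made up for illustration.

```python
def cpu_efficiency(compute_seconds, overhead_seconds):
    """Fraction of a task's run time spent on useful computation."""
    return compute_seconds / (compute_seconds + overhead_seconds)


# Assumption: fixed per-task overhead of 3 s (upper end of the 1-3 s estimate).
OVERHEAD_SECONDS = 3.0

# Compare tasks that do 1 s, 10 s, 1 min, 10 min and 1 h of real work.
for compute in (1, 10, 60, 600, 3600):
    eff = cpu_efficiency(compute, OVERHEAD_SECONDS)
    print(f"{compute:>5} s of work per task: {eff:.1%} of CPU time is useful")
```

Under these assumptions, a task that computes for only a second or two loses most of its time to overhead, a one-minute task is around 95% efficient, and a ten-minute task is above 99%. So for a pure-computation job like calculating pi, making each task run for minutes rather than seconds matters more than anything else.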