I remember reading somewhere that Hadoop’s performance deteriorates significantly if the machines it runs on are very different from one another, but I can’t seem to find that comment anymore. I am considering running a Hadoop cluster on an array of VMs that is not directly managed by my group, and I need to know if this is a requirement that I should put in my request.
So, should I insist on all of my machines having identical hardware, or is it okay to run on machines with different hardware configurations?
Thanks.
The following papers describe how a heterogeneous cluster affects the performance of Hadoop MapReduce:
The following references have more details:
They also describe ways to improve performance on a heterogeneous cluster or to avoid the performance penalty.
Homogeneous machines are the safer choice for a cluster, but if your machines do not differ wildly in specifications and performance, you should go ahead and build your cluster.
For production systems, you should ask for homogeneous machines. For development, performance is not critical.
However, you should be able to benchmark your Hadoop cluster once you have built it.
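One rough way to judge whether your VMs differ "wildly" is to run the same benchmark workload on each node and compare the timings. The sketch below is only an illustration in Python: the function name `heterogeneity_report`, the node names, and the sample runtimes are all hypothetical, and it assumes you have already collected one runtime per node yourself. It does not call any Hadoop API.

```python
from statistics import mean


def heterogeneity_report(runtimes_s):
    """Summarise how uneven per-node benchmark runtimes are.

    runtimes_s maps a node name to the number of seconds that node took
    to run the same benchmark workload (measurements you collect yourself;
    nothing here talks to Hadoop).
    """
    if not runtimes_s:
        raise ValueError("need at least one node measurement")

    fastest = min(runtimes_s, key=runtimes_s.get)
    slowest = max(runtimes_s, key=runtimes_s.get)

    return {
        "fastest": fastest,
        "slowest": slowest,
        # How many times longer the slowest node took than the fastest one.
        "slowdown_ratio": runtimes_s[slowest] / runtimes_s[fastest],
        "mean_runtime_s": mean(runtimes_s.values()),
    }


if __name__ == "__main__":
    # Hypothetical sample timings, in seconds, for three VMs.
    sample = {"vm-a": 120.0, "vm-b": 130.0, "vm-c": 240.0}
    print(heterogeneity_report(sample))
```

A ratio close to 1 suggests the nodes are similar enough not to worry about; a large ratio points at a node that is likely to become a straggler. What counts as "large" depends on your workload, so treat any threshold as a judgment call.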