I’m new to Hadoop. Recently I’m trying to process (only read) many small files on hdfs/hadoop. The average file size is about 1 kb and the number of files is more than 10M. The program must be written in C++ due to some limitations.
This is just a performance evaluation so I only use 5 machines for data nodes. Each of the data node have 5 data disks.
I wrote a small C++ project to read the files directly from hard disk(not from HDFS) to build the performance base line. The program will create 4 reading threads for each disk. The performance result is to have about 14MB/s per disk. Total throughput is about 14MB/s * 5 * 5 = 350MB/s (14MB/s * 5 disks * 5 machines ).
However, when this program ( still using C++, dynamically linked to libhdfs.so, creating 4*5*5=100 threads) reads files from hdfs cluster, the throughput is about only 55MB/s.
If this programming is triggered in mapreduce (hadoop streamming, 5 jobs, each have 20 threads, total number of threads is still 100), the throughput goes down to about 45MB/s. (I guess it’s slow down by some bookkeeping process).
I’m wondering what is the reasonable performance HDFS can prvoide. As you can see, comparing with native code, the data throughput is only about 1/7. Is it the problem of my config? Or HDFS limitation? Or Java limitation? What’s the best way for my scenario? Will sequence file help (much)? What is the reasonable throughput comparing to native IO read we can expect?
Here’s some of my config:
NameNode heap size 32G.
Job/Task node heap size 8G.
NameNode Handler Count: 128
DataNode Handler Count: 8
DataNode Maximum Number of Transfer Threads: 4096
1GBps ethernet.
Thanks.
Lets try to understand our limits and see when we hit them
a) We need namenode to give us information where files are sitting. I can assume that this number is around thousands per second. More information is here https://issues.apache.org/jira/browse/HADOOP-2149
Assuming this number to be 10000K we should be able to get information about 10 MB second for 1K files. (somehow you get more…). may
b) Overhead of HDFS. This overhead is mostly on latency not in throughput. HDFS can be tuned to serve a lot of files in parralel. HBase is doing it and we can take settings from HBase tuning guides. The question here is actually how much Datanodes you need
c) Your LAN. You move data from the network so you might hit 1GB ethernet throughput limit. (i think it what you got.
I also have to agree with Joe – that HDFS is not built for the scenario and you should use other technology (like HBase, if you like Hadoop stack) or compress files together – for example into sequence files.
Regarding reading bigger files from HDFS – run DFSIO benchmark and it will be your number.
In the same time – SSD on single host perfectly can be a solution also.