The Archive Base

Editorial Team
Asked: June 16, 2026

I’m new to Hadoop. Recently I’m trying to process (only read) many small files

I’m new to Hadoop. I’m trying to process (read only) many small files on HDFS. The average file size is about 1 KB and there are more than 10 million files. The program must be written in C++ due to some limitations.

This is just a performance evaluation, so I only use 5 machines as data nodes. Each data node has 5 data disks.

I wrote a small C++ project that reads the files directly from the hard disks (not from HDFS) to establish a performance baseline. The program creates 4 reading threads per disk. The result is about 14 MB/s per disk, so the total throughput is about 14 MB/s * 5 disks * 5 machines = 350 MB/s.

However, when this program (still C++, dynamically linked to libhdfs.so, creating 4*5*5 = 100 threads) reads the files from the HDFS cluster, the throughput is only about 55 MB/s.

If the program is triggered from MapReduce (Hadoop Streaming, 5 jobs with 20 threads each, so still 100 threads in total), the throughput drops to about 45 MB/s. (I guess it is slowed down by some bookkeeping process.)

I’m wondering what performance HDFS can reasonably provide. As you can see, compared with native code, the data throughput is only about 1/7. Is this a problem with my config? An HDFS limitation? A Java limitation? What is the best approach for my scenario? Will sequence files help (much)? What throughput can we reasonably expect relative to native I/O reads?
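For reference, the figures above can be put side by side. This is only a rough sketch: it assumes decimal units (1 MB = 1000 KB) and an average file size of exactly 1 KB, and the helper names are made up for illustration.

```python
MB_IN_KB = 1000.0  # assumption: decimal units

def files_per_second(throughput_mb_s, avg_file_kb=1.0):
    """Number of files read per second at a given throughput."""
    return throughput_mb_s * MB_IN_KB / avg_file_kb

def slowdown(baseline_mb_s, measured_mb_s):
    """How many times slower the measured run is than the baseline."""
    return baseline_mb_s / measured_mb_s

# Local-disk baseline from the post: 14 MB/s * 5 disks * 5 machines.
baseline = 14.0 * 5 * 5

for label, mb_s in [("libhdfs", 55.0), ("hadoop streaming", 45.0)]:
    print(label, files_per_second(mb_s), round(slowdown(baseline, mb_s), 1))
```

At 1 KB per file, every MB/s of throughput corresponds to roughly a thousand file opens per second, so the per-file cost matters more than the raw byte rate.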

Here’s some of my config:

NameNode heap size 32G.

Job/Task node heap size 8G.

NameNode Handler Count: 128

DataNode Handler Count: 8

DataNode Maximum Number of Transfer Threads: 4096

1 Gbps Ethernet.

Thanks.

1 Answer

Editorial Team
Added an answer on June 16, 2026 at 7:46 am

Let's try to understand our limits and see when we hit them:
a) We need the NameNode to tell us where the files are located. I would assume it can serve on the order of thousands of such requests per second. More information is here: https://issues.apache.org/jira/browse/HADOOP-2149
Assuming this number is 10,000 per second, we should be able to locate about 10 MB per second of 1 KB files. (Somehow you get more…)
b) HDFS overhead. This overhead mostly affects latency, not throughput. HDFS can be tuned to serve a lot of files in parallel. HBase does this, and we can take settings from the HBase tuning guides. The real question here is how many DataNodes you need.
c) Your LAN. You move data over the network, so you might hit the 1 Gbps Ethernet throughput limit. (I think that is what you got.)
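The limits above can be turned into a quick back-of-envelope estimate. This is a rough sketch, not a measurement: the 10,000 lookups per second is the assumed figure from point a), one NameNode lookup per file is an assumption, the network bound assumes a single 1 Gbps link on the reading side, and all function names are made up for illustration.

```python
def namenode_bound_mb_s(lookups_per_s, avg_file_kb):
    """Throughput ceiling if every file costs one NameNode lookup."""
    return lookups_per_s * avg_file_kb / 1000.0  # KB/s -> MB/s (decimal)

def network_bound_mb_s(link_gbps):
    """Throughput ceiling of one network link."""
    return link_gbps * 1000.0 / 8.0  # Gbit/s -> MB/s

def bottleneck(bounds):
    """Return the name and value of the smallest ceiling."""
    name = min(bounds, key=bounds.get)
    return name, bounds[name]

bounds = {
    "namenode": namenode_bound_mb_s(10_000, 1.0),  # assumed lookup rate
    "network": network_bound_mb_s(1.0),            # assumed single 1 Gbps link
    "local disks": 14.0 * 5 * 5,                   # baseline from the question
}
print(bottleneck(bounds))
```

With these assumed inputs the NameNode ceiling is the lowest one. The measured 55 MB/s is above it, which suggests the real lookup rate on this cluster is higher than the assumed 10,000 per second.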

I also have to agree with Joe that HDFS is not built for this scenario, and you should use another technology (like HBase, if you like the Hadoop stack) or pack the files together, for example into sequence files.
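To show why packing helps, here is a minimal sketch of the idea: many small files become key/value records in one container that is read sequentially, so one open serves many records. This is not the Hadoop SequenceFile on-disk format, just a simple length-prefixed layout invented for illustration, and the function names are hypothetical.

```python
import io
import struct

HEADER = ">II"  # big-endian: key length, value length (illustrative layout)

def pack(files, out):
    """Append each (name, data) pair as: header, name bytes, data bytes."""
    for name, data in files:
        key = name.encode("utf-8")
        out.write(struct.pack(HEADER, len(key), len(data)))
        out.write(key)
        out.write(data)

def unpack(src):
    """Yield (name, data) pairs in the order they were written."""
    header_size = struct.calcsize(HEADER)
    while True:
        header = src.read(header_size)
        if len(header) < header_size:
            return  # end of container
        key_len, data_len = struct.unpack(HEADER, header)
        name = src.read(key_len).decode("utf-8")
        yield name, src.read(data_len)

# Two 1 KB "files" packed into one in-memory container and read back.
small_files = [("a.txt", b"x" * 1024), ("b.txt", b"y" * 1024)]
buf = io.BytesIO()
pack(small_files, buf)
buf.seek(0)
restored = list(unpack(buf))
```

With a real sequence file the effect is similar: the NameNode is asked about a few large files instead of millions of tiny ones, and the reads become sequential.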

Regarding reading bigger files from HDFS: run the DFSIO benchmark and that will be your number.
At the same time, an SSD on a single host can also be a perfectly good solution.
