The Archive Base

Editorial Team
Asked: June 5, 2026

How does HDFS store data?

I want to store huge files in a compressed fashion.

E.g., I have a 1.5 GB file, with the default replication factor of 3.

It requires 1.5 GB × 3 = 4.5 GB of space.

I believe that currently no implicit compression of the data takes place.

Is there a technique to compress the file and store it in HDFS to save disk space?

1 Answer

  1. Editorial Team
    Added an answer on June 5, 2026 at 5:50 am

    HDFS stores every file as a sequence of ‘blocks’. The block size is configurable on a per-file basis, but has a cluster-wide default (64 MB in Hadoop 1.x, 128 MB in Hadoop 2.x and later; larger values such as 256 MB are also common).

    So given a 1.5 GB file and a block size of 128 MB, Hadoop would break the file up into 12 blocks (12 × 128 MB = 1.5 GB). Each block is also replicated a configurable number of times (3 by default).

    If your data compresses well (text files, for example), you can compress the files and store the compressed files in HDFS. The same rules apply as above, so if the 1.5 GB file compresses to 500 MB, it would be stored as 4 blocks, and with a replication factor of 3 it would occupy about 1.5 GB of disk instead of 4.5 GB.
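
    The arithmetic above can be checked with a small sketch. The function name and its defaults (128 MB blocks, replication factor 3) are assumptions chosen to match the numbers in this answer, not values read from a real cluster. It also assumes that a partially filled last block only occupies its actual length on disk.

```python
MB = 1024 * 1024
GB = 1024 * MB


def hdfs_footprint(file_size, block_size=128 * MB, replication=3):
    """Return (block_count, raw_bytes_on_disk) for a single file.

    Hypothetical helper for illustration only; sizes are in bytes.
    """
    # Integer ceiling division: a partial last block still counts as a block.
    blocks = (file_size + block_size - 1) // block_size
    # Assumption: the last block is stored at its real length, so raw usage
    # is simply the file size multiplied by the replication factor.
    raw_bytes = file_size * replication
    return blocks, raw_bytes


if __name__ == "__main__":
    for label, size in [("uncompressed", 3 * GB // 2), ("compressed", 500 * MB)]:
        blocks, raw = hdfs_footprint(size)
        print(f"{label}: {blocks} blocks, {raw / GB:.2f} GB raw")
```

    Changing `block_size` or `replication` shows how the block count and raw disk usage move independently: the block size affects only the number of blocks, while replication affects only the space used.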

    However, one thing to consider when using compression is whether the compression method supports splitting the file, that is, whether a reader can seek to an arbitrary position in the file and resynchronize with the compressed stream from there. Gzip, for example, does not support splitting; bzip2 does.

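    Even if the method doesn’t support splitting, Hadoop will still store the file as a number of blocks, but you’ll lose some of the benefit of ‘data locality’: a single map task has to read the whole file, and since its blocks will most probably be spread around your cluster, most of them have to be fetched over the network.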

    In your MapReduce code you rarely have to handle this yourself: Hadoop has a number of compression codecs installed by default and automatically recognizes certain file extensions (.gz for gzip files, for example), abstracting away most of the work of reading compressed input and writing compressed output.
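
    As a rough illustration of extension-based codec selection combined with the splitting caveat, the sketch below maps a file name to a codec class name. The table and the `describe_input` helper are hypothetical and only mimic the idea; they are not Hadoop's own lookup code, and the list of extensions is deliberately incomplete.

```python
import os

# Illustrative table only: file extension -> Hadoop codec class name.
CODEC_BY_EXTENSION = {
    ".gz": "org.apache.hadoop.io.compress.GzipCodec",
    ".bz2": "org.apache.hadoop.io.compress.BZip2Codec",
    ".deflate": "org.apache.hadoop.io.compress.DefaultCodec",
}

# Of the formats above, only bzip2 can be split across map tasks.
SPLITTABLE_EXTENSIONS = {".bz2"}


def describe_input(path):
    """Guess the codec from the extension and whether the file can be split."""
    ext = os.path.splitext(path)[1]
    codec = CODEC_BY_EXTENSION.get(ext)
    # Files with no recognized codec are treated as uncompressed, which
    # this sketch assumes to be splittable (true for plain text input).
    splittable = codec is None or ext in SPLITTABLE_EXTENSIONS
    return {"path": path, "codec": codec, "splittable": splittable}


if __name__ == "__main__":
    for name in ["file.txt", "file.txt.gz", "file.txt.bz2"]:
        print(describe_input(name))
```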

    Hope this makes sense

    EDIT: Some additional info in response to comments:

    When writing to HDFS as output from a MapReduce job, see the API for FileOutputFormat, in particular the following methods:

    • setCompressOutput(Job, boolean)
    • setOutputCompressorClass(Job, Class)
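
    The same two settings can also be expressed as job configuration properties. The sketch below builds them as `-D key=value` arguments. The property names are the ones used by Hadoop 2 and later as far as I know, so treat them as an assumption and check them against your Hadoop version. Both helper functions are hypothetical, and `-D` options are only picked up by jobs that parse Hadoop's generic options.

```python
def output_compression_properties(
    codec_class="org.apache.hadoop.io.compress.GzipCodec",
):
    """Return the config properties that switch on compressed job output.

    Assumption: Hadoop 2+ property names; older releases used other keys.
    """
    return {
        # Counterpart of setCompressOutput(job, true)
        "mapreduce.output.fileoutputformat.compress": "true",
        # Counterpart of setOutputCompressorClass(job, codec)
        "mapreduce.output.fileoutputformat.compress.codec": codec_class,
    }


def as_generic_options(properties):
    """Flatten a property dict into ['-D', 'key=value', ...] arguments."""
    args = []
    for key, value in properties.items():
        args.extend(["-D", f"{key}={value}"])
    return args


if __name__ == "__main__":
    print(" ".join(as_generic_options(output_compression_properties())))
```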

    When uploading files to HDFS, yes, they should be pre-compressed and carry the file extension associated with that compression type. Out of the box, Hadoop supports gzip with the .gz extension, so file.txt.gz would denote a gzipped file.
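
    A minimal sketch of the pre-compression step, using only the Python standard library. The helper name `gzip_for_hdfs` is made up for this example; it writes a gzipped copy next to the original and appends .gz to the full file name, so file.txt becomes file.txt.gz.

```python
import gzip
import shutil
from pathlib import Path


def gzip_for_hdfs(src):
    """Write a gzipped copy of src next to it and return the new path.

    Hypothetical helper: the original file is left untouched.
    """
    src = Path(src)
    # Append ".gz" to the whole name so the original extension is preserved.
    dst = src.with_name(src.name + ".gz")
    with open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
        # Stream in chunks so huge files do not have to fit in memory.
        shutil.copyfileobj(fin, fout)
    return dst
```

    The resulting file can then be uploaded with the usual HDFS shell, for example `hdfs dfs -put file.txt.gz <target directory>`. Keep in mind that gzip output is not splittable, so for very large inputs a splittable format such as bzip2 may be the better trade-off.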
