The Archive Base

Editorial Team
Asked: May 27, 2026 (2026-05-27T12:40:42+00:00)

I have approximately 2,500 tables involved in a calculation. In my dev environment I have very little data in these tables: 10 to 10,000 rows each, with most tables at the lower end of that range. My calculation scans all of these tables many times. Although the entire dataset would easily fit in memory, accessing it through HBase is incredibly slow and causes a huge amount of disk activity.

Do you think it would help to reduce the HDFS block size? My reasoning is that if each table occupies its own block, a huge amount of memory is wasted, which prevents the entire dataset from residing in RAM. A greatly reduced block size would allow the system to hold most, if not all, of the data in RAM. Currently the block size is 64MB.

The final system will run on a larger cluster with far more memory and nodes; this is purely to speed up my dev environment.

1 Answer

  1. Editorial Team
    Added an answer on May 27, 2026 at 12:40 pm

    HBase stores its data in HFiles (which are in turn stored as files in Hadoop).
    Here's an excerpt from the doc:

    Minimum block size. We recommend a setting of minimum block size
    between 8KB to 1MB for general usage. Larger block size is preferred
    if files are primarily for sequential access. However, it would lead
    to inefficient random access (because there are more data to
    decompress). Smaller blocks are good for random access, but require
    more memory to hold the block index, and may be slower to create
    (because we must flush the compressor stream at the conclusion of each
    data block, which leads to an FS I/O flush). Further, due to the
    internal caching in Compression codec, the smallest possible block
    size would be around 20KB-30KB.
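
To make the trade-off in the excerpt concrete, here is a rough Python sketch (not taken from the HBase docs). It assumes one index entry per data block and that a random read pulls in one whole block, and it ignores compression and caching. The `block_stats` helper and the 1 MiB file size are made up for illustration. Note that this HFile block size is a separate setting from the 64MB HDFS block size mentioned in the question.

```python
import math


def block_stats(data_bytes: int, block_bytes: int) -> dict:
    """Rough per-file numbers for a given HFile data block size."""
    # Even a tiny file occupies at least one data block.
    blocks = max(1, math.ceil(data_bytes / block_bytes))
    return {
        "blocks": blocks,  # assumed: one block index entry per data block
        # assumed: a random lookup reads one whole block (or the whole file if smaller)
        "bytes_read_per_lookup": min(block_bytes, data_bytes),
    }


# Hypothetical 1 MiB store file, compared at three block sizes.
file_size = 1024 * 1024
for size_kib in (8, 64, 1024):
    stats = block_stats(file_size, size_kib * 1024)
    print(
        f"{size_kib:>5} KiB blocks: {stats['blocks']:>4} index entries, "
        f"{stats['bytes_read_per_lookup']:>8} bytes read per random lookup"
    )
```

Smaller blocks mean more index entries to keep in memory but less data touched per random lookup; larger blocks reverse that, which is the tension the excerpt describes.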

    Regardless of the block size, you may want to set the tables' column families to in-memory (IN_MEMORY set to true), which makes HBase favor keeping their blocks in the block cache.
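
With around 2,500 tables, typing that change by hand is impractical, so one option is to generate the HBase shell statements. The Python sketch below only builds the text of `alter` commands; the table names, the column family name `cf`, and the `alter_statements` helper are hypothetical, and the attribute syntax should be checked against the shell help of your HBase version. Depending on the version, a table may also need to be disabled before it can be altered.

```python
def alter_statements(tables, family="cf", block_size=None):
    """Build HBase shell `alter` lines that mark a column family IN_MEMORY.

    `family` is a placeholder column family name; `block_size` (bytes) is
    optional and, if given, also sets the HFile BLOCKSIZE attribute.
    """
    lines = []
    for table in tables:
        attrs = [f"NAME => '{family}'", "IN_MEMORY => 'true'"]
        if block_size is not None:
            attrs.append(f"BLOCKSIZE => '{block_size}'")
        body = ", ".join(attrs)
        lines.append(f"alter '{table}', {{{body}}}")
    return lines


# Hypothetical table names; replace with the real list of tables.
tables = [f"calc_table_{i:04d}" for i in range(1, 4)]
for line in alter_statements(tables, block_size=16 * 1024):
    print(line)
```

The printed lines can be saved to a file and run through the HBase shell after reviewing them.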

    Lastly, your situation seems more appropriate for a cache like Redis or memcached than for HBase, but maybe I don't have enough context.
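
As an illustration of the caching idea, here is a minimal Python sketch that uses a plain in-process dictionary instead of Redis or memcached, so it is a stand-in for the pattern rather than for either product. `TableCache` and `fake_loader` are hypothetical names, and `fake_loader` only simulates the slow full-table scan.

```python
from typing import Callable, Dict, List, Tuple

Row = Tuple[str, dict]


class TableCache:
    """Load each table from the backing store once, then serve scans from RAM."""

    def __init__(self, loader: Callable[[str], List[Row]]):
        self._loader = loader  # hypothetical: performs the slow full-table scan
        self._tables: Dict[str, List[Row]] = {}
        self.loads = 0  # counts how often the slow path was taken

    def scan(self, table: str) -> List[Row]:
        if table not in self._tables:
            self._tables[table] = self._loader(table)
            self.loads += 1
        return self._tables[table]


def fake_loader(table: str) -> List[Row]:
    # Stand-in for an HBase scan; returns a few synthetic rows.
    return [(f"{table}-row{i}", {"value": i}) for i in range(3)]


cache = TableCache(fake_loader)
for _ in range(100):  # the calculation scans the same table many times
    rows = cache.scan("calc_table_0001")
print(cache.loads)  # 1: only the first scan reached the loader
```

This only helps if the tables do not change during the calculation; otherwise some invalidation scheme would be needed.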

© 2021 The Archive Base. All Rights Reserved