I have approximately 2500 tables involved in a calculation. In my dev environment I


Asked: May 27, 2026


I have approximately 2500 tables involved in a calculation. In my dev environment I have very little data in these tables: 10 to 10,000 rows each, with most tables at the lower end of this range. My calculation will scan all these tables many times. Although the entire dataset would easily fit in memory, accessing it through HBase is incredibly slow, with a huge amount of disk activity.

Do you think it would help to reduce the HDFS block size? My reasoning is that if each table is in its own block, then a huge amount of memory would be wasted, preventing the entire dataset from residing in RAM. A greatly reduced block size would allow the system to hold most, if not all, of the data in RAM. Currently the block size is 64MB.

The final system will be used in a larger cluster with far more memory and nodes; this is purely to speed up my dev environment.


1 Answer

Editorial Team · Answer 1 · 2026-05-27T12:40:43+00:00

HBase stores its data in HFiles (which are in turn stored inside Hadoop files).
Here's an excerpt from the doc:

Minimum block size. We recommend a setting of minimum block size
between 8KB to 1MB for general usage. Larger block size is preferred
if files are primarily for sequential access. However, it would lead
to inefficient random access (because there are more data to
decompress). Smaller blocks are good for random access, but require
more memory to hold the block index, and may be slower to create
(because we must flush the compressor stream at the conclusion of each
data block, which leads to an FS I/O flush). Further, due to the
internal caching in Compression codec, the smallest possible block
size would be around 20KB-30KB.
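To get a feel for the trade-off in that excerpt, here is a rough back-of-the-envelope sketch. It only estimates how many data blocks (and so roughly how many block index entries) one table would need at a few block sizes. The 200 bytes per row is an assumed figure, not something from the question, and the estimate ignores key/value overhead, compression, and the fact that real blocks end on cell boundaries.

```python
import math


def hfile_block_estimate(rows, avg_row_bytes, block_size_bytes):
    """Roughly estimate data blocks (about one index entry each) for one table."""
    data_bytes = rows * avg_row_bytes
    # Even a tiny table occupies at least one block.
    blocks = max(1, math.ceil(data_bytes / block_size_bytes))
    return {"data_bytes": data_bytes, "blocks": blocks}


if __name__ == "__main__":
    # Hypothetical table at the top of the stated range: 10,000 rows,
    # with an assumed average of 200 bytes per row.
    for block_size in (8 * 1024, 64 * 1024, 1024 * 1024):
        estimate = hfile_block_estimate(10_000, 200, block_size)
        print(f"block size {block_size:>8} bytes -> ~{estimate['blocks']} blocks")
```

Under those assumptions even the largest table is only about 2MB of raw data, so at this scale the choice mostly changes how many blocks a scan has to touch and how many index entries are kept.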

Regardless of the block size, you may want to set the tables' column families to in-memory (IN_MEMORY set to true), which makes HBase favor keeping them in the block cache.
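With around 2500 tables, setting that flag by hand would be tedious, so one option is to generate the HBase shell `alter` statements and feed them to the shell. The sketch below only builds the command text. The table names and the column family name `cf` are placeholders I made up, and the optional block size is the per-family HFile block size, not the HDFS one.

```python
def in_memory_alter_commands(tables, family="cf", block_size=None):
    """Build HBase shell `alter` statements that mark a column family IN_MEMORY."""
    commands = []
    for table in tables:
        attrs = ["NAME => '" + family + "'", "IN_MEMORY => 'true'"]
        if block_size is not None:
            # Optional: also set the HFile block size for this family, in bytes.
            attrs.append("BLOCKSIZE => '" + str(block_size) + "'")
        commands.append("alter '" + table + "', {" + ", ".join(attrs) + "}")
    return commands


if __name__ == "__main__":
    # Hypothetical table names; substitute the real list of tables.
    for command in in_memory_alter_commands(["table_0001", "table_0002"]):
        print(command)
```

The printed lines can be saved to a file and run through `hbase shell`. Depending on the HBase version and configuration, a table may need to be disabled before it can be altered, so check that on the dev cluster first.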

Lastly, your situation seems more appropriate for a cache like Redis or memcached than for HBase, but maybe I don't have enough context.
