I almost do not know anything about HBase. Sorry for basic questions.
Imagine I have a table of 100 billion rows with 10 int, one datetime, and one string column.
- Does HBase allow querying this table and Group the result based on key (even a composite key)?
- If so, does it have to run a map/reduce job to it?
- How do you feed it the query?
- Can HBase in general perform real-time like queries on a table?
Data aggregation in HBase intersects with the “real time analytics” need. While HBase is not built for this type of functionality there is a lot of need for it. So the number of ways to do so is / will be developed.
1) Register HBase table as an external table in Hive and do aggregations. Data will be accessed via HBase API what is not that efficient. Configuring Hive with Hbase this is a discussion about how it can be done.
It is the most powerful way to group by HBase data. It does imply running MR jobs but by Hive, not by HBase.
2) You can write you own MR job working with HBase data sitting in HFiles in the HDFS. It will be most efficient way, but not simple and data you processed would be somewhat stale. It is most efficient since data will not be transferred via HBase API – instead it will be accesses right from HDFS in sequential manner.
3) Next version of HBase will contain coprocessors which would be able to aggregations inside specific regions. You can assume them to be a kind of stored procedures in the RDBMS world.
4) In memory, Inter-region MR job which will be parralelized in one node is also planned in the future HBase releases. It will enable somewhat more advanced analytical processing then coprocessors.