The Archive Base

Editorial Team
Asked: June 5, 2026


We are looking at using HBase for real-time analytics.

Upstream of HBase, we will run a Hadoop MapReduce job over our log files, aggregate the data, and store the fine-grained aggregate results in HBase to enable real-time analytics and queries on the aggregated data. So the HBase tables will hold pre-aggregated data (by date).

My question is: how best to design the schema and primary (row) key for the HBase database to enable fast but flexible queries?

For example, assume that each log line we store has the following fields:

timestamp, client_ip, url, referrer, useragent

and say our MapReduce job produces three different kinds of output record, each of which we want to store in a separate “table” (HBase column family):

  • date, operating_system, browser
  • date, url, referrer
  • date, url, country

(Our MapReduce job derives the operating_system, browser and country fields from the useragent and client_ip data.)

My question is: how can we structure the HBase schema to allow fast, near-realtime and flexible lookups for any of these fields, or a combination? For instance, the user must be able to specify:

  • operating_system by date (“How many iPad users in this date range?”)
  • url by country and date (“How many users to this url from this country for the last month?”)

and basically any other custom query?

Should we use keys like this:

  • date_os_browser
  • date_url_referrer
  • date_url_country

and if so, can we fulfill the sort of queries specified above?

1 Answer

  1. Editorial Team
    Added an answer on June 5, 2026 at 12:39 am

    You’ve got the gist of it, yes. Both of your example queries filter by date, and that’s a natural “primary” dimension in this domain (event reporting).
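To make that concrete, here is a minimal Python sketch of how a `date_os_browser` key could answer the "iPad users in this date range" query. It does not talk to HBase: a sorted list stands in for the table, and the sample rows, the `scan` and `count_os` helpers, and the `YYYYMMDD` date format are all assumptions for illustration. The date is fixed-width so that string order matches chronological order.

```python
from bisect import bisect_left

# Hypothetical pre-aggregated rows: row key -> hit count.
# Assumed key layout: YYYYMMDD_os_browser, with a fixed-width date.
rows = {
    "20260601_ipad_safari": 120,
    "20260601_windows_chrome": 300,
    "20260602_ipad_chrome": 15,
    "20260602_ipad_safari": 140,
    "20260603_android_chrome": 210,
    "20260604_ipad_safari": 90,
}
sorted_keys = sorted(rows)  # stands in for HBase keeping rows sorted by key


def scan(start_row, stop_row):
    """Yield (key, count) for start_row <= key < stop_row, like a range scan."""
    lo = bisect_left(sorted_keys, start_row)
    hi = bisect_left(sorted_keys, stop_row)
    for key in sorted_keys[lo:hi]:
        yield key, rows[key]


def count_os(os_name, first_day, stop_day):
    """Sum counts for one operating system; stop_day is exclusive."""
    total = 0
    for key, count in scan(first_day, stop_day):
        _date, os_part, _browser = key.split("_")
        if os_part == os_name:  # filter applied only to rows inside the date range
            total += count
    return total


ipad_total = count_os("ipad", "20260601", "20260603")
print(ipad_total)
```

The date range narrows the scan, and the OS match is a filter over the rows inside that range. A query that does not constrain the date would have to read the whole table under this layout.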

    A common warning about starting your keys with a date is that it causes “hot spotting”. HBase stores rows sorted by key, so date ranges that are contiguous in time are also contiguous in the table and tend to land on the same server. If you’re always inserting and querying data that happened “now” (or “recently”), one server gets all the load while the others sit idle. That doesn’t sound like a huge concern on insert, since you’ll be batch loading exclusively, but it might be a problem on read: if all of your queries go to one of your 20 servers, you’ll effectively be at 5% capacity.
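The 5% figure can be illustrated with a small Python sketch. It assumes, purely for illustration, one year of date-prefixed keys split evenly into 20 key ranges with one range per server; `split_points` and `region_for` are hypothetical stand-ins for how key ranges map to servers, not HBase API calls.

```python
from bisect import bisect_right
from datetime import date, timedelta

NUM_REGIONS = 20  # the example cluster size, assuming one key range per server

# Assumption: one year of data, split evenly into 20 ranges by date.
first_day = date(2025, 6, 5)
days = [first_day + timedelta(days=i) for i in range(365)]
step = len(days) / NUM_REGIONS
# 19 split points give 20 ranges; range i starts at split_points[i - 1].
split_points = [days[int(i * step)].strftime("%Y%m%d") for i in range(1, NUM_REGIONS)]


def region_for(row_key):
    """Index of the key range that contains row_key."""
    return bisect_right(split_points, row_key)


# A "last 7 days" query only reads the newest keys.
recent_keys = [d.strftime("%Y%m%d") + "_ipad_safari" for d in days[-7:]]
regions_hit = {region_for(k) for k in recent_keys}
share = len(regions_hit) / NUM_REGIONS
print(regions_hit, share)
```

Under these assumptions every recent key falls in the last range, so a workload made up of "recent" queries uses one server out of twenty.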

    OpenTSDB gets around this by putting a 3-byte “metric id” in front of the timestamp in the row key, which works well to spray updates across the whole cluster. If you have a similar field, and you know most queries will include a filter on it, you could use that. Or you could prepend a hash of some higher-order part of the date (like “month”), and then at least your reads would be a little more spread out.
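A rough Python sketch of the month-hash idea follows. The bucket count, the SHA-256 based `month_prefix`, the `prefix_YYYYMMDD_os_browser` layout, and the `scan_ranges` helper are all assumptions made for illustration. Because the prefix is derived from the month, a reader can recompute it and issue one range scan per month instead of scanning every bucket.

```python
import hashlib
from datetime import date

NUM_BUCKETS = 20  # hypothetical: roughly one bucket per server


def month_prefix(yyyymm):
    """Stable two-digit bucket derived from the month string."""
    digest = hashlib.sha256(yyyymm.encode("ascii")).digest()
    return f"{digest[0] % NUM_BUCKETS:02d}"


def row_key(day, os_name, browser):
    """Build prefix_YYYYMMDD_os_browser (assumed layout)."""
    ymd = day.strftime("%Y%m%d")
    return f"{month_prefix(ymd[:6])}_{ymd}_{os_name}_{browser}"


def scan_ranges(first_day, last_day):
    """One (start_row, stop_row) pair per month touched; both dates inclusive."""
    ranges = []
    year, month = first_day.year, first_day.month
    while (year, month) <= (last_day.year, last_day.month):
        yyyymm = f"{year:04d}{month:02d}"
        start = max(first_day.strftime("%Y%m%d"), yyyymm + "01")
        stop = min(last_day.strftime("%Y%m%d"), yyyymm + "31")
        prefix = month_prefix(yyyymm)
        # "~" sorts after digits, letters and "_" in ASCII, so the exclusive
        # stop row still covers every key of the last day.
        ranges.append((f"{prefix}_{start}", f"{prefix}_{stop}~"))
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return ranges


ranges = scan_ranges(date(2026, 4, 20), date(2026, 6, 5))
for start_row, stop_row in ranges:
    print(start_row, stop_row)
```

The trade-off in this sketch is that all of one month's data still shares a prefix, so it only spreads load across months, and a query spanning several months becomes several scans whose results are merged on the client.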
