The Archive Base

Editorial Team
Asked: May 28, 2026 (2026-05-28T20:02:58+00:00)

I have an HBase schema-design related question. The problem is fairly simple – I am storing “notifications” in HBase, each of which has a status (“new”, “seen”, or “read”). Here are the APIs I need to provide:

  • Get all notifications for a user
  • Get all “new” notifications for a user
  • Get the count of all “new” notifications for a user
  • Update status for a notification
  • Update status for all of a user’s notifications
  • Get all “new” notifications across the database
  • Notifications should be scannable in reverse chronological order and allow pagination.

I have a few ideas, and I wanted to see if one of them is clearly best, or if I have missed a good strategy entirely. Common to all three, I think having one row per notification and having the user id in the rowkey is the way to go. To get reverse chronological ordering for pagination, I need to have a reverse timestamp in there, too. I’d like to keep all notifs in one table (so I don’t have to merge sort for the “get all notifications for a user” call) and don’t want to write batch jobs for secondary index tables (since updates to the count and status should be in real time).
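To make the reverse-timestamp idea concrete, here is a minimal sketch in plain Python (not the HBase client API). It assumes a key layout of user id, an underscore, then `Long.MAX_VALUE - timestamp` packed as 8 big-endian bytes; the helper names `make_row_key` and `timestamp_of` are made up for illustration. Sorting the keys as raw bytes, which is how HBase orders rows, puts the newest notification first.

```python
import struct

MAX_LONG = 2**63 - 1  # same value as Java's Long.MAX_VALUE


def make_row_key(user_id: str, timestamp_ms: int) -> bytes:
    """Build a 'userId_reverseTimestamp' key as bytes (hypothetical layout)."""
    reverse_ts = MAX_LONG - timestamp_ms
    # Fixed-width big-endian encoding keeps byte order equal to numeric order
    # for non-negative values, so a smaller reverse_ts sorts first.
    return user_id.encode("utf-8") + b"_" + struct.pack(">q", reverse_ts)


def timestamp_of(row_key: bytes) -> int:
    """Recover the original timestamp from the last 8 bytes of the key."""
    return MAX_LONG - struct.unpack(">q", row_key[-8:])[0]


keys = [make_row_key("alice", ts) for ts in (1000, 3000, 2000)]
newest_first = sorted(keys)  # byte-wise sort, like HBase row ordering
print([timestamp_of(k) for k in newest_first])
```

One caveat with this layout: if user ids can themselves contain an underscore, a prefix scan for one user could pick up another user's rows, so a fixed-width or hashed user id may be a safer prefix.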

The simplest way to do it would be (1) make the row key “userId_reverseTimestamp” and filter on status on the client side. This seems naive, since we would be sending lots of unnecessary data over the network.

The next possibility is to (2) encode the status into the rowkey as well, i.e. “userId_reverseTimestamp_status”, and then do rowkey regex filtering on the scans. The first issue I see is needing to delete a row and copy the notification data to a new row when the status changes (which, presumably, should happen exactly twice per notification). Also, since the status is the last part of the rowkey, for each user we will be scanning lots of extra rows. Is this a big performance hit? Finally, in order to change the status, I will need to know what the previous status was (to build the row key), or else I will need to do another scan.

The last idea I had is to (3) have two column families, one for the static notif data, and one as a flag for the status, i.e. “s:read” or “s:new” with ‘s’ as the cf and the status as the qualifier. There would be exactly one per row, and I can do a MultipleColumnPrefixFilter or SkipFilter w/ ColumnPrefixFilter against that cf. Here too, I would have to delete and create columns on status change, but it should be much more lightweight than copying whole rows. My only concern is the warning in the HBase book that HBase doesn’t do well with “more than 2 or 3 column families” – perhaps if the system needs to be extended with more querying capabilities, the multi-cf strategy won’t scale.

So (1) seems like it would have too much network overhead, (2) seems like it would waste effort copying data, and (3) might cause issues with too many families. Between (2) and (3), which type of filter should give better performance? In both cases, the scan will have to look at each row for a user, most of which will presumably be read notifications. I think I’m leaning towards (3) – are there other options (or tweaks) that I have missed?

1 Answer

  1. Editorial Team
    Added an answer on May 28, 2026 at 8:02 pm (2026-05-28T20:02:59+00:00)

    You have put a lot of thought into this and I think all three are reasonable!

    You want to have your main key be the username concatenated with the time stamp since most of your queries are “by user”. This will help with easy pagination with a scan and can fetch user information pretty quickly.
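As a rough illustration of paginating with a scan, the sketch below models the table as a sorted Python list of string keys rather than using a real HBase client. The key format (zero-padded decimal reverse timestamp) and the helpers `row_key` and `scan_page` are assumptions made for readability. The idea is that each page starts just after the last key of the previous page and stops when the user prefix no longer matches.

```python
import bisect

MAX_LONG = 2**63 - 1  # same value as Java's Long.MAX_VALUE


def row_key(user: str, timestamp_ms: int) -> str:
    """Hypothetical key: user id plus a zero-padded reverse timestamp."""
    return f"{user}_{MAX_LONG - timestamp_ms:019d}"


def scan_page(sorted_keys, user_prefix, page_size, last_seen_key=None):
    """Return up to page_size keys for one user, resuming after last_seen_key."""
    if last_seen_key is None:
        start = bisect.bisect_left(sorted_keys, user_prefix)
    else:
        start = bisect.bisect_right(sorted_keys, last_seen_key)
    page = []
    for key in sorted_keys[start:]:
        # Stop once we leave this user's key range or the page is full.
        if not key.startswith(user_prefix) or len(page) == page_size:
            break
        page.append(key)
    return page


table = sorted(
    row_key(u, t) for u, t in [("alice", 1), ("alice", 2), ("alice", 3), ("bob", 5)]
)
first = scan_page(table, "alice_", 2)
second = scan_page(table, "alice_", 2, last_seen_key=first[-1])
```

In a real client this would roughly correspond to a scan with a start row and a page-size limit, with the caller remembering the last row key it saw between requests.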

    I think the crux of your problem is this changing status part. In general, something like a “read” -> “delete” -> “rewrite” introduces all kinds of concurrency issues. What happens if your task fails between? Do you have data in an invalid state? Will you drop a record?

    I suggest you instead treat the table as “append only”. Basically, do what you suggest for #3, but instead of removing the old flag, keep it there. If something has been read, it can have both “s:seen” and “s:read” there (if it is new, we can just assume the family is empty). You could also be fancy and put a timestamp in each flag to show when that event happened. You shouldn’t see much of a performance hit from doing this, and then you don’t have to worry about concurrency, since all operations are write-only and atomic.
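The append-only idea can be sketched with a small in-memory model, where a row is a Python dict keyed by (family, qualifier). This is only an illustration under assumed names (`mark`, `current_status`, the `d:body` column), not HBase client code. Status changes only ever add a flag, and the current status is derived from whichever flags are present.

```python
STATUS_ORDER = ["seen", "read"]  # "new" is implied by the absence of both flags


def mark(row: dict, status: str, timestamp_ms: int) -> None:
    """Record a status flag with the time it happened; nothing is deleted."""
    if status not in STATUS_ORDER:
        raise ValueError(f"unknown status: {status}")
    row[("s", status)] = timestamp_ms


def current_status(row: dict) -> str:
    """Return the most advanced status whose flag is present, else 'new'."""
    for status in reversed(STATUS_ORDER):
        if ("s", status) in row:
            return status
    return "new"


row = {("d", "body"): "You have a new follower"}  # hypothetical data column
status_before = current_status(row)
mark(row, "seen", 1000)
mark(row, "read", 2000)
mark(row, "seen", 3000)  # a replayed or late write does not regress the status
status_after = current_status(row)
```

One consequence of this model is that “new” means the absence of any flag, so a query for new notifications has to select rows that lack `s:seen` and `s:read` rather than rows that carry an `s:new` column.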

    I hope this is helpful. I’m not sure if I answered everything, since your question was so broad. Please follow up with additional questions and I’d love to elaborate or discuss something else.

