We’re strategizing on how to analyze user interest (clicks, likes, etc) on 1M+ items

Question

Asked: June 9, 2026 (2026-06-09T07:21:15+00:00)

We’re strategizing on how to analyze user “interest” (clicks, likes, etc) on 1M+ items on our site to generate a “similar items” list.

In order to process a large amount of raw data we’re learning about Hadoop, Hive, and related projects.

My question is regarding this concern: Hadoop/Hive and the like seem to be geared more towards data dumps, followed by processing cycles. Presumably the end of the processing cycle is something to the extent of an indexed graph of links between related items.

If I’m on track so far, how is data typically processed in these scenarios? For example:

Is the raw user data re-analyzed at intervals to re-build an indexed graph of links?
Do we stream data as it comes in, analyze it and update the data store?
As the resultant data from the analysis changes, are we typically updating it piece by piece, or re-processing in bulk?
Is this use case better addressed by Cassandra than Hive/HDFS?

I’m looking to better understand the common approach to this kind of big data processing.

1 Answer

Editorial Team · Answer 1 · 2026-06-09T07:21:18+00:00

I think this is a good use case for the Hadoop family of tools.
HDFS and Flume look like obvious choices. I would look into either HBase or Hive, depending on what kinds of analysis you are interested in and how flexible you need to be in organizing and querying the data.

Is the raw user data re-analyzed at intervals to re-build an indexed graph of links?

Answer: Hadoop is very good at this kind of periodic batch re-analysis. I would use HBase for this, but there are other choices.
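To make the periodic rebuild concrete, here is a minimal sketch of recomputing a "similar items" index from a full dump of interaction events, written in the map/reduce style a Hadoop job would follow. It is plain Python, not Hadoop or HBase API code. The `(user_id, item_id)` event format, the function name `rebuild_similar_items`, and the use of simple co-occurrence counts as the similarity measure are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def rebuild_similar_items(events, top_n=3):
    """Rebuild the whole item -> similar-items index from a raw event dump.

    events: iterable of (user_id, item_id) pairs (hypothetical format).
    """
    # "Map" phase: group the items each user interacted with.
    items_by_user = defaultdict(set)
    for user_id, item_id in events:
        items_by_user[user_id].add(item_id)

    # "Reduce" phase: count how often each pair of items shares a user.
    co_counts = defaultdict(lambda: defaultdict(int))
    for items in items_by_user.values():
        for a, b in combinations(sorted(items), 2):
            co_counts[a][b] += 1
            co_counts[b][a] += 1

    # Keep the top_n most co-occurring neighbours per item
    # (ties broken by item id so the result is deterministic).
    index = {}
    for item, neighbours in co_counts.items():
        ranked = sorted(neighbours.items(), key=lambda kv: (-kv[1], kv[0]))
        index[item] = [other for other, _ in ranked[:top_n]]
    return index


# Made-up sample data: three users, three items.
events = [
    ("u1", "A"), ("u1", "B"),
    ("u2", "A"), ("u2", "B"), ("u2", "C"),
    ("u3", "B"), ("u3", "C"),
]
index = rebuild_similar_items(events)
print(index["A"])  # items most often seen together with "A"
```

In a real cluster the two phases would run as distributed jobs over files in HDFS, and the resulting index would be written to whatever store serves the site.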

Do we stream data as it comes in, analyze it and update the data store?

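Answer: Flume is good for the ingestion side of this: it moves streaming event data into a store such as HDFS or HBase as it arrives.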

As the resultant data from the analysis changes, are we typically updating it piece by piece, or re-processing in bulk?

Answer: You have options to do both. Bulk would probably be a MapReduce job on HDFS, whereas piece-by-piece updates could be managed through HBase column-family values or Hive rows. If you give more details, I could be more precise.
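For contrast with a bulk job, the piece-by-piece option can be sketched as a small in-memory structure that adjusts item co-occurrence counts each time a single event arrives. This is only an illustration: the class name `IncrementalCoCounts`, the `(user_id, item_id)` event shape, and the co-occurrence similarity measure are assumptions, and the Python dictionaries stand in for what would be rows and columns in a store such as HBase.

```python
from collections import defaultdict


class IncrementalCoCounts:
    """Maintain item co-occurrence counts one event at a time."""

    def __init__(self):
        # user_id -> set of items that user has interacted with so far
        self.items_by_user = defaultdict(set)
        # item_id -> {other_item_id: number of users who touched both}
        self.co_counts = defaultdict(lambda: defaultdict(int))

    def add_event(self, user_id, item_id):
        seen = self.items_by_user[user_id]
        if item_id in seen:
            return  # repeated interaction; counts are already up to date
        # Only the pairs involving the new item change.
        for other in seen:
            self.co_counts[item_id][other] += 1
            self.co_counts[other][item_id] += 1
        seen.add(item_id)

    def similar(self, item_id, top_n=3):
        neighbours = self.co_counts.get(item_id, {})
        ranked = sorted(neighbours.items(), key=lambda kv: (-kv[1], kv[0]))
        return [other for other, _ in ranked[:top_n]]


# Made-up sample stream: events arrive one by one.
store = IncrementalCoCounts()
for user_id, item_id in [("u1", "A"), ("u1", "B"), ("u2", "A"),
                         ("u2", "B"), ("u2", "C"), ("u3", "B"), ("u3", "C")]:
    store.add_event(user_id, item_id)
print(store.similar("A"))
```

Each event touches only the counters for the pairs it affects, which is what makes this style a fit for a random-access store. The trade-off is that any change to the similarity formula still requires a bulk recomputation from the raw data.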

Is this use case better addressed by Cassandra than Hive/HDFS?

Answer: Cassandra and HBase are both modeled on Google’s Bigtable (Cassandra also borrows its distribution design from Amazon’s Dynamo). I think the choice depends on
how you need to organize, access, analyze, and update the data. I can provide more guidance if needed.
HBase is usually better for semi-structured data with high read/write throughput.

HDFS is generally a good choice for flexible, scalable storage of data dumps, as you call them.
Flume is applicable for moving streaming data.

I would also consider looking into Titan, a graph database that can use HBase as its storage backend, if you are thinking in terms of a graph.

Hive is applicable if your data is table-oriented and you want to use SQL-like queries.
