I have always read that Cassandra is good if your application changes frequently and

Asked: June 15, 2026

I have always read that Cassandra is good if your application changes frequently and features are added frequently.

That makes sense: since you don’t have a fixed schema, you can add columns to rows to suit your needs, instead of running an ALTER TABLE query which may freeze your database for hours on very large tables.

However, I have a hypothetical problem which I’m not able to solve.
Let’s say I have:

CREATE COLUMN FAMILY Students
    with comparator='CompositeType(UTF8Type,UTF8Type)'
    and key_validation_class=UUIDType;

Each student has some generic columns (you know, meta:username, meta:password, meta:surname, etc.), plus each student may follow N courses. This N-N relationship is resolved using denormalization, adding N columns to each Student (course:ID1, course:ID2).

On the other side, I may have a Courses CF, where each row contains the UUIDs of all the students following that course.

So I can ask “which courses are followed by XXX” and “which students follow course YYY”.

The problem is: what if I didn’t create the second column family? Maybe at the time when the application was built, getting the students following a specific course wasn’t a requirement.

This is a simple example, but I believe it’s quite common. “With Cassandra you plan CFs in terms of queries instead of relationships”. I need that query now, while at first it wasn’t needed.

Given a table of students with thousands of entries, how would you fill the Courses CF? Is this a job for Hadoop, Pig or Hive? (I have never touched any of those, I’m just guessing.)

1 Answer

Editorial Team · Answer 1 · 2026-06-15T16:42:56+00:00

Pig (which uses the Hadoop integration) is actually perfect for this type of work, because it can not only read data from Cassandra but also write it back using CassandraStorage. It processes the rows in parallel, so the job runs with minimal time and overhead. The alternative is to write a program that does the extraction yourself and then writes the new CF.

Here is a Pig example that computes averages from a set of data in one CF and outputs them to another:

rows = LOAD 'cassandra://HadoopTest/TestInput' USING CassandraStorage() AS (key:bytearray,cols:bag{col:tuple(name:chararray,value)});
columns = FOREACH rows GENERATE flatten(cols) AS (name,value);
grouped = GROUP columns BY name;
vals = FOREACH grouped GENERATE group, columns.value AS values;
avgs = FOREACH vals GENERATE group, 'Pig_Average' AS name, (long)SUM(values.value)/COUNT(values.value) AS average;    
cass_group = GROUP avgs BY group;   
cass_out = FOREACH cass_group GENERATE group, avgs.(name, average);
STORE cass_out INTO 'cassandra://HadoopTest/TestOutput' USING CassandraStorage();
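
For the students/courses case in the question, the transformation itself is just an inversion of the course:ID columns. As a rough sketch of the "write something yourself" alternative, here is that logic in plain Python over in-memory dictionaries. Everything in it is an assumption for illustration: the function name build_courses_index, the ("student", key) column naming for the Courses rows, and the sample data are hypothetical, and real code would read from and write to Cassandra through a client library instead of using dicts.

```python
from collections import defaultdict


def build_courses_index(students):
    """Invert student rows into course rows.

    `students` maps a student key to its columns. Each column name is a
    (prefix, identifier) pair, mirroring the two-part composite comparator
    from the question, e.g. ("meta", "username") or ("course", "ID1").
    """
    courses = defaultdict(dict)
    for student_key, columns in students.items():
        for (prefix, identifier) in columns:
            if prefix == "course":
                # One (empty-valued) column per following student in the course row.
                courses[identifier][("student", student_key)] = ""
    return dict(courses)


# Hypothetical sample rows; real keys would be UUIDs.
students = {
    "uuid-1": {("meta", "username"): "alice", ("course", "ID1"): "", ("course", "ID2"): ""},
    "uuid-2": {("meta", "username"): "bob", ("course", "ID2"): ""},
}

courses = build_courses_index(students)
for course_id in sorted(courses):
    print(course_id, sorted(key for (_prefix, key) in courses[course_id]))
```

A single pass like this is probably fine for thousands of students. The Pig job does the same grouping, only distributed across the cluster, which starts to matter once the Students CF no longer fits comfortably on one machine.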
