I have an application that writes several billion records into Cassandra and removes duplicates

Question

Asked: May 21, 2026 (2026-05-21T15:19:04+00:00)

I have an application that writes several billion records into Cassandra and removes duplicates by key. It then groups them by other fields, such as title, in successive phases so that further processing can be done on groups of similar records. The application is distributed over a cluster of machines because I need it to finish in a reasonable time (hours, not weeks).

One phase of the application works by writing the records into Cassandra using the Hector client and storing them in a column family, with each record's primary key as the Cassandra row key. The timestamp is set to the record's last update date so that I only get the latest record for each key.
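To illustrate the deduplication idea, here is a minimal sketch of "highest timestamp wins" using an in-memory map. It does not use Hector or Cassandra; the class `LatestWins` and its methods are hypothetical names made up for this example, and the tie handling is a simplification.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for a column family; not the Hector API.
class LatestWins {
    static final class Versioned {
        final String value;
        final long timestamp;
        Versioned(String value, long timestamp) {
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    private final Map<String, Versioned> rows = new HashMap<>();

    // Keep the incoming write only if its timestamp is newer than the stored one.
    // Simplification: on a tie this sketch keeps the existing value.
    void write(String key, String value, long lastUpdateMillis) {
        rows.merge(key, new Versioned(value, lastUpdateMillis),
                (old, incoming) -> incoming.timestamp > old.timestamp ? incoming : old);
    }

    String read(String key) {
        Versioned v = rows.get(key);
        return v == null ? null : v.value;
    }

    public static void main(String[] args) {
        LatestWins cf = new LatestWins();
        cf.write("rec-1", "newer title", 2000L);
        cf.write("rec-1", "older title", 1000L); // arrives later, but is older
        System.out.println(cf.read("rec-1"));    // prints "newer title"
    }
}
```

Because the write order no longer matters, duplicates of the same key can arrive from any machine in any order and the most recently updated record is the one that survives.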

Later phases need to read everything back out of Cassandra, perform some processing on the records, and add the records back to a different column family using various other keys, so that the records can be grouped.

I accomplished this batch reading by using Cassandra.Client.describe_ring() to figure out which machine in the ring is master for which TokenRange. I then compare the master for each TokenRange against the localhost to find out which token ranges are owned by the local machine (remote reads are too slow for this type of batch processing). Once I know which TokenRanges are on each machine locally I get evenly sized splits using Cassandra.Client.describe_splits().
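The selection of local ranges can be sketched roughly as follows. To keep it runnable without a cluster, this uses a hypothetical `Range` class in place of the Thrift TokenRange, and it assumes that the first endpoint listed for a range is the node I am calling its master; both are assumptions of the sketch, not something the Thrift API guarantees.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class LocalRanges {
    // Hypothetical stand-in for a ring entry: start token, end token, endpoints.
    static final class Range {
        final String start;
        final String end;
        final List<String> endpoints;
        Range(String start, String end, List<String> endpoints) {
            this.start = start;
            this.end = end;
            this.endpoints = endpoints;
        }
    }

    // Assumption: the first endpoint in the list is the range's "master".
    static List<Range> ownedBy(String localAddress, List<Range> ring) {
        List<Range> local = new ArrayList<>();
        for (Range r : ring) {
            if (!r.endpoints.isEmpty() && r.endpoints.get(0).equals(localAddress)) {
                local.add(r);
            }
        }
        return local;
    }

    public static void main(String[] args) {
        List<Range> ring = Arrays.asList(
                new Range("0", "100", Arrays.asList("10.0.0.1", "10.0.0.2")),
                new Range("100", "200", Arrays.asList("10.0.0.2", "10.0.0.3")),
                new Range("200", "0", Arrays.asList("10.0.0.3", "10.0.0.1")));
        for (Range r : ownedBy("10.0.0.2", ring)) {
            System.out.println(r.start + " : " + r.end); // prints "100 : 200"
        }
    }
}
```

In the real application the ring would come from describe_ring() and each local range would then be passed to describe_splits().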

Once I have a bunch of nice evenly sized splits that can be read from the local Cassandra instance I start reading them as fast as I can using Cassandra.Client.get_range_slices() with ConsistencyLevel.ONE so that it doesn’t need to do any remote reads. I fetch 100 rows at a time, sequentially through the whole TokenRange (I have tried various batch sizes and 100 seems to work best for this app).
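The paging loop looks roughly like the sketch below. `PageFetcher` is a hypothetical interface standing in for the get_range_slices() call, rows are reduced to their tokens, and the in-memory fetcher exists only so the example runs on its own. It assumes a normal split whose start token is smaller than its end token.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

class SplitPager {
    // Hypothetical abstraction over one range read of at most 'limit' rows
    // with tokens in (startExclusive, endInclusive], returned in token order.
    interface PageFetcher {
        List<BigInteger> fetch(BigInteger startExclusive, BigInteger endInclusive, int limit);
    }

    // In-memory fetcher used only to make the sketch runnable.
    static PageFetcher inMemory(NavigableSet<BigInteger> tokens) {
        return (start, end, limit) -> {
            List<BigInteger> page = new ArrayList<>();
            for (BigInteger t : tokens.subSet(start, false, end, true)) {
                if (page.size() == limit) break;
                page.add(t);
            }
            return page;
        };
    }

    // Reads one split sequentially, batchSize rows at a time; returns the row count.
    static long readSplit(PageFetcher fetcher, BigInteger start, BigInteger end, int batchSize) {
        long total = 0;
        BigInteger cursor = start;
        while (true) {
            List<BigInteger> page = fetcher.fetch(cursor, end, batchSize);
            total += page.size();
            if (page.size() < batchSize) return total;  // short page: split exhausted
            cursor = page.get(page.size() - 1);         // resume after the last row seen
            if (cursor.equals(end)) return total;       // reached the end of the split
        }
    }

    public static void main(String[] args) {
        NavigableSet<BigInteger> tokens = new TreeSet<>();
        for (int i = 1; i <= 250; i++) tokens.add(BigInteger.valueOf(i));
        long n = readSplit(inMemory(tokens), BigInteger.ZERO, BigInteger.valueOf(250), 100);
        System.out.println(n + " rows read"); // prints "250 rows read"
    }
}
```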

This all worked great on Cassandra 0.7.0 with a little bit of tuning to memory sizes and column family configs. I could read between 4000 and 5000 records per second in this way, and kept the local disks working about as hard as they could.

Here is an example of the splits and the speed I would see under Cassandra 0.7.0:

10/12/20 20:13:08 INFO m4.BulkCassandraReader: split - 20253030905057371310864605462970389448 : 21603066481002044331198075418409137847
10/12/20 20:13:08 INFO m4.BulkCassandraReader: split - 21603066481002044331198075418409137847 : 22954928635254859789637508509439425340
10/12/20 20:13:08 INFO m4.BulkCassandraReader: split - 22954928635254859789637508509439425340 : 24305566132297427526085826378091426496
10/12/20 20:13:08 INFO m4.BulkCassandraReader: split - 24305566132297427526085826378091426496 : 25656389102612459596423578948163378922
10/12/20 20:13:08 INFO m4.BulkCassandraReader: split - 25656389102612459596423578948163378922 : 27005014429213692076328107702662045855
10/12/20 20:13:08 INFO m4.BulkCassandraReader: split - 27005014429213692076328107702662045855 : 28356863910078000000000000000000000000
10/12/20 20:13:18 INFO m4.TagGenerator: 42530 records read so far at a rate of 04250.87/s
10/12/20 20:13:28 INFO m4.TagGenerator: 90000 records read so far at a rate of 04498.43/s
10/12/20 20:13:38 INFO m4.TagGenerator: 135470 records read so far at a rate of 04514.01/s
10/12/20 20:13:48 INFO m4.TagGenerator: 183946 records read so far at a rate of 04597.16/s
10/12/20 20:13:58 INFO m4.TagGenerator: 232105 records read so far at a rate of 04640.62/s

When I upgraded to Cassandra 0.7.2 I had to rebuild the configs because there were a few new options, but I took care to carry over all of the relevant tuning settings from the 0.7.0 configs that worked. However, I can barely read 50 records per second with the new version of Cassandra.

Here is an example of the splits and the speed I see now under Cassandra 0.7.2:

21:02:29.289 [main] INFO  c.p.m.a.batch.BulkCassandraReader - split - 50626015574749929715914856324464978537 : 51655803550438151478740341433770971587
21:02:29.290 [main] INFO  c.p.m.a.batch.BulkCassandraReader - split - 51655803550438151478740341433770971587 : 52653823936598659324985752464905867108
21:02:29.290 [main] INFO  c.p.m.a.batch.BulkCassandraReader - split - 52653823936598659324985752464905867108 : 53666243390660291830842663894184766908
21:02:29.290 [main] INFO  c.p.m.a.batch.BulkCassandraReader - split - 53666243390660291830842663894184766908 : 54679285704932468135374743350323835866
21:02:29.290 [main] INFO  c.p.m.a.batch.BulkCassandraReader - split - 54679285704932468135374743350323835866 : 55681782994511360383246832524957504246
21:02:29.291 [main] INFO  c.p.m.a.batch.BulkCassandraReader - split - 55681782994511360383246832524957504246 : 56713727820156410577229101238628035242
21:09:06.910 [Thread-0] INFO  c.p.m.assembly.batch.TagGenerator - 100 records read so far at a rate of 00000.25/s
21:13:00.953 [Thread-0] INFO  c.p.m.assembly.batch.TagGenerator - 10100 records read so far at a rate of 00015.96/s
21:14:53.893 [Thread-0] INFO  c.p.m.assembly.batch.TagGenerator - 20100 records read so far at a rate of 00026.96/s
21:16:37.451 [Thread-0] INFO  c.p.m.assembly.batch.TagGenerator - 30100 records read so far at a rate of 00035.44/s
21:18:35.895 [Thread-0] INFO  c.p.m.assembly.batch.TagGenerator - 40100 records read so far at a rate of 00041.44/s

As you can probably see from the logs, the code moved to a different package, but other than that it has not changed. It is running on the same hardware, and all memory settings are the same.

I would expect some performance difference between versions of Cassandra, but a drop this drastic (roughly 100x) suggests I am missing something fundamental. Even before tuning the column families and memory settings on 0.7.0 it was never THAT slow.

Does anyone know what could account for this? Is there a tuning setting I might be missing that would be likely to cause it? Did something change in the Cassandra functions that support Hadoop that is just undocumented? Reading through the release notes, I can't find anything that would explain this. Any help fixing it, or even just an explanation of why it may have stopped working, would be appreciated.

1 Answer

Editorial Team · Answer 1 · 2026-05-21T15:19:04+00:00

I figured I should close the loop on this since we got to the bottom of the issue and the problem was not a Cassandra issue but a configuration issue.

When we upgraded to 0.7.2, one piece of configuration that changed, and that I missed, was the token ring. In our 0.7.0 configuration the first token was 2^127 / 12, and in our 0.7.2 configuration the first token was 0. This resulted in one node getting the split 0:0. 0:0 seems to be a magical range that asks Cassandra for everything, so we had one node in the cluster pulling all the data over the network. The network traffic to that node is what ultimately led us to the root of the problem.
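For reference, the two token layouts can be reproduced with a few lines of BigInteger arithmetic. This is a sketch that assumes 12 evenly spaced tokens on a ring of size 2^127; the class `RingTokens` and the `firstIndex` parameter are names invented for the example.

```java
import java.math.BigInteger;

class RingTokens {
    static final BigInteger RING_SIZE = BigInteger.ONE.shiftLeft(127); // 2^127

    // Evenly spaced tokens; firstIndex = 0 puts the first node at token 0,
    // firstIndex = 1 puts it at 2^127 / nodes.
    static BigInteger[] evenTokens(int nodes, int firstIndex) {
        BigInteger[] tokens = new BigInteger[nodes];
        for (int i = 0; i < nodes; i++) {
            tokens[i] = RING_SIZE
                    .multiply(BigInteger.valueOf(i + firstIndex))
                    .divide(BigInteger.valueOf(nodes));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println("first token, old layout: " + evenTokens(12, 1)[0]);
        System.out.println("first token, new layout: " + evenTokens(12, 0)[0]); // 0
    }
}
```

With the layout that starts at 0, the fifth token works out to 56713727820156410577229101238628035242, which is the end of the last split in the 0.7.2 log above.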

The fix was to correct the code to check for the 0:0 case and handle it, so the code will now handle Cassandra clusters partitioned either way (first node as 0 or other).
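The detection part of that check can be sketched as below. It only classifies a range by comparing its start and end tokens; what to do with a full-ring range (skip it, re-split it, fail fast) is left to the caller and is not shown here. The class and enum names are hypothetical, and treating "start equals end" as "the whole ring" is an assumption based on the 0:0 behaviour described above.

```java
import java.math.BigInteger;

class RangeGuard {
    enum Kind { NORMAL, WRAPPING, FULL_RING }

    // Assumption: a range whose start and end tokens are equal covers the whole ring.
    static Kind classify(BigInteger start, BigInteger end) {
        int cmp = start.compareTo(end);
        if (cmp == 0) return Kind.FULL_RING;
        return cmp < 0 ? Kind.NORMAL : Kind.WRAPPING; // start > end wraps past the top of the ring
    }

    public static void main(String[] args) {
        System.out.println(classify(BigInteger.ZERO, BigInteger.ZERO));  // prints "FULL_RING"
        System.out.println(classify(BigInteger.ONE, BigInteger.TEN));    // prints "NORMAL"
        System.out.println(classify(BigInteger.TEN, BigInteger.ONE));    // prints "WRAPPING"
    }
}
```

Running a check like this on every split before reading it would make a 0:0 split visible immediately instead of showing up as unexplained network traffic.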

In short: not a Cassandra issue, but a configuration issue on my part.
