The Archive Base

Editorial Team
Asked: May 21, 2026

I have an application that writes several billion records into Cassandra and removes duplicates

I have an application that writes several billion records into Cassandra and removes duplicates by key. It then groups the records by other fields, such as title, in successive phases so that further processing can be done on groups of similar records. The application is distributed over a cluster of machines because I need it to finish in a reasonable time (hours, not weeks).

One phase of the application writes the records into Cassandra using the Hector client, storing them in a column family with the records' primary keys as the Cassandra keys. The timestamp is set to the record's last update date, so that I only get the latest record for each key.
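The de-duplication described above relies on the newest timestamp winning for a given key. A minimal in-memory sketch of that idea follows; it does not use Hector or Cassandra, and `LatestRecordStore`, `Versioned`, and the sample keys are hypothetical names chosen for illustration. It also does not model how Cassandra resolves two writes with equal timestamps.

```java
import java.util.HashMap;
import java.util.Map;

/** In-memory model of "newest timestamp wins" de-duplication by key. */
class LatestRecordStore {

    /** A stored value together with the write timestamp supplied by the client. */
    static final class Versioned {
        final String value;
        final long timestamp;

        Versioned(String value, long timestamp) {
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    private final Map<String, Versioned> rows = new HashMap<>();

    /** Keeps the write only if it is strictly newer than what is stored for the key. */
    void write(String key, String value, long lastUpdatedMillis) {
        Versioned current = rows.get(key);
        if (current == null || lastUpdatedMillis > current.timestamp) {
            rows.put(key, new Versioned(value, lastUpdatedMillis));
        }
    }

    /** Returns the surviving value for the key, or null if the key was never written. */
    String read(String key) {
        Versioned v = rows.get(key);
        return v == null ? null : v.value;
    }

    int size() {
        return rows.size();
    }

    public static void main(String[] args) {
        LatestRecordStore store = new LatestRecordStore();
        store.write("rec-1", "newer title", 2000L);
        store.write("rec-1", "older title", 1000L); // arrives late, so it is ignored
        System.out.println(store.read("rec-1"));    // prints "newer title"
    }
}
```

Because the timestamp comes from the record rather than from the wall clock, the order in which duplicates arrive does not matter.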

Later phases need to read everything back out of Cassandra, perform some processing on the records, and add the records back to a different column family using various other keys, so that the records can be grouped.

I do this batch reading by calling Cassandra.Client.describe_ring() to find out which machine in the ring is the master for each TokenRange. I then compare the master for each TokenRange against localhost to find out which token ranges are owned by the local machine (remote reads are too slow for this type of batch processing). Once I know which TokenRanges are local to each machine, I get evenly sized splits using Cassandra.Client.describe_splits().
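The ownership check can be sketched roughly as below. This is a self-contained model, not the Thrift API: `RingEntry` is a hypothetical stand-in for one describe_ring() entry, and treating the first endpoint in the list as the range's master is an assumption of this sketch.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Picks out the token ranges whose master is the local machine. */
class LocalRangeFilter {

    /** Hypothetical stand-in for one describe_ring() entry (not the Thrift TokenRange class). */
    static final class RingEntry {
        final BigInteger startToken;
        final BigInteger endToken;
        final List<String> endpoints; // assumption: the first endpoint is the range's master

        RingEntry(String start, String end, String... endpoints) {
            this.startToken = new BigInteger(start);
            this.endToken = new BigInteger(end);
            this.endpoints = Arrays.asList(endpoints);
        }
    }

    /** Returns only the entries whose first endpoint matches the local address. */
    static List<RingEntry> ownedBy(String localAddress, List<RingEntry> ring) {
        List<RingEntry> local = new ArrayList<>();
        for (RingEntry entry : ring) {
            if (!entry.endpoints.isEmpty() && entry.endpoints.get(0).equals(localAddress)) {
                local.add(entry);
            }
        }
        return local;
    }

    public static void main(String[] args) {
        List<RingEntry> ring = Arrays.asList(
            new RingEntry("0", "100", "10.0.0.1", "10.0.0.2"),
            new RingEntry("100", "200", "10.0.0.2", "10.0.0.3"),
            new RingEntry("200", "300", "10.0.0.1", "10.0.0.3"));
        // Each machine runs the same filter with its own address.
        for (RingEntry e : ownedBy("10.0.0.1", ring)) {
            System.out.println(e.startToken + " : " + e.endToken);
        }
    }
}
```

In a real client the local address would have to be compared in the same form the ring reports it (IP versus hostname), which this sketch leaves out.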

Once I have a set of evenly sized splits that can be read from the local Cassandra instance, I start reading them as fast as I can using Cassandra.Client.get_range_slices() with ConsistencyLevel.ONE, so that it doesn't need to do any remote reads. I fetch 100 rows at a time, sequentially through the whole TokenRange (I have tried various batch sizes, and 100 seems to work best for this app).
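Reading a split 100 rows at a time amounts to a paging loop that restarts each request from the last token seen. The sketch below models that loop against an in-memory sorted set instead of a real get_range_slices() call. `RangePager` and `fetchPage` are hypothetical, and the start-exclusive, end-inclusive range semantics and the non-wrapping split (start below end) are assumptions of this sketch.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

/** Pages sequentially through one split in fixed-size batches. */
class RangePager {
    static final int PAGE_SIZE = 100;

    /** Hypothetical stand-in for a range query: tokens with start < token <= end, in order. */
    static List<BigInteger> fetchPage(NavigableSet<BigInteger> tokens,
                                      BigInteger startExclusive,
                                      BigInteger endInclusive,
                                      int limit) {
        List<BigInteger> page = new ArrayList<>();
        for (BigInteger t : tokens.subSet(startExclusive, false, endInclusive, true)) {
            if (page.size() == limit) {
                break;
            }
            page.add(t);
        }
        return page;
    }

    /** Reads the whole split page by page and returns the number of rows seen. */
    static int readSplit(NavigableSet<BigInteger> tokens, BigInteger start, BigInteger end) {
        int total = 0;
        BigInteger cursor = start;
        while (true) {
            List<BigInteger> page = fetchPage(tokens, cursor, end, PAGE_SIZE);
            total += page.size();
            if (page.size() < PAGE_SIZE) {
                return total; // a short page means the split is exhausted
            }
            cursor = page.get(page.size() - 1); // resume just after the last row seen
        }
    }

    public static void main(String[] args) {
        NavigableSet<BigInteger> tokens = new TreeSet<>();
        for (int i = 1; i <= 250; i++) {
            tokens.add(BigInteger.valueOf(i));
        }
        System.out.println(readSplit(tokens, BigInteger.ZERO, BigInteger.valueOf(250)));
    }
}
```

The loop stops on the first page that comes back shorter than the batch size, so a split whose row count is an exact multiple of 100 costs one extra, empty request.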

This all worked great on Cassandra 0.7.0 with a little bit of tuning to memory sizes and column family configs. I could read between 4000 and 5000 records per second in this way, and kept the local disks working about as hard as they could.

Here is an example of the splits and the speed I would see under Cassandra 0.7.0:

10/12/20 20:13:08 INFO m4.BulkCassandraReader: split - 20253030905057371310864605462970389448 : 21603066481002044331198075418409137847
10/12/20 20:13:08 INFO m4.BulkCassandraReader: split - 21603066481002044331198075418409137847 : 22954928635254859789637508509439425340
10/12/20 20:13:08 INFO m4.BulkCassandraReader: split - 22954928635254859789637508509439425340 : 24305566132297427526085826378091426496
10/12/20 20:13:08 INFO m4.BulkCassandraReader: split - 24305566132297427526085826378091426496 : 25656389102612459596423578948163378922
10/12/20 20:13:08 INFO m4.BulkCassandraReader: split - 25656389102612459596423578948163378922 : 27005014429213692076328107702662045855
10/12/20 20:13:08 INFO m4.BulkCassandraReader: split - 27005014429213692076328107702662045855 : 28356863910078000000000000000000000000
10/12/20 20:13:18 INFO m4.TagGenerator: 42530 records read so far at a rate of 04250.87/s
10/12/20 20:13:28 INFO m4.TagGenerator: 90000 records read so far at a rate of 04498.43/s
10/12/20 20:13:38 INFO m4.TagGenerator: 135470 records read so far at a rate of 04514.01/s
10/12/20 20:13:48 INFO m4.TagGenerator: 183946 records read so far at a rate of 04597.16/s
10/12/20 20:13:58 INFO m4.TagGenerator: 232105 records read so far at a rate of 04640.62/s

When I upgraded to Cassandra 0.7.2 I had to rebuild the configs because there were a few new options, but I took care to carry over all of the relevant tuning settings from the 0.7.0 configs that worked. However, I can barely read 50 records per second with the new version of Cassandra.

Here is an example of the splits and the speed I see now under Cassandra 0.7.2:

21:02:29.289 [main] INFO  c.p.m.a.batch.BulkCassandraReader - split - 50626015574749929715914856324464978537 : 51655803550438151478740341433770971587
21:02:29.290 [main] INFO  c.p.m.a.batch.BulkCassandraReader - split - 51655803550438151478740341433770971587 : 52653823936598659324985752464905867108
21:02:29.290 [main] INFO  c.p.m.a.batch.BulkCassandraReader - split - 52653823936598659324985752464905867108 : 53666243390660291830842663894184766908
21:02:29.290 [main] INFO  c.p.m.a.batch.BulkCassandraReader - split - 53666243390660291830842663894184766908 : 54679285704932468135374743350323835866
21:02:29.290 [main] INFO  c.p.m.a.batch.BulkCassandraReader - split - 54679285704932468135374743350323835866 : 55681782994511360383246832524957504246
21:02:29.291 [main] INFO  c.p.m.a.batch.BulkCassandraReader - split - 55681782994511360383246832524957504246 : 56713727820156410577229101238628035242
21:09:06.910 [Thread-0] INFO  c.p.m.assembly.batch.TagGenerator - 100 records read so far at a rate of 00000.25/s
21:13:00.953 [Thread-0] INFO  c.p.m.assembly.batch.TagGenerator - 10100 records read so far at a rate of 00015.96/s
21:14:53.893 [Thread-0] INFO  c.p.m.assembly.batch.TagGenerator - 20100 records read so far at a rate of 00026.96/s
21:16:37.451 [Thread-0] INFO  c.p.m.assembly.batch.TagGenerator - 30100 records read so far at a rate of 00035.44/s
21:18:35.895 [Thread-0] INFO  c.p.m.assembly.batch.TagGenerator - 40100 records read so far at a rate of 00041.44/s

As you can probably see from the logs, the code moved to a different package, but other than that the code has not changed. It is running on the same hardware, and all memory settings are the same.

I could understand some performance difference between versions of Cassandra, but a drop this severe (roughly 100x) makes me think I must be missing something fundamental. Even before tuning the column families and memory settings on 0.7.0 it was never THAT slow.

Does anyone know what could account for this? Is there some tuning setting that I might be missing that would be likely to cause this? Did something change in the Cassandra functions that support Hadoop that is just undocumented? Reading through the release notes, I can't find anything that would explain this. Any help on fixing this, or even just an explanation of why it may have stopped working, would be appreciated.

1 Answer

Editorial Team
Added an answer on May 21, 2026 at 3:19 pm

    I figured I should close the loop on this, since we got to the bottom of the issue: the problem was not a Cassandra issue but a configuration issue.

    When we upgraded to 0.7.2, one piece of configuration that changed, and that I missed, was the token ring. In our 0.7.0 configuration the first token was 2^127 / 12, and in our 0.7.2 configuration the first token was 0. This resulted in one node getting the split 0:0. 0:0 seems to be a magical range that asks Cassandra for everything, so we had one node in the cluster pulling all the data over the network. The network traffic to that node is what ultimately led us to the root of the problem.
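To make the two token layouts concrete, here is a small sketch that computes evenly spaced tokens on a 2^127 ring for a given first token. The node count of 12 is an assumption inferred from the 2^127 / 12 figure above, and `RingTokens` and `evenlySpaced` are hypothetical names; the sketch only does the arithmetic and does not talk to Cassandra.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

/** Computes evenly spaced initial tokens on a ring of size 2^127. */
class RingTokens {
    static final BigInteger RING_SIZE = BigInteger.ONE.shiftLeft(127); // 2^127

    /** Returns nodeCount tokens spaced 2^127 / nodeCount apart, starting at firstToken. */
    static List<BigInteger> evenlySpaced(int nodeCount, BigInteger firstToken) {
        BigInteger step = RING_SIZE.divide(BigInteger.valueOf(nodeCount));
        List<BigInteger> tokens = new ArrayList<>();
        for (int i = 0; i < nodeCount; i++) {
            tokens.add(firstToken.add(step.multiply(BigInteger.valueOf(i))));
        }
        return tokens;
    }

    public static void main(String[] args) {
        int nodes = 12; // assumption, see the note above
        BigInteger step = RING_SIZE.divide(BigInteger.valueOf(nodes));

        // Layout described for 0.7.0: first token is 2^127 / 12.
        System.out.println("first token, old layout: " + evenlySpaced(nodes, step).get(0));
        // Layout described for 0.7.2: first token is 0.
        System.out.println("first token, new layout: " + evenlySpaced(nodes, BigInteger.ZERO).get(0));
    }
}
```

Both layouts spread the nodes equally far apart; the difference is only whether some node sits exactly on token 0, which is the case the reader code had not accounted for.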

    The fix was to correct the code to check for the 0:0 case and handle it, so the code now handles Cassandra clusters partitioned either way (first token 0 or some other value).
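The answer does not say how the 0:0 case was handled, so the following is only one possible shape for such a check, not the author's code. It assumes that any split whose start and end tokens are equal behaves like the 0:0 split described above, and it simply separates those splits out so the caller can decide what to do with them. `SplitGuard` and `Split` are hypothetical names.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Separates ordinary splits from splits that would select the whole ring. */
class SplitGuard {

    /** Hypothetical split: a start and end token, as in the "split - a : b" log lines. */
    static final class Split {
        final BigInteger start;
        final BigInteger end;

        Split(String start, String end) {
            this.start = new BigInteger(start);
            this.end = new BigInteger(end);
        }
    }

    /** Assumption: equal start and end tokens select everything, like the 0:0 case. */
    static boolean selectsWholeRing(Split split) {
        return split.start.equals(split.end);
    }

    /** Returns only the splits that are safe to read as a bounded local range. */
    static List<Split> boundedOnly(List<Split> splits) {
        List<Split> bounded = new ArrayList<>();
        for (Split split : splits) {
            if (!selectsWholeRing(split)) {
                bounded.add(split);
            }
        }
        return bounded;
    }

    public static void main(String[] args) {
        List<Split> splits = Arrays.asList(new Split("0", "0"), new Split("0", "100"));
        for (Split s : boundedOnly(splits)) {
            System.out.println(s.start + " : " + s.end);
        }
    }
}
```

Filtering is the simplest reaction; a real fix might instead recompute the node's true range or fail loudly, depending on whether the flagged split stands for data the node genuinely owns.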

    So in short: not a Cassandra issue, but a configuration issue on my part.

