Was able to recreate a simpler scenario, see update near bottom. First some background

Question


Asked: May 30, 2026 (2026-05-30T09:36:44+00:00)


Was able to recreate a simpler scenario, see update near bottom

First, some background on the problem. I’m doing some Cassandra experiments on Amazon EC2. I’ve got 4 nodes in East and 4 in West, all in one cluster. To simulate my use case, I used Cassandra’s built-in stress tool, running on a separate East EC2 instance, to issue:

./stress -d us-eastnode1,…,us-eastnode4 --replication-strategy NetworkTopologyStrategy --strategy-properties us-east:3,us-west:3 -e LOCAL_QUORUM -c 200 -i 10 -n 1000000

Next I ran the same write while also starting a corresponding LOCAL_QUORUM read on another separate West EC2 instance:

./stress -d us-westnode1,…,us-westnode4 -o read -e LOCAL_QUORUM -c 200 -i 10 -n 1000000

After the first 300k or so reads, one of the West nodes started blocking at ~80% iowait CPU, lowering the total read speed by ~90%. Meanwhile, the writes finished just fine at close to their normal speed. To figure out what was causing this single node to block on iowait, I started up just the reader and hit the same issue immediately.

My tokens are balanced around the East nodes, with each West node at +1 over the corresponding East node, i.e. us-eastnode1: 0, us-westnode1: 1, us-eastnode2: 42535295865117307932921825928971026432, etc. The actual load ended up balanced across the set, so I ruled that out as a possible cause.
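The token layout described above can be reproduced with a few lines of Python. This is only a sketch, assuming RandomPartitioner's 0 to 2**127 token range and evenly spaced tokens; the helper name `balanced_tokens` is made up for illustration.

```python
# Sketch: evenly spaced RandomPartitioner tokens for one DC, with the
# second DC shifted by +1 so that no two nodes share a token.
RING_SIZE = 2 ** 127  # assumed RandomPartitioner token range: 0 .. 2**127


def balanced_tokens(node_count, offset=0):
    """Return node_count evenly spaced tokens, each shifted by offset."""
    return [i * RING_SIZE // node_count + offset for i in range(node_count)]


east_tokens = balanced_tokens(4)            # us-eastnode1 .. us-eastnode4
west_tokens = balanced_tokens(4, offset=1)  # us-westnode1 .. us-westnode4

for name, tokens in (("us-east", east_tokens), ("us-west", west_tokens)):
    for i, token in enumerate(tokens, start=1):
        print(f"{name}node{i}: {token}")
```

The second East token comes out as 42535295865117307932921825928971026432, matching the value quoted above.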

I eventually ran a major compaction (despite there being only 10 SSTables for the CF, and no minor compactions having been kicked off for over an hour). When I tried the stress read again, the node was fine. However, the next sequential node then had the same problem. This is the biggest clue I have found, but I do not know where it leads.

I’ve asked in the Cassandra IRC channel, but got no ideas there. Does anybody have ideas for new things I could try to figure out what is going wrong here?

Next day update
After some further digging, I was able to reproduce this by simply running the write stress twice, then running the read. nodetool cfstats after the first write shows that each node is responsible for ~750k keys, which makes sense for 1,000,000 keys at RF 3 across 4 nodes in a DC. However, after the second stress write, us-westnode1 has ~1,500,000 keys while the other West nodes each have ~875,000 keys. When I then try to read, the node with twice as much load as it should have bogs down.
This makes me think that the trouble is in the stress tool. It is overwriting the same 0000000-0999999 rows with the same c0-c199 columns, yet none of the nodes stay at roughly the same data load as they had after the first run.
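The ~750k figure above is simple arithmetic, sketched here in Python. It assumes even token ownership within the DC; the function name is illustrative only.

```python
def expected_keys_per_node(total_keys, replication_factor, nodes_in_dc):
    """Average number of keys stored per node in one DC, assuming even ownership."""
    return total_keys * replication_factor / nodes_in_dc


expected = expected_keys_per_node(1_000_000, 3, 4)
print(f"expected keys per node: {expected:,.0f}")

# Key counts reported after the second write, relative to the expectation.
for label, observed in (("heaviest node", 1_500_000), ("other nodes", 875_000)):
    print(f"{label}: {observed / expected:.2f}x expected")
```

Because the second run overwrites the same keys, the expected count per node does not change between runs, so a node reporting twice that number is holding two unmerged copies' worth of keys.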

Simple recreation
Narrowed down the problem by removing the second DC as a variable. Now running 1 DC, 4 nodes with 25% ownership each, RandomPartitioner, and the following write:

./stress -d node1,…,node4 --replication-factor 3 -e QUORUM -c 200 -i 10 -n 1000000

After one write (and minor compactions), each node had ~7.5 GB of load.
After two writes (and minor compactions), each node had ~8.6 GB of load, save for node2 with ~15 GB.
After running a major compaction on all nodes, each node was back to ~7.5 GB of load.
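To put the load figures above in proportion, here is a small Python sketch that expresses each node's load after the second write as overhead relative to the fully compacted ~7.5 GB baseline. The numbers are the approximate ones reported above; the function name is illustrative.

```python
BASELINE_GB = 7.5  # per-node load after one write, and again after major compaction


def overwrite_overhead(load_gb, baseline_gb=BASELINE_GB):
    """Fraction of extra on-disk data relative to the fully compacted baseline."""
    return (load_gb - baseline_gb) / baseline_gb


loads_after_second_write = {"node2": 15.0, "other nodes": 8.6}
for node, load_gb in loads_after_second_write.items():
    print(f"{node}: {overwrite_overhead(load_gb):.0%} above baseline")
```

An overhead near 100% is what you would see if almost none of the second write had yet been merged with the first, while ~15% suggests most of the duplicates had already been compacted away.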

Is this simply a weird compaction issue that crops up when effectively overwriting the entire dataset like the stress tool does?


1 Answer

Editorial Team · Answer 1 · 2026-05-30T09:36:45+00:00

Is this simply a weird compaction issue that crops up when effectively overwriting the entire dataset like the stress tool does?

Yes, compaction bucketing is going to behave somewhat randomly, and it’s normal for some nodes to not compact as well as others. (That said, it sounds like node2, with essentially no compaction done, was probably just behind.)
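To illustrate why bucketing can look random from node to node, here is a deliberately simplified Python model of size-tiered bucketing. It is a sketch, not Cassandra's actual implementation: the `bucket_low`, `bucket_high`, and `min_threshold` values are assumptions chosen to resemble typical size-tiered defaults, and the SSTable sizes are invented.

```python
def bucket_by_size(sstable_sizes, bucket_low=0.5, bucket_high=1.5):
    """Group SSTable sizes into buckets whose members are close to the bucket average."""
    buckets = []  # each bucket is a list of sizes
    for size in sorted(sstable_sizes):
        for bucket in buckets:
            average = sum(bucket) / len(bucket)
            if bucket_low * average <= size <= bucket_high * average:
                bucket.append(size)
                break
        else:
            # No existing bucket is similar enough, so start a new one.
            buckets.append([size])
    return buckets


def compaction_candidates(sstable_sizes, min_threshold=4):
    """Return only the buckets that hold enough SSTables to trigger a minor compaction."""
    return [b for b in bucket_by_size(sstable_sizes) if len(b) >= min_threshold]


# Two hypothetical nodes with four SSTables each (sizes in MB).
print(compaction_candidates([100, 110, 95, 105]))  # four similar sizes: one full bucket
print(compaction_candidates([100, 110, 95, 400]))  # one outlier: no bucket reaches 4
```

In this model, a node whose flushes happen to produce one differently sized SSTable keeps its overwritten data unmerged until more similar-sized SSTables accumulate, while a neighbour with the same data compacts right away.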

If your actual workload also involves a lot of overwrites, you should probably test Leveled Compaction, which is designed to do a better and more predictable job in that scenario: http://www.datastax.com/dev/blog/leveled-compaction-in-apache-cassandra
