I am currently struggling to find the correct data format to use with Cassandra. I guess this is because of the additional depth it offers over standard key-value stores.
My data format is currently defined like this:
- Keyspaces for different Applications.
- Column Families for different Application parts.
- In these Column Families I have the data.
Most of the data is stored within a single Column Family in the format:
Key: UUID-1|UUID-2|UUID-3
Value: Array of PHP Values
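To make the key layout concrete, here is a minimal Python sketch of a composite row key built from three UUIDs joined with `|`. The function names `make_row_key` and `split_row_key` are hypothetical and only illustrate the format described above; they are not part of any Cassandra client.

```python
import uuid


def make_row_key(*parts):
    """Join UUIDs into a single row key of the form UUID-1|UUID-2|UUID-3."""
    return "|".join(str(p) for p in parts)


def split_row_key(key):
    """Recover the individual UUIDs from a composite row key."""
    return [uuid.UUID(p) for p in key.split("|")]


# Example with three random UUIDs (illustrative values only).
a, b, c = uuid.uuid4(), uuid.uuid4(), uuid.uuid4()
key = make_row_key(a, b, c)
print(key)
```

Because the canonical string form of a UUID never contains `|`, the key can be split back into its parts without ambiguity.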
After inserting several hundred thousand entries (<1 KB each) I see a performance degradation when reading data.
From my understanding, Column Families are exactly where the main part of my data should be stored, so having most of my data in a single Column Family instead of several different ones should not be the issue.
Should I look into splitting my data into different Column Families, or is the approach correct and something else is likely to be the cause of the problem?
Edit to answer DNA’s questions in the comment:
I am comparing the read time needed for a single key I have inserted before starting my tests.
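The measurement can be sketched roughly as follows: read the same key many times and keep the lowest time. This is only an illustration; `best_read_time` is a hypothetical helper, and a plain dict stands in for the Cassandra client, whose real read call is not shown in this post.

```python
import time


def best_read_time(read_key, key, repeats=1000):
    """Call read_key(key) `repeats` times and return the lowest elapsed time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        read_key(key)
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    return best


# Stand-in for a real client: a dict lookup instead of a Cassandra read.
store = {"test-key": "value"}
print(best_read_time(store.get, "test-key"))
```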
The test key was consistently read in under 0.0010 s across more than 1,000 reads at the beginning, while the database was still empty. The data written in the tests is structured like this:
- A row identified by a Key built with 5 chars + 20 numbers
- with one Column (1 Character) containing the current unix timestamp
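A row of this shape could be generated as in the sketch below. It assumes that "20 numbers" means 20 decimal digits and that "1 Character" refers to the column name; the column name `t` and the helper `make_test_row` are made up for illustration.

```python
import random
import string
import time


def make_test_row(rng):
    """Return (row_key, columns) for one test row.

    row_key: 5 letters followed by 20 digits (assumed layout).
    columns: a single one-character column holding the current unix timestamp.
    """
    prefix = "".join(rng.choice(string.ascii_lowercase) for _ in range(5))
    digits = "".join(rng.choice(string.digits) for _ in range(20))
    return prefix + digits, {"t": str(int(time.time()))}


rng = random.Random(42)  # seeded so the generated keys are repeatable
row_key, columns = make_test_row(rng)
print(row_key, columns)
```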
I added entries and re-ran the same read test to compare how the read times change. The read times I am listing here are the lower numbers I measured:
Entries | Read Time (s)
0 | 0.0010
150,000 | 0.0013
300,000 | 0.0014
500,000 | 0.0016
750,000 | 0.0019
1,000,000 | 0.0022
Because this is only basic testing, it runs on a single node (an EC2 instance at Amazon). The read time seems to increase by about 0.0003 s for every 250,000 new rows.
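The estimate can be checked with an ordinary least-squares line through the six measurements. The sketch below uses only the numbers from the table above; `fit_line` is a hypothetical helper written for this post.

```python
entries = [0, 150_000, 300_000, 500_000, 750_000, 1_000_000]
read_times = [0.0010, 0.0013, 0.0014, 0.0016, 0.0019, 0.0022]  # seconds


def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(entries, read_times)
print(f"increase per 250,000 rows: {slope * 250_000:.6f} s")
print(f"fitted read time on an empty database: {intercept:.6f} s")
```

The fitted slope comes out close to the 0.0003 s per 250,000 rows estimated by eye. With only six points from a single node, this describes the test run itself and should not be read as a prediction for much larger datasets.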
I know that these are really small numbers and they are great, but the linear growth of the read time is not what I expected.
I am planning to move a big MySQL server with a huge number of small entries to Cassandra. It currently contains about 75 billion entries and new data is arriving quickly, so a linear increase in read time makes me wonder whether I am going in the right direction.
Thanks for updating the question.
You should probably read this article about the Netflix benchmarking.
Benchmarking with relatively small numbers of rows won’t tell you anything about the scalability for large datasets. It’s not difficult to run this kind of test for many millions of rows.
If you are just testing at the moment, you should probably upgrade to the 1.0 branch (currently 1.0.7) as this is significantly faster than 0.7.
Performance on cloud servers may not be very representative of the performance on real local hardware, although cloud servers are a great idea for cluster testing. See http://wiki.apache.org/cassandra/CassandraHardware
If read latency is your key concern, then make sure you are familiar with the cache settings in Cassandra (keys_cached and rows_cached) – see http://wiki.apache.org/cassandra/StorageConfiguration, for example.