I am considering developing an application with a Cassandra backend. I am hoping that I will be able to run each cassandra node on commodity hardware with the following specs:
- Quad Core 2GHz i7 CPU
- 2x 750GB disk drives
- 16 GB installed RAM
Now, I have been reading online that the available disk-space for Cassandra should be double the amount that is stored on the disks, which would mean that each node (set up in a RAID-1 configuration) would be able to store 375 GB of data, which is acceptable.
My question is this if 16GB RAM is enough to efficiently serve 375 GB of data per node. The data in the application developed will also be fairly time-dependant, such that recent data will be the data most read from the database. In fact, most of the data will be deleted after about 6 months.
Also, would I assign Cassandra a Heap (-Xmx) close to 16 GB, or does Cassandra utilize off-heap memory ?
You should not set the Cassandra heap to more than 8GB; bigger than that, and garbage collection will kill you with large pauses. Cassandra will use the buffer cache (like other applications) so the remaining memory isn’t wasted.
16GB of RAM will be enough to serve the data if your hot set will all fit in RAM, or if serving rate can be served off disk. Disks can do about 100 random IO/s, so with your setup if you need more than 200 reads / second you will need to make sure the data is in cache. Cassandra exports good cache statistics (cassandra-cli show keyspaces) so you should easily be able to tell how effective your cache is being.
Do bear in mind, with only two disks in RAID-1, you will not have a dedicated commit log. This could hamper write performance quite badly. You may want to consider turning off the commit log if it does affect performance, and forgo durable writes.