We are working on a project which should collect journal and audit data and store it in a datastore for archive purposes and some views. We are not quite sure which datastore would work for us.
- we need to store small JSON documents, about 150 bytes, e.g.
"audit:{timestamp:'86346512',host:'foo',username:'bar',task:'foo',result:0}" or "journal:{timestamp:'86346512',host:'foo',terminalid:1,type:'bar',rc:0}"
- we are expecting about one million entries per day, about 150 MB data
- data will be stored and read but never modified
- data should be stored in an efficient way, e.g. the binary format used by Apache Avro
- after a retention time data may be deleted
- custom queries, such as
'get audit for user and time period' or 'get journal for terminalid and time period'
- replicated database for failsafe
- scalable
Currently we are evaluating NoSQL databases like Hadoop/HBase, CouchDB, MongoDB and Cassandra. Are these databases the right datastore for us? Which of them would fit best?
Are there better options?
One million inserts / day is about 12 inserts / second. Most databases can deal with this, and it's well below the maximum insertion rate we get from Cassandra on reasonable hardware (50k inserts / sec).
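As a quick sanity check, the arithmetic can be sketched in a few lines of Python. The entry count and record size are the figures from the question, and the 50k inserts / sec rate is the one quoted above; nothing here is measured.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

entries_per_day = 1_000_000   # from the question
bytes_per_entry = 150         # from the question
cassandra_rate = 50_000       # inserts / sec, the figure quoted in this answer

# Average sustained insert rate if the load is spread evenly over the day.
inserts_per_second = entries_per_day / SECONDS_PER_DAY

# Raw payload volume per day, before replication or storage overhead.
megabytes_per_day = entries_per_day * bytes_per_entry / 1_000_000

# How many times the quoted Cassandra rate exceeds the average load.
headroom = cassandra_rate / inserts_per_second

print(f"{inserts_per_second:.1f} inserts/s, "
      f"{megabytes_per_day:.0f} MB/day, "
      f"{headroom:.0f}x headroom")
```

Real traffic is rarely spread evenly, so peak rates may be several times the average, but that still leaves a very large margin.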
Your requirement “after a retention time data may be deleted” fits Cassandra’s column TTLs nicely – when you insert data you can specify how long to keep it for, then background merge processes will drop that data when it reaches that timeout.
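A minimal sketch of what a TTL-carrying insert could look like if you go through CQL, which accepts a `USING TTL <seconds>` clause on `INSERT`. The table name `audit`, the column names, and the 90-day retention period are assumptions made up for illustration; the helper only builds the statement text and does not talk to a cluster.

```python
def retention_seconds(days: int) -> int:
    """Convert a retention period in days to the seconds CQL expects for a TTL."""
    return days * 24 * 60 * 60


def insert_with_ttl(table: str, columns: list[str], ttl_seconds: int) -> str:
    """Build a parameterised CQL INSERT that expires the row after ttl_seconds."""
    column_list = ", ".join(columns)
    placeholders = ", ".join("?" for _ in columns)
    return (
        f"INSERT INTO {table} ({column_list}) "
        f"VALUES ({placeholders}) USING TTL {ttl_seconds}"
    )


# Hypothetical table and columns; 90 days is an assumed retention period.
statement = insert_with_ttl(
    "audit",
    ["username", "event_time", "task", "result"],
    retention_seconds(90),
)
print(statement)
```

The resulting string would be prepared and executed through whichever client driver you use.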
“data should be stored in an efficient way, e.g. binary format used by Apache Avro” – Cassandra (like many other NoSQL stores) treats values as opaque byte sequences, so you can encode your values however you like. You could also consider decomposing the value into a series of columns, which would allow you to do more complicated queries.
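To give a feel for how much a binary encoding can save on records this small, here is a sketch using Python's standard `struct` module. It is a hand-rolled layout standing in for Avro, not Avro itself; the field widths (32-bit timestamp, 16-bit result, one length byte per string) are assumptions chosen for the example record from the question.

```python
import json
import struct

HEADER = ">Ih"  # big-endian: unsigned 32-bit timestamp, signed 16-bit result


def encode_audit(timestamp: int, host: str, username: str, task: str, result: int) -> bytes:
    """Pack an audit record as a fixed header plus length-prefixed UTF-8 strings."""
    parts = [struct.pack(HEADER, timestamp, result)]
    for text in (host, username, task):
        raw = text.encode("utf-8")
        parts.append(struct.pack(">B", len(raw)) + raw)  # 1-byte length, max 255
    return b"".join(parts)


def decode_audit(blob: bytes) -> dict:
    """Reverse encode_audit and return the record as a dict."""
    timestamp, result = struct.unpack_from(HEADER, blob, 0)
    offset = struct.calcsize(HEADER)
    fields = []
    for _ in range(3):
        (length,) = struct.unpack_from(">B", blob, offset)
        offset += 1
        fields.append(blob[offset:offset + length].decode("utf-8"))
        offset += length
    host, username, task = fields
    return {"timestamp": timestamp, "host": host, "username": username,
            "task": task, "result": result}


record = {"timestamp": 86346512, "host": "foo", "username": "bar",
          "task": "foo", "result": 0}
blob = encode_audit(**record)
as_json = json.dumps(record).encode("utf-8")
print(len(blob), "bytes packed vs", len(as_json), "bytes as JSON")
```

A schema-based format such as Avro gets a similar effect by keeping the field names in the schema instead of repeating them in every record.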
custom queries, such as ‘get audit for user and time period’ – in Cassandra, you would model this by making the user id the row key and the time of the event (most likely a timeuuid) the column key. You would then use a get_slice call (or, even better, CQL) to satisfy this query.
or ‘get journal for terminalid and time period’ – as above, have the row key be terminalid and column key be timestamp. One thing to note is that in Cassandra (like many join-less stores), it is typical to insert the data more than once (in different arrangements) to optimise for different queries.
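The row key / column key layout and the "insert more than once" idea can be illustrated with a small in-memory model. This is only a sketch of the data model, not Cassandra client code: `WideRowStore`, `record_audit`, and the second index by host are hypothetical names invented for the example.

```python
import bisect
from collections import defaultdict


class WideRowStore:
    """Toy model of a wide row: each row key holds columns sorted by column key."""

    def __init__(self):
        # row key -> (sorted column keys, values in the same order)
        self._rows = defaultdict(lambda: ([], []))

    def insert(self, row_key, column_key, value):
        keys, values = self._rows[row_key]
        index = bisect.bisect_right(keys, column_key)
        keys.insert(index, column_key)
        values.insert(index, value)

    def get_slice(self, row_key, start, end):
        """Return (column key, value) pairs with start <= column key <= end."""
        keys, values = self._rows.get(row_key, ([], []))
        lo = bisect.bisect_left(keys, start)
        hi = bisect.bisect_right(keys, end)
        return list(zip(keys[lo:hi], values[lo:hi]))


# The same event is written twice, once per query pattern.
audit_by_user = WideRowStore()
audit_by_host = WideRowStore()


def record_audit(event):
    audit_by_user.insert(event["username"], event["timestamp"], event)
    audit_by_host.insert(event["host"], event["timestamp"], event)


record_audit({"timestamp": 100, "host": "foo", "username": "bar", "task": "login", "result": 0})
record_audit({"timestamp": 200, "host": "foo", "username": "bar", "task": "backup", "result": 0})
record_audit({"timestamp": 300, "host": "foo", "username": "baz", "task": "login", "result": 1})

# 'get audit for user and time period'
for timestamp, event in audit_by_user.get_slice("bar", 150, 250):
    print(timestamp, event["task"])
```

Because the columns within a row are kept sorted, a time-range query is a contiguous slice of one row, which is what makes this layout cheap to read.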
Cassandra has a very sophisticated replication model, where you can specify different consistency levels per operation. Cassandra is also a very scalable system with no single point of failure or bottleneck. This is really the main difference between Cassandra and things like MongoDB or HBase (not that I want to start a flame war!).
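To make "consistency levels per operation" a little more concrete, here is a sketch of the usual quorum rule of thumb: a read is guaranteed to overlap with the latest acknowledged write when the number of replicas written plus the number of replicas read exceeds the replication factor. The replication factor of 3 is an assumption for the example, and the function names are made up.

```python
def quorum(replication_factor: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1


def read_overlaps_write(replication_factor: int, write_replicas: int, read_replicas: int) -> bool:
    """True when every read set must share at least one replica with every write set."""
    return write_replicas + read_replicas > replication_factor


rf = 3  # assumed replication factor
q = quorum(rf)

print("quorum size:", q)
print("QUORUM write + QUORUM read overlap:", read_overlaps_write(rf, q, q))
print("ONE write + ONE read overlap:", read_overlaps_write(rf, 1, 1))
```

For write-once audit data, writing at a weak level and reading at a stronger one (or the reverse) may be an acceptable trade of latency against freshness.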
Having said all of this, your requirements could easily be satisfied by a more traditional database and simple master-slave replication; nothing here is too onerous.