I’m sorry in advance if this question is flawed. I’m pretty new to databases (I have set them up but not used them much in my development learning).
Background:
I have a process that generates a lot of test data. It’s basically a hashtable with several hundred million records every day (at the end of the day I can delete those records). Generating the data takes too long on one machine, so I’m splitting the process up over several servers. Each server basically needs to look up a key in a database (currently a hashtable); if the key exists it does some work, and if it doesn’t exist it adds it. I think (so far) I need a database that can handle a large number of writes in a consistent way (i.e. updates should be available instantly) and that can efficiently transfer this table over the network to other worker nodes. After the table is created, another job runs that is based on it, and I don’t think a single server serving a 10+ GB table to several servers is efficient, so I was thinking it needs to be distributed.
Problem/Question:
If I use a NoSQL solution like HBase (which I have a bit of experience setting up), will my application logic work? If I have 2 servers writing to a distributed database, is there any chance that server1 adds an entry but server2 can’t find it when it looks it up, because it hasn’t replicated through the cluster yet? Also, is there a better way to do what I’m trying to do? Would a single server with no distribution work better (I am also considering just using MySQL)? I was avoiding that because I wanted a solution where, if it was too slow, I could simply add more worker servers to write to the database, and I’m not sure whether my performance returns would diminish if I had 100 workers writing to a single server.
Any tips or suggestions would be great.
Thanks!
Update: I just realized that Facebook’s messaging infrastructure uses HBase. If it were not consistent, then I would be getting crazy delays when messaging my friends. So how does HBase stay consistent (or is it really not consistent, and Facebook is so fast that it seems that way)?
HBase, in particular, has guaranteed consistency. This means that once a write operation has completed, the data written will be available to all clients. The write itself does not happen instantly, however: the new value is only guaranteed to be visible after the write call has returned, so that latency must be taken into account.
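To make the "look it up, and add it if it doesn't exist" logic from the question concrete, here is a minimal Python sketch. It assumes a toy in-memory `ConsistentStore` class (hypothetical, not the HBase client API) whose `put_if_absent` method does the check and the insert as one atomic step. A strongly consistent store can offer this kind of operation; with a separate lookup followed by a separate write, two workers could both see "missing" and both do the work.

```python
import threading


class ConsistentStore:
    """Toy stand-in for a strongly consistent key-value store (not a real client)."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def put_if_absent(self, key, value):
        """Atomically insert key; return True only for the caller that inserted it."""
        with self._lock:
            if key in self._data:
                return False
            self._data[key] = value
            return True


def worker(store, keys, worker_id, results):
    # Every worker tries every key; only the first one to arrive gets to claim it.
    for key in keys:
        if store.put_if_absent(key, worker_id):
            results[worker_id].append(key)


def run_workers(num_workers=4, num_keys=1000):
    store = ConsistentStore()
    keys = [f"record-{i}" for i in range(num_keys)]
    results = [[] for _ in range(num_workers)]  # one private list per worker
    threads = [
        threading.Thread(target=worker, args=(store, keys, w, results))
        for w in range(num_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    claimed = [key for per_worker in results for key in per_worker]
    return claimed, keys
```

Calling `run_workers()` should return a `claimed` list in which every key appears exactly once, no matter how the threads interleave, because the check and the insert cannot be separated.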
Other NoSQL database engines, such as Cassandra, support what is called “eventual consistency”, which trades absolute consistency for write speed. This means that a piece of data written to the cluster will EVENTUALLY be consistent across nodes, but it may take some time — typically this period of time is very short. More information on such a trade-off can be found here.
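The scenario from the question (server1 writes, server2 can't find the entry yet) can be illustrated with a small Python sketch. The `EventuallyConsistentStore` class below is a hypothetical toy model with two in-memory replicas and an explicit `replicate()` step; it does not reflect how Cassandra or any real engine is implemented, only the visibility gap that eventual consistency allows.

```python
class EventuallyConsistentStore:
    """Toy model: a write lands on one replica and reaches the others later."""

    def __init__(self, num_replicas=2):
        self.replicas = [{} for _ in range(num_replicas)]
        self.pending = []  # (source replica index, key, value) not yet propagated

    def write(self, replica, key, value):
        # The write is acknowledged as soon as one replica has it.
        self.replicas[replica][key] = value
        self.pending.append((replica, key, value))

    def read(self, replica, key):
        # A read only consults the replica it is sent to.
        return self.replicas[replica].get(key)

    def replicate(self):
        # Propagate all pending writes to every other replica.
        for source, key, value in self.pending:
            for index, replica in enumerate(self.replicas):
                if index != source:
                    replica[key] = value
        self.pending.clear()


def demo():
    store = EventuallyConsistentStore()
    store.write(0, "record-1", "done")   # server1 writes via replica 0
    before = store.read(1, "record-1")   # server2 reads via replica 1: not there yet
    store.replicate()                    # background replication catches up
    after = store.read(1, "record-1")    # now visible everywhere
    return before, after
```

In this model `demo()` returns `(None, "done")`: the second server misses the entry until replication has run, which is exactly the window in which a worker would wrongly decide to add the record again.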
It is my supposition that you would prefer the guaranteed consistency of HBase.
This depends on what your records are going to look like. Could you provide more information on the data you’ll be storing? If your data fields cater to a document model (you typically require all of the fields when accessing data for a given key), then you could look into various document-based data stores, such as MongoDB. MongoDB offers various levels of consistency (the default, rather conveniently, is to guarantee consistency like HBase).
If you will often be looking for some subset of the fields stored for each key, then HBase will help minimize the amount of data you’re sending over the network by allowing you to specify which columns you wish to receive from a scan or get.
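As a rough illustration of why asking for only some columns saves network traffic, here is a Python sketch. The row contents, the `fetch` function, and the `wire_size` helper are all made up for the example; a real HBase get or scan does the column selection on the server side, which this toy only imitates.

```python
import json

# Hypothetical row: one small status column and one large payload column.
ROW = {"status": "done", "payload": "x" * 10_000, "owner": "worker-3"}


def fetch(row, columns=None):
    """Return only the requested columns (all of them when columns is None)."""
    if columns is None:
        return dict(row)
    return {name: row[name] for name in columns if name in row}


def wire_size(result):
    """Approximate bytes sent over the network, using JSON as the encoding."""
    return len(json.dumps(result).encode("utf-8"))


full = wire_size(fetch(ROW))
projected = wire_size(fetch(ROW, ["status"]))
print(f"full row: {full} bytes, status only: {projected} bytes")
```

If a worker only needs to know whether a record exists or what its status is, requesting that single column avoids shipping the large payload on every lookup.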
The distributed database engines will certainly perform better under concurrent reads/writes. Due to the aforementioned properties, HBase is considered to be strong in read-heavy scenarios (writes aren’t visible until they have been fully propagated), while Cassandra and other eventually consistent database engines are considered to be strong in write-heavy scenarios (though Cassandra’s latest release has seen significant performance gains in reading).
A traditional database running on a single server will suffer when the read/write load increases, as it will have to queue incoming connections as well as disk operations once they have reached their respective rate limits. I believe HBase (or MongoDB, should you decide a document store could work for you) would suit your needs for consistency the best.