I’m working on a project aimed to analyze biometric data collected from various terminals. The process is not very performance critical. Rather it’s I/O bounded. Amount of data is very huge. (hundreds of millions records per table). Unfortunately database is relational. And there are 20 foreign keys. Changing values of referenced keys is very common during completion of job. So there will be lots of UPDATE and SET NULL s during collecting data.
Currently, semantics of database is designed. All programs are almost completed, and also a MySQL prototype for database is created. It works fine with sample (small-scale) data.
I do a search to find a suitable DBMS for the project. Googling around “DBMS comparisons” ,… didn’t help. People say antithesis things. Some say MySQL will perform faster inserts and updates, some say Oracle9 is better…
I can’t find any reliable, benchmark-based comparison between DBMS. I use MySQL in everyday projects, but this one looks more critical.
What we need:
- License and cost of DBMS is not important, but of course an open source (GPL or LGPL) is preferred (since whole project is will be published under LGPL).
- Very fast inserts, very fast updates, a lot of foreign keys is needed.
- DBMS should response to 0 – 100 connections at a time.
- Terminals are connected to server by a local network (LAN).
What I’m actually looking for, is a benchmark of various DBMS’s. It may contain charts, separated comparisons of different operations (insert, update, delete) in various situations (on a relation with referenced fields, or normal table)…
For this sort of answer, I would recommend PostgreSQL, Informix, or Oracle. PostgreSQL is open source (BSDL, GPL compatible, as everyone agrees). The reasons have to do with some aspects of data modelling that may be extremely helpful in your case. In general you have two important questions:
1) How far can I tune my db for what I am doing? How far can I scale it?
and
2) How can I model my data?
On the first, Oracle and PostgreSQL are more complex but more flexible. That flexibility may come in handy. On the second, the flexibility may save you a lot of effort later. Moreover it opens up new doors regarding optimization which are not possible in a straight relational model. First I would recommend looking at this: http://db.cs.berkeley.edu/papers/Informix/www.informix.com/informix/corpinfo/zines/whitpprs/illuswp/wave.htm as it will give you some background as to what I am thinking. Additionally, if you look at what Stonebraker is talking about you will see that straight benchmarks are really an apples to oranges comparison here.
The idea of going with an ORDBMS means a few important things:
PostgreSQL 9.2 will support up to approx 14000 writes per sec on sufficient hardware, which is nothing to sneeze at. Of course this depends on the width of the write, hardware performance on the server, etc. PostgreSQL is used by Affilias to manage the .org and .info top-level domains (web-scale!) and also by Skype’s infrastructure (still, even after Microsoft bought them).
Finally as a part of your information pipeline, if you are processing huge amounts of data and need to do some preprocessing before sending to PostgreSQL, you might look at array-native db’s (for a NoSQL approach common in scientific work) or VoltDB (for an in-memory store for high-throughput processing). Despite the fact that they are extremely different systems, VoltDB and Postgres were actually started by the same individual.
Finally regarding benchmark charts, the major db vendors more or less ban publication of such in their license agreements so you won’t find them.