The project requires storing binary data in a PostgreSQL database (PostgreSQL is a project requirement). For that purpose we made a table with the following columns:
id : integer, primary key, generated by client
data : bytea, for storing client binary data
The client is a C++ program, running on Linux.
The rows must be inserted (initialized with a chunk of binary data), and after that updated (concatenating additional binary data to the data field).
Simple tests have shown that this yields better performance.
Depending on your input, we will make the client use either concurrent threads to insert / update data (with different DB connections), or a single thread with only one DB connection.
We don’t have much experience with PostgreSQL, so could you give us some pointers on possible bottlenecks, and on whether using multiple threads to insert data is better than using a single thread?
Thank you 🙂
Edit 1:
More detailed information:
- there will be only one client accessing the database, using only one Linux process
- database and client are on the same high performance server, but this must not matter: the client must be fast on any machine, without additional client configuration
- we will get a new stream of data every 10 seconds, and each stream will provide 16000 new bytes every 0.5 seconds (CBR, but we can use buffering and only do inserts every 4 seconds max)
- stream will last anywhere between 10 seconds and 5 minutes
It makes extremely little sense that you should get better performance inserting a row then appending to it if you are using bytea.

PostgreSQL’s MVCC design means that an UPDATE is logically equivalent to a DELETE and an INSERT. When you insert the row then update it, what’s happening is that the original tuple you inserted is marked as deleted and a new tuple is written that contains the concatenation of the old and added data.

I question your testing methodology: can you explain in more detail how you determined that insert-then-append was faster? It makes no sense.
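To put a rough number on why insert-then-append looks expensive, here is a minimal sketch of the write volume under a deliberately simplified model: each append is assumed to write a complete new copy of the accumulated value, and WAL, page overhead and TOAST compression are ignored. The function names are made up for this illustration.

```cpp
#include <cstdint>

// Simplified model (an assumption, not a measurement): every UPDATE that
// appends a chunk writes a new row version holding old + new data.
constexpr std::uint64_t bytes_written_by_appending(std::uint64_t chunks,
                                                   std::uint64_t chunk_bytes) {
    // Row versions hold 1, 2, ..., chunks chunks in turn, so the total is
    // chunk_bytes * (1 + 2 + ... + chunks) = chunk_bytes * chunks*(chunks+1)/2.
    return chunk_bytes * (chunks * (chunks + 1) / 2);
}

// Alternative: each chunk is inserted exactly once as its own row.
constexpr std::uint64_t bytes_written_by_inserting(std::uint64_t chunks,
                                                   std::uint64_t chunk_bytes) {
    return chunk_bytes * chunks;
}
```

For a 5 minute stream of 16000-byte chunks arriving every 0.5 seconds (600 chunks), this model gives 2,884,800,000 bytes written when appending versus 9,600,000 bytes when inserting each chunk once, which is why the append pattern scales quadratically with stream length.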
Beyond that, I think this question is too broad as written to really say much of use. You’ve given no details or numbers; no estimates of binary data size, rowcount estimates, client count estimates, etc.
bytea insert performance is no different to any other insert performance tuning in PostgreSQL. All the same advice applies: Batch work into transactions, use multiple concurrent sessions (but not too many; rule of thumb is number_of_cpus + number_of_hard_drives) to insert data, avoid having transactions use each others’ data so you don’t need UPDATE locks, use async commit and/or a commit_delay if you don’t have a disk subsystem with a safe write-back cache like a battery-backed RAID controller, etc.

Given the updated stats you provided in the main comments thread, the amount of data you want to consume sounds entirely practical with appropriate hardware and application design. Your peak load might be achievable even on a plain hard drive if you had to commit every block that came in, since it’d require about 60 transactions per second. You could use a commit_delay to achieve group commit and significantly lower fsync() overhead, or even use synchronous_commit = off if you can afford to lose a time window of transactions in case of a crash.

With a write-back caching storage device like a battery-backed cache RAID controller or an SSD with reliable power-loss-safe cache, this load should be easy to cope with.
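One way to arrive at a figure of about 60 transactions per second from the numbers in the question is sketched below. It assumes the worst case where every stream runs for the full 5 minutes and every incoming block is committed on its own; the constant and function names are illustrative only.

```cpp
#include <cstdint>

// Figures taken from the question.
constexpr std::uint64_t kStreamStartIntervalSec = 10;    // a new stream every 10 s
constexpr std::uint64_t kMaxStreamDurationSec   = 300;   // streams last up to 5 min
constexpr std::uint64_t kBlocksPerSecond        = 2;     // one block per 0.5 s
constexpr std::uint64_t kBlockBytes             = 16000; // bytes per block

// Worst case: all streams last 5 minutes, so 300 / 10 streams overlap.
constexpr std::uint64_t peak_concurrent_streams() {
    return kMaxStreamDurationSec / kStreamStartIntervalSec;
}

// One commit per incoming block, per stream.
constexpr std::uint64_t peak_commits_per_second() {
    return peak_concurrent_streams() * kBlocksPerSecond;
}

// Payload volume the database has to absorb at peak.
constexpr std::uint64_t peak_bytes_per_second() {
    return peak_commits_per_second() * kBlockBytes;
}
```

Under these assumptions there are 30 overlapping streams, 60 commits per second and 960,000 payload bytes per second. Buffering each stream for the permitted 4 seconds before inserting would cut the commit rate to roughly 30 / 4 = 7.5 per second, at the cost of larger rows per insert.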
I haven’t benchmarked different scenarios for this, so I can only speak in general terms. If designing this myself, I’d be concerned about checkpoint stalls with PostgreSQL, and would want to make sure I could buffer a bit of data. It sounds like you can so you should be OK.
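As a rough illustration of what "buffer a bit of data" could look like in the C++ client, here is a minimal sketch of a bounded in-memory queue between the thread reading a stream and the thread writing to the database. `ChunkBuffer`, `push` and `drain` are hypothetical names, not part of any PostgreSQL client library, and the actual database write is left out.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <iterator>
#include <mutex>
#include <utility>
#include <vector>

using Chunk = std::vector<unsigned char>;

// Hypothetical helper: absorbs short database stalls by queueing chunks.
class ChunkBuffer {
public:
    explicit ChunkBuffer(std::size_t max_chunks) : max_chunks_(max_chunks) {
        assert(max_chunks > 0);
    }

    // Called by the stream reader. Returns false when the buffer is full,
    // so the caller decides how to handle overload.
    bool push(Chunk chunk) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (queue_.size() >= max_chunks_) return false;
        queue_.push_back(std::move(chunk));
        ready_.notify_one();
        return true;
    }

    // Called by the database writer. Blocks until at least one chunk is
    // queued, then hands back everything queued so far in arrival order,
    // so the whole batch can go into a single transaction.
    std::vector<Chunk> drain() {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return !queue_.empty(); });
        std::vector<Chunk> batch(std::make_move_iterator(queue_.begin()),
                                 std::make_move_iterator(queue_.end()));
        queue_.clear();
        return batch;
    }

private:
    const std::size_t max_chunks_;
    std::mutex mutex_;
    std::condition_variable ready_;
    std::deque<Chunk> queue_;
};
```

The design choice here is that a stall simply makes the next batch larger: while the writer is blocked on the database, chunks accumulate, and the following `drain()` returns them all at once. The capacity limit keeps memory bounded if the stall lasts longer than expected.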
Here’s the first approach I’d test, benchmark and load-test, as it’s in my view probably the most practical:
One connection per data stream, synchronous_commit = off + a commit_delay. INSERT each 16kb record as it comes in into a staging table (if possible UNLOGGED or TEMPORARY if you can afford to lose incomplete records) and let Pg synchronize and group up commits. When each stream ends, read the byte arrays, concatenate them, and write the record to the final table.

For absolutely best speed with this approach, implement a bytea_agg aggregate function for bytea as an extension module (and submit it to PostgreSQL for inclusion in future versions). In reality it’s likely you can get away with doing the bytea concatenation in your application by reading the data out, or with the rather inefficient and nonlinearly scaling:

You would want to be sure to tune your checkpointing behaviour, and if you were using an ordinary or UNLOGGED table rather than a TEMPORARY table to accumulate those 16kb records, you’d need to make sure it was being quite aggressively VACUUMed.

See also: