If one were to issue a sequence of write(2) calls on Linux/Unix, separated by fdatasync(2), fsync(2), or sync(2), is it guaranteed that the first write() is committed to disk before the second write() is issued? The following SO post seems to say that such guarantees cannot be given, since there are multiple caching layers involved. For database systems that guarantee consistency this is important: in WAL (Write-Ahead Logging) recovery, the log records must be persisted on disk before the data itself is changed, so that after an application or system failure you can revert to the last known consistent state. How is this ensured or implemented in an actual database system?
The sync() system call is practically no help whatsoever; it promises to schedule the write-to-disk operations, but that's about all.

The normal technique is to set the correct options when you open() the file descriptor for the disk file: O_DSYNC, O_RSYNC, O_SYNC. However, fsync() and fdatasync() get pretty close to the same effects. You can also look at direct I/O (O_DIRECT on Linux, directio() on Solaris), which is often supported, though it is not standardized at all by POSIX.

Ultimately, the DBMS relies on the O/S to guarantee that data written and synchronized to disk is secure. As long as the device will always return what the DBMS last wrote, it isn't critical whether the data is on the actual disk yet or is still held in a cache (for example, a non-volatile cache). If, on the other hand, you have NAS (network attached storage) that doesn't guarantee that what you last wrote (and were told was safe on disk) is returned when you read it, then your DBMS can suffer if it has to do recovery. So you choose where you store your DBMS with care, making sure the storage works sensibly. If the storage does not behave sufficiently like the hypothetical disk, you can end up losing data.
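
To make the ordering from the question concrete, here is a minimal sketch in C of the log-first pattern using fsync(). The function names (write_all, logged_update), the file layout, and the record format are hypothetical and chosen only for illustration; this is not how any particular DBMS implements its log. It also assumes the storage honours fsync(), as discussed above.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Write the whole buffer, retrying on short writes and EINTR.
 * Returns 0 on success, -1 on error. */
static int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Hypothetical helper: append a log record, wait until the O/S reports
 * it as synchronized, and only then write the data file.
 * Returns 0 on success, -1 on error. */
int logged_update(const char *log_path, const char *data_path,
                  const char *record, const char *new_data)
{
    int rc = -1;

    int log_fd = open(log_path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (log_fd < 0)
        return -1;

    int data_fd = open(data_path, O_WRONLY | O_CREAT, 0644);
    if (data_fd < 0) {
        close(log_fd);
        return -1;
    }

    /* Step 1: the log record goes first. fsync() does not return
     * until the kernel has pushed the file's data to the device. */
    if (write_all(log_fd, record, strlen(record)) != 0)
        goto out;
    if (fsync(log_fd) != 0)
        goto out;

    /* Step 2: the data file is touched only after the log is synced. */
    if (write_all(data_fd, new_data, strlen(new_data)) != 0)
        goto out;
    if (fsync(data_fd) != 0)
        goto out;

    /* Note: for a newly created file, the directory entry may need its
     * own fsync() on the containing directory; omitted in this sketch. */
    rc = 0;
out:
    close(data_fd);
    close(log_fd);
    return rc;
}
```

A call such as logged_update("wal.log", "table.dat", "SET x=1\n", "x=1\n") returns 0 only if both fsync() calls succeeded. Opening the log with O_DSYNC or O_SYNC instead would make each write() itself synchronous, at the cost of one device round trip per write.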