Background
A lot of work has gone into optimizing database design, especially into finding the best ways to read and write data on disk (both spinning disks and SSDs).
The knowledge that has come out of that work suggests that reading and writing on block boundaries, matching the block size of the filesystem you are running on, is the optimal approach.
Question
Say I am operating in a relatively low-memory environment and want to use a small 32MB memory-mapped window to read and write the contents of a huge 500GB file.
If I were using Java’s NIO mechanisms, specifically MappedByteBuffer (Java’s memory-mapped file mechanism), would I need to take care to execute READ and WRITE operations on block boundaries (e.g. 4KB), pulling whole blocks into memory before parsing out the data I needed? Or can I just issue R/W ops at any location I want and let the operating system, VM paging logic, filesystem and storage firmware handle the optimization of the operations and the culling of any extra block data I didn’t need?
Additional Detail
The reason for the question is that in database design I see this obsessive focus on block optimization, to the point that there doesn’t seem to exist a world where you would ever just read and write data without the concept of a block.
What confuses me is that the filesystem is the one enforcing the block units of operation, so why would my higher-level app need to worry about this? If I want the 17,631 bytes at offset 71, can’t I just grab them and read them in, or is it really faster for me to figure out that the read operation starts in block 0 and falls across the boundaries of blocks 0 through 4 (assuming 4KB blocks), read all 5 of those blocks into an internal byte[], then cull out the 17,631 bytes I wanted in the first place?
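To make the arithmetic in that example concrete, here is a minimal Java sketch of the block-span calculation. The class and method names (`BlockSpan`, `firstBlock`, `lastBlock`, `blockCount`) are hypothetical and only for illustration, and the block size is a parameter because the real value depends on the filesystem.

```java
/** Hypothetical helper: which fixed-size blocks does a byte range touch? */
final class BlockSpan {
    private BlockSpan() {}

    /** Index of the block that contains the first byte of the range. */
    static long firstBlock(long offset, int blockSize) {
        return offset / blockSize;
    }

    /** Index of the block that contains the last byte of the range (length must be > 0). */
    static long lastBlock(long offset, long length, int blockSize) {
        return (offset + length - 1) / blockSize;
    }

    /** Number of whole blocks that would have to be read to cover the range. */
    static long blockCount(long offset, long length, int blockSize) {
        return lastBlock(offset, length, blockSize) - firstBlock(offset, blockSize) + 1;
    }

    public static void main(String[] args) {
        long offset = 71;
        long length = 17_631;
        // Assumed block sizes; the real one depends on the filesystem in use.
        for (int blockSize : new int[] {4096, 8192}) {
            System.out.printf("blockSize=%d: blocks %d..%d (%d blocks)%n",
                    blockSize,
                    firstBlock(offset, blockSize),
                    lastBlock(offset, length, blockSize),
                    blockCount(offset, length, blockSize));
        }
    }
}
```

The last byte of the range sits at offset 17,701, so with 4KB blocks the range touches blocks 0 through 4 (5 blocks), while with 8KB blocks it touches blocks 0 through 2 (3 blocks).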
If the literature on DB design wasn’t so religious about this block idea, the question would have never come up in my mind, but because it is, I am wondering if I am missing a critical detail here WRT filesystems and optimal block device I/O.
Thank you for reading.
I think part of the reason databases have awareness of a block size (which may not be exactly the same as the fs block size, but of course should align with it) is not just to perform block-aligned I/O, but also to manage how the disk data is cached in memory rather than just relying on the OS caching. Some databases bypass the OS filesystem cache completely, in fact. Having the database manage the cache sometimes allows greater intelligence as to how that cache is utilised, which the OS might not be able to provide.
An RDBMS will typically take account of the number of blocks that could be read or written during a query in order to compare different execution plans, and the possibility that all the data can be fetched from the same block can be a useful optimisation to take note of.
Most databases I’m familiar with have the concept of a block cache/buffer where some portion of the working set of the database lives. A cache made up entirely of arbitrary extents could be quite a bit harder to manage. Also, many databases actually arrange their stored data as a sequence of blocks, so the I/O pattern grows out of that. Of course, this might simply be a legacy of databases originally written for platforms that didn’t have rich OS caching facilities…
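As a rough illustration of why fixed-size blocks make such a cache easy to manage, here is a minimal sketch of an LRU block cache in Java. Everything in it (`BlockCache`, the `loader` callback, the miss counter) is hypothetical and not taken from any particular database; real buffer pools also deal with pinning, dirty pages and concurrency, which are left out here.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.LongFunction;

/** Hypothetical LRU cache keyed by block number; not thread-safe. */
final class BlockCache {
    private final int capacity;                 // maximum number of cached blocks
    private final LongFunction<byte[]> loader;  // assumed callback that reads one block from storage
    private final LinkedHashMap<Long, byte[]> blocks;
    private long misses;

    BlockCache(int capacity, LongFunction<byte[]> loader) {
        this.capacity = capacity;
        this.loader = loader;
        // accessOrder = true makes iteration order least-recently-used first.
        this.blocks = new LinkedHashMap<Long, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, byte[]> eldest) {
                return size() > BlockCache.this.capacity;
            }
        };
    }

    /** Returns the block, loading it on a miss and evicting the LRU block if the cache is full. */
    byte[] get(long blockNo) {
        byte[] block = blocks.get(blockNo);
        if (block == null) {
            misses++;
            block = loader.apply(blockNo);
            blocks.put(blockNo, block);
        }
        return block;
    }

    long misses() {
        return misses;
    }

    /** containsKey does not count as an access, so it leaves the LRU order untouched. */
    boolean isCached(long blockNo) {
        return blocks.containsKey(blockNo);
    }
}
```

Because every entry is the same size, the capacity is just a block count and eviction is a single map operation. With arbitrary extents the cache would also have to handle overlapping ranges and variable-size accounting.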
Trying to conclude this ramble with some sort of answer to your question… my feeling would be that reading from arbitrary extents within the mapped file and letting the OS deal with the extra slop should be fine. Performance-wise, it’s probably more important to try to let the OS do read-ahead, e.g. by using the “advise” calls (such as madvise or posix_fadvise on POSIX systems) so the OS can start reading the next extent from disk while you process the current one. And, of course, you want a way to advise the OS to uncache extents you’ve finished with.
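For what it’s worth, reading an arbitrary extent through a small mapped window might look roughly like the sketch below. It is only a sketch under stated assumptions: `WindowedReader` and its 32MB `WINDOW_SIZE` are hypothetical names chosen to match the question, the reader is read-only and not thread-safe, and old windows are simply dropped and left for the JVM to unmap whenever it collects them.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical reader that serves arbitrary extents from a sliding mapped window. */
final class WindowedReader implements AutoCloseable {
    static final int WINDOW_SIZE = 32 * 1024 * 1024; // 32MB window, as in the question

    private final FileChannel channel;
    private MappedByteBuffer window;
    private long windowStart = -1;

    WindowedReader(Path file) throws IOException {
        this.channel = FileChannel.open(file, StandardOpenOption.READ);
    }

    /** Reads length bytes starting at offset; the caller does no block alignment at all. */
    byte[] read(long offset, int length) throws IOException {
        if (length < 0 || length > WINDOW_SIZE) {
            throw new IllegalArgumentException("length must be between 0 and WINDOW_SIZE");
        }
        if (offset < 0 || offset + length > channel.size()) {
            throw new IllegalArgumentException("extent lies outside the file");
        }
        boolean outsideWindow = window == null
                || offset < windowStart
                || offset + length > windowStart + window.capacity();
        if (outsideWindow) {
            remap(offset);
        }
        byte[] out = new byte[length];
        window.position((int) (offset - windowStart));
        window.get(out); // any page faults here are resolved by the OS, a page at a time
        return out;
    }

    /** Maps a new window starting exactly at the requested offset. */
    private void remap(long offset) throws IOException {
        long size = Math.min(WINDOW_SIZE, channel.size() - offset);
        window = channel.map(FileChannel.MapMode.READ_ONLY, offset, size);
        windowStart = offset;
    }

    @Override
    public void close() throws IOException {
        channel.close();
    }
}
```

Note that MappedByteBuffer does not expose the “advise” calls directly. The closest standard method is `load()`, which is a best-effort request to bring the mapped region into physical memory, so finer-grained read-ahead hints would need native code or a third-party library.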