I think my question would be best explained with an example. Say that you’re storing an image on HDFS. That image is large enough that it’s split into four separate, smaller files on HDFS. When you perform an operation that returns that image, does Hadoop return those 4 small files that can be combined back into the original image? Or does Hadoop automatically recombine the 4 small files back into the original?
Thank you!
Hadoop Distributed File System (HDFS) stores each file in one or more blocks (with each block being replicated one or more times).
For every file, you can configure the file block size and the replication factor (default values are used if not provided).
When you do any file-based operation, you’re dealing with streams of data. The NameNode is the central repository that maps file paths to blocks and to the locations of those blocks (the DataNodes).
As an example, say you have a file block size of 32 MB and a 50 MB file: it will be split into 2 blocks (32 MB and 18 MB). If the configured replication factor of the file is, say, 3, then the NameNode will try to ensure that each block is replicated to 3 DataNodes in your cluster.
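The block arithmetic above can be checked with a few lines of Python. This is only a sketch of the splitting rule, not Hadoop code; `split_into_blocks` is a made-up helper name, and 1 MB is assumed to mean 1024 * 1024 bytes.

```python
MB = 1024 * 1024  # assumption: 1 MB = 1024 * 1024 bytes


def split_into_blocks(file_size, block_size):
    """Return a list of (offset, length) pairs, one per block.

    Every block is full-sized except possibly the last one,
    which holds whatever bytes remain.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


# The 50 MB file with a 32 MB block size from the example above.
blocks = split_into_blocks(50 * MB, 32 * MB)
sizes_mb = [length // MB for _, length in blocks]
print(sizes_mb)  # [32, 18]
```

Note that the last block only takes up the 18 MB it actually holds; it is not padded out to the full block size.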
When you read from this file, you get back an FSDataInputStream, which lets you seek to a given byte position in the file. The DFSClient abstracts the details away from you: for any byte offset it knows which block that offset falls in, and it seamlessly fetches the bytes from a DataNode holding that block (even as you cross block boundaries).
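To make the "seamless" part concrete, here is a toy model in Python of a seekable stream that maps a byte offset to a block and keeps reading across block boundaries. It is not the Hadoop client: `StitchedReader` is a hypothetical name, and the blocks are plain in-memory `bytes` objects rather than data fetched from DataNodes.

```python
import bisect


class StitchedReader:
    """Toy seekable stream over a list of blocks (bytes objects)."""

    def __init__(self, blocks):
        self.blocks = blocks
        # starts[i] is the file offset of the first byte of block i.
        self.starts = []
        total = 0
        for block in blocks:
            self.starts.append(total)
            total += len(block)
        self.size = total
        self.pos = 0

    def seek(self, offset):
        if not 0 <= offset <= self.size:
            raise ValueError("offset out of range")
        self.pos = offset

    def read(self, n):
        out = bytearray()
        while n > 0 and self.pos < self.size:
            # Find the block that contains the current position.
            i = bisect.bisect_right(self.starts, self.pos) - 1
            within = self.pos - self.starts[i]
            # Take what this block can give, then move on to the next one.
            chunk = self.blocks[i][within:within + n]
            out += chunk
            self.pos += len(chunk)
            n -= len(chunk)
        return bytes(out)


# A 10-byte "file" stored as three blocks of 4, 4 and 2 bytes.
data = bytes(range(10))
reader = StitchedReader([data[0:4], data[4:8], data[8:10]])
reader.seek(3)
crossing = reader.read(3)  # starts in block 0, ends in block 1
print(list(crossing))  # [3, 4, 5]
```

The caller only ever sees `seek` and `read`; which block a byte came from never shows up in the result.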
So to summarize and address your question: to the client reading from HDFS, the file looks like one continuous input stream, but behind the scenes it’s actually 4 blocks (in your example) stitched together as required.