By default, Hadoop splits the files to be processed by a Mapper on the file’s block boundaries; that is what the FileInputFormat implementation of getSplits() does. Hadoop then tries to run each Mapper on a Datanode that already holds a replica of the blocks it will process.
Now I’m wondering: if I need to read outside of this InputSplit (in a RecordReader, but that’s irrelevant), what does this cost me compared to reading inside the InputSplit, assuming that the data outside of it is not present on the reading Datanode?
EDIT:
In other words:
I am a RecordReader and have been assigned an InputSplit that spans one file block. I have a local copy of this file block (rather, the datanode I’m running on does), but not the rest of the file. Now I do need to read outside of this InputSplit, because I need to read the file header, which is at the very beginning. Then I need to skip across records in the file (by reading just the record headers, which tell me how long each record is, and then skipping that number of bytes). I need to do this until I encounter the first record that’s inside the InputSplit. Then I can start reading the actual records within my InputSplit. That is the only way to make sure that I start at a valid record boundary.
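To make the skipping procedure concrete, here is a minimal sketch in plain Java. The file layout is an assumption made up for illustration (an 8-byte file header, then records that each start with a 4-byte length), and the names RecordBoundaryScan, firstRecordAtOrAfter and buildFile are hypothetical. It runs against an in-memory byte array instead of HDFS; in a real RecordReader the stream would presumably be an FSDataInputStream, where seek(long) could replace the skip loop.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical layout (an assumption, not taken from the post): an 8-byte file
// header, then records of [4-byte big-endian payload length][payload bytes].
final class RecordBoundaryScan {
    static final int FILE_HEADER_BYTES = 8;
    static final int RECORD_HEADER_BYTES = 4;

    /** Offset of the first record that starts at or after splitStart, or -1 if there is none. */
    static long firstRecordAtOrAfter(byte[] file, long splitStart) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(file))) {
            in.readFully(new byte[FILE_HEADER_BYTES]); // the file header is always read
            long pos = FILE_HEADER_BYTES;
            while (pos < splitStart) {
                if (pos + RECORD_HEADER_BYTES > file.length) {
                    return -1; // no complete record header left
                }
                int payloadLength = in.readInt(); // read only the record header
                int remaining = payloadLength;
                while (remaining > 0) { // skipBytes may skip fewer bytes than requested
                    int skipped = in.skipBytes(remaining);
                    if (skipped <= 0) {
                        return -1; // truncated record
                    }
                    remaining -= skipped;
                }
                pos += RECORD_HEADER_BYTES + payloadLength;
            }
            return pos < file.length ? pos : -1;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Builds a test file with zero-filled payloads of the given lengths. */
    static byte[] buildFile(int... payloadLengths) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.write(new byte[FILE_HEADER_BYTES]);
            for (int length : payloadLengths) {
                out.writeInt(length);
                out.write(new byte[length]);
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With payloads of 10, 20 and 30 bytes, the records start at offsets 8, 22 and 46, so a split starting at byte 25 would begin reading at offset 46.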
Question: When I do read outside of the InputSplit, when is the data from the non-local file blocks copied? Is this done one byte at a time (i.e. once per call of InputStream.read()), or is the entire file block at the current InputStream position copied to my local datanode the first time I call InputStream.read(), and then again once I reach the next non-local file block? I need to know this so I can estimate how much overhead will be produced by skipping through the file.
Thanks 🙂
To the best of my understanding, if the data does not reside on the local datanode, the local datanode is not involved in reading it at all. The HDFS client asks the NameNode where the blocks are located and talks directly to the relevant datanodes to fetch them.
So the cost is, on the remote datanode: read from disk, calculate the CRC, and send the data over the network; and on the side of the code reading the data: receive it from the network.
I think the cluster-wide price is only network bandwidth and some CPU time spent on sending and receiving.
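For a rough feel of the network side of that cost, here is a back-of-the-envelope sketch in Java. It assumes the worst case, in which every byte in front of a split actually travels over the network while the reader skips forward; if the skipping is done with seeks, the real figure may be far lower. The class and method names are hypothetical.

```java
// Worst-case estimate (an assumption, not measured behaviour): a reader whose
// split starts at offset S pulls all S preceding bytes over the network.
final class SkipOverheadEstimate {

    /** Total bytes read outside their own split, summed over all block-sized splits of one file. */
    static long worstCaseExtraBytes(long fileLength, long blockSize) {
        long total = 0;
        for (long splitStart = 0; splitStart < fileLength; splitStart += blockSize) {
            total += splitStart; // everything in front of this split
        }
        return total;
    }
}
```

For a file of four 128 MB blocks this gives 0 + 128 + 256 + 384 = 768 MB of extra reads in total, which shows that the worst case grows quadratically with the number of splits.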