Hadoop writes a SequenceFile in key-value pair (record) format. Suppose we have a large, unbounded log file. Hadoop will split the file based on the block size and store the blocks on multiple data nodes. Is it guaranteed that each key-value pair resides in a single block? Or can the key end up in one block on node 1 and the value (or part of it) in a second block on node 2? If such meaningless splits can happen, what is the solution? Sync markers?
Another question: does Hadoop write sync markers automatically, or do we have to write them manually?
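To make the scenario concrete, here is a toy model of what I mean. It is not Hadoop code and does not use the real SequenceFile layout: the record format (two 4-byte lengths followed by key and value), the 8-byte `SYNC` marker, and the function names `write_records`, `straddlers`, and `read_split` are all assumptions made up for illustration. It also assumes the marker bytes never occur inside a key or value. The sketch cuts a byte stream at fixed block boundaries, reports which records straddle a boundary, and shows how a reader that is handed an arbitrary byte range could use sync markers to find the next record start.

```python
import struct

# Hypothetical 8-byte marker for this toy format (not Hadoop's real marker).
SYNC = b"\xff\xff\xff\xff" + b"SYNC"


def write_records(records, sync_every):
    """Serialize (key, value) byte pairs, emitting SYNC before every
    `sync_every`-th record. Returns (data, spans), where spans[i] is the
    (start, end) byte range of record i."""
    out = bytearray()
    spans = []
    for i, (key, value) in enumerate(records):
        if i % sync_every == 0:
            out += SYNC
        start = len(out)
        out += struct.pack(">II", len(key), len(value)) + key + value
        spans.append((start, len(out)))
    return bytes(out), spans


def straddlers(spans, block_size):
    """Indexes of records whose bytes fall into more than one block."""
    return [i for i, (s, e) in enumerate(spans)
            if s // block_size != (e - 1) // block_size]


def read_split(data, start, end):
    """Read the records 'owned' by the byte range [start, end): skip to the
    first SYNC at or after `start`, then keep reading (even past `end`)
    until a SYNC located at or after `end` is reached."""
    pos = data.find(SYNC, start)
    found = []
    while pos != -1 and pos < len(data):
        if data.startswith(SYNC, pos):
            if pos >= end:
                break  # the next split starts reading from this marker
            pos += len(SYNC)
            continue
        klen, vlen = struct.unpack_from(">II", data, pos)
        pos += 8
        key = data[pos:pos + klen]
        pos += klen
        value = data[pos:pos + vlen]
        pos += vlen
        found.append((key, value))
    return found


if __name__ == "__main__":
    recs = [(b"key%02d" % i, b"x" * 20) for i in range(10)]
    data, spans = write_records(recs, sync_every=3)
    block = 100
    print("records crossing a block boundary:", straddlers(spans, block))
    for s in range(0, len(data), block):
        print((s, s + block), [k for k, _ in read_split(data, s, s + block)])
```

In this toy model a record that crosses a boundary is read whole by the split in which its preceding sync marker lies, so the reader has to fetch a few bytes from the following block. Whether Hadoop behaves the same way is exactly what I am asking.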
I asked this question on the Hadoop mailing list. They answered:
Then I asked:
They answered: