My data input files are all of the same length, but, the records therein may span two files (starting at the end of the first file and finishing at the beginning of the second).
Is it possible to create an inputsplit that would allow me to span those two files?
Is it better to create an entirely new set of files so that records do not span more than one file?
I would definitely ensure your records do not span more than one file: you could, theoretically, write your own input format that takes care of this, but the overhead is likely to be considerable as you are – in having to ensure that you know which files belong together – taking over part of the responsiblity which the jobtracker and name node fulfill for you.
You should be free to tell the jobtracker/name node where the inputs are, and for the processing to be truly parallel, you don’t want to then have to take back some of that control: IMHO it would partially defeat the object of using haoop in the first place.