I am learning to program in MPI and I came across this question. Lets say I have a .txt file with 100,000 rows/lines, how do I chunk them for processing by 4 processors? i.e. I want to let processor 0 take care of the processing for lines 0-25000, processor 1 to take care of 25001-50000 and so on. I did some searching and did came across MPI_File_seek but I am not sure can it work on .txt and supports fscanf afterwards.
Share
Text isn’t a great format for parallel processing exactly because you don’t know ahead of time where (say) line 25001 begins. So these sorts of problems are often dealt with ahead of time through some preprocessing step, either building an index or partitioning the file into the appropriate number of chunks for each process to read.
If you really want to do it through MPI, I’d suggest using MPI-IO to read in overlapping chunks of the text file onto the various processors, where the overlap is much longer than you expect your longest line to be, and then have each processor agree on where to start; eg, you could say that the first (or last) new line in the overlap region shared by processes N and N+1 is where process N leaves off and N+1 starts.
To follow this up with some code,
Running this on a narrow version of the text of the question, we get