Just wondering if there is a faster way to split a file into N chunks than Unix “split”.
Basically I have large files which I would like to split into smaller chunks and operate on each one in parallel.
I assume you’re using split -b, which will be more CPU-efficient than splitting by lines, but still reads the whole input file and writes it out to each file. If the serial nature of the execution of this portion of split is your bottleneck, you can use dd to extract the chunks of the file in parallel. You will need a distinct dd command for each parallel process. Here’s one example command line (assuming the_input_file is a large file, this extracts a bit from the middle):
To make this work you will need to choose appropriate values of count and bs (those above are very small). Each worker will also need to choose a different value of skip so that the chunks don’t overlap. But this is efficient; dd implements skip with a seek operation.
Of course, this is still not as efficient as implementing your data consumer process in such a way that it can read a specified chunk of the input file directly, in parallel with other similar consumer processes. But I assume if you could do that you would not have asked this question.
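A minimal sketch of the per-worker dd approach described above, assuming a POSIX shell. The function name parallel_chunks, the 1 MiB default block size, and the prefix.N output naming are illustrative choices rather than anything from the answer itself, and the numbers in the hand-written command in the comment are made up.

```shell
#!/bin/sh
# Sketch only: extract N chunks of a file in parallel, one dd process per chunk.
# A single worker run by hand might look like this (illustrative values):
#   dd if=the_input_file of=chunk.2 bs=1048576 skip=200 count=100

parallel_chunks() {
    input=$1            # file to split
    n=$2                # number of chunks, i.e. number of parallel dd processes
    prefix=$3           # outputs are named "$prefix.0", "$prefix.1", ...
    bs=${4:-1048576}    # block size in bytes (assumed default: 1 MiB)

    # Size in bytes; the arithmetic expansion strips any padding from wc.
    size=$(( $(wc -c < "$input") ))

    # Round up so the final partial block and partial chunk are covered.
    total_blocks=$(( (size + bs - 1) / bs ))
    per_chunk=$(( (total_blocks + n - 1) / n ))

    i=0
    while [ "$i" -lt "$n" ]; do
        # Each worker gets its own skip value, so the chunks do not overlap.
        dd if="$input" of="$prefix.$i" bs="$bs" \
           skip=$(( i * per_chunk )) count="$per_chunk" 2>/dev/null &
        i=$(( i + 1 ))
    done

    # Block until every background dd has finished.
    wait
}
```

For example, parallel_chunks the_input_file 4 chunk would write chunk.0 through chunk.3. Chunk boundaries fall on block offsets, not line boundaries, so a line-oriented consumer has to cope with records cut at the edges. Leaving out of= makes dd write to standard output, so each worker can also be piped straight into a consumer instead of writing an intermediate file.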