I want to split large, compressed CSV files into multiple smaller gzip files, split on line boundaries.
I’m trying to pipe gunzip to a bash script with a while read LINE loop. That script writes to a named pipe where a background gzip process recompresses the data. Every X characters read, I close the FD and start a new gzip process for the next split.
But in this scenario the script, with its while read LINE loop, is consuming 90% of the CPU because read is so inefficient here (I understand that it makes a system call to read 1 char at a time).
Any thoughts on doing this efficiently? I would expect gzip to consume the majority of the CPU.
Use `split` with the `-l` option to specify how many lines you want in each piece. Use the `--filter` option to pipe each piece through a command such as `gzip` instead of writing it straight to a file. In the filter command, `$FILE` is the name `split` would have used for the output file; it has to be quoted with single quotes to prevent the shell from expanding it too early.

If you need any additional processing, just write a script that accepts the filename as an argument and processes standard input accordingly, and use that instead of plain `gzip`.
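As a rough sketch of the `split --filter` approach, something like the following should work, assuming GNU coreutils `split` (`--filter` is a GNU extension). The function name `split_gz`, the input name `big.csv.gz`, the chunk size, and the `part_` prefix are placeholders for illustration, not anything from the question.

```shell
# Split a gzip-compressed file into gzip-compressed pieces of N lines each.
# Assumes GNU split; split_gz and its argument names are hypothetical.
split_gz() {
    input=$1    # compressed input file, e.g. big.csv.gz
    lines=$2    # number of lines per output piece
    prefix=$3   # output name prefix, e.g. part_

    # split reads the decompressed stream from stdin ("-") and, for every
    # piece, runs the filter with $FILE set to the name it would have used
    # (part_aa, part_ab, ...). The single quotes keep the calling shell from
    # expanding $FILE before split sees it.
    gunzip -c "$input" | split -l "$lines" --filter='gzip > "$FILE.gz"' - "$prefix"
}

# Example call (placeholder values):
#   split_gz big.csv.gz 1000000 part_
```

This way the line splitting happens inside `split` rather than in a shell `read` loop, so most of the CPU time should go to `gzip`. To do extra processing, the `gzip > "$FILE.gz"` part could be swapped for a call to your own script, passing `"$FILE"` as its argument.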