Is it possible to invoke GNU Parallel so that it repeats the first line of the original input on the STDIN of each child job?
I have a CSV file that contains a header line at the top. For example:
> cat large.csv
id,count
abc,123
def,456
I have a tool that can extract columns by name rather than position:
> csv_extract large.csv count
123
456
I can sum the values serially as:
> csv_extract large.csv count | awk '{ SUM += $1 } END { print SUM }'
579
The actual file I have is much larger, and the operation is more complex than summing, but the same principles apply. I’d like to use GNU Parallel to process the file, but I don’t know whether it is possible to tell GNU Parallel to repeat the CSV header for each job.
Ideally I could run the operation with something like:
> cat large.csv | parallel --pipe --repeat-first-line "csv_extract /dev/stdin count | awk '{ SUM += \$1 } END { print SUM }'"
579
I’ve made up the --repeat-first-line option above to represent the functionality I cannot figure out. I’ve watched the YouTube videos and read the man page, but I’m just not able to see how it can be done, if it is possible at all.
Thanks!
- danboo
Today you can use --skip-first-line and add the header using echo.
In a future version you will have the option '--header', which will be a regexp that matches the end of your header (e.g. '\n' for one line, '\n.*\n' for two lines, or '---' for up to and including the first ---).
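As a rough sketch of the --skip-first-line workaround described above: GNU Parallel drops the original header, and each job puts it back with echo before extracting the column. The sample data is taken from the question. The script extract_col.sh is a hypothetical stand-in for csv_extract, and the header text id,count is hard-coded by assumption, which is the main weakness of this approach.

```shell
#!/bin/sh
# Sketch only: re-add the CSV header by hand in every job.
workdir=$(mktemp -d)

# Sample data from the question.
printf 'id,count\nabc,123\ndef,456\n' > "$workdir/large.csv"

# Hypothetical stand-in for csv_extract: prints one named column
# from CSV (with a header line) read on stdin.
cat > "$workdir/extract_col.sh" <<'EOF'
#!/bin/sh
awk -F, -v col="$1" '
  NR == 1 { for (i = 1; i <= NF; i++) if ($i == col) idx = i; next }
  idx     { print $idx }
'
EOF
chmod +x "$workdir/extract_col.sh"

# --skip-first-line discards the header from the input stream;
# (echo ...; cat) prepends it again to each chunk passed by --pipe.
# Each job prints a partial sum; the last awk adds the partial sums.
total=$(
  parallel --skip-first-line --pipe \
    "(echo 'id,count'; cat) | $workdir/extract_col.sh count | awk '{ s += \$1 } END { print s }'" \
    < "$workdir/large.csv" |
  awk '{ s += $1 } END { print s }'
)
echo "$total"
```

For the sample data this should print 579, the same as the serial pipeline. Note that every job emits only a partial sum, so the final awk stage is needed once the input is large enough to be split into several chunks.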
--- Edit ---
The newest version of GNU Parallel can now do this with the --header option.
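A minimal sketch of how that might look with the sample data from the question, assuming a GNU Parallel release that supports --header together with --pipe. Here --header : is used as the shorthand for "the first line is the header". The script extract_col.sh is a hypothetical stand-in for csv_extract, not the real tool.

```shell
#!/bin/sh
# Sketch only: let GNU Parallel repeat the header for every chunk.
workdir=$(mktemp -d)

# Sample data from the question.
printf 'id,count\nabc,123\ndef,456\n' > "$workdir/large.csv"

# Hypothetical stand-in for csv_extract: prints one named column
# from CSV (with a header line) read on stdin.
cat > "$workdir/extract_col.sh" <<'EOF'
#!/bin/sh
awk -F, -v col="$1" '
  NR == 1 { for (i = 1; i <= NF; i++) if ($i == col) idx = i; next }
  idx     { print $idx }
'
EOF
chmod +x "$workdir/extract_col.sh"

# --header : treats the first input line as the header; with --pipe
# that line is prepended to every chunk, so each job sees a full CSV.
# Each job prints a partial sum; the last awk adds the partial sums.
total=$(
  parallel --header : --pipe \
    "$workdir/extract_col.sh count | awk '{ s += \$1 } END { print s }'" \
    < "$workdir/large.csv" |
  awk '{ s += $1 } END { print s }'
)
echo "$total"
```

For the sample data this should print 579. Unlike the echo workaround, nothing about the header is hard-coded, so the same command line works for any CSV whose header is a single line.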