Current Process:
- I have a tar.gz file. (Actually, I have about 2000 of them, but that’s another story.)
- I make a temporary directory, extract the tar.gz file, revealing 100,000 tiny files (around 600 bytes each).
- For each file, I cat it into a processing program, pipe that loop into another analysis program, and save the result.
The temporary space on the machines I’m using can barely handle one of these processes at once, never mind the 16 (hyperthreaded dual quad core) that they get sent by default.
I’m looking for a way to do this process without saving to disk. I believe the performance penalty for individually pulling files using tar -xf $file -O <targetname> would be prohibitive, but it might be what I’m stuck with.
Is there any way of doing this?
EDIT: Since two people have already made this mistake, I’m going to clarify:
- Each file represents one point in time.
- Each file is processed separately.
- Once processed (in this case a variant on Fourier analysis), each gives one line of output.
- This output can be combined to do things like autocorrelation across time.
EDIT2: Actual code:
for f in posns/*; do
    ~/data_analysis/intermediate_scattering_function < "$f"
done | ~/data_analysis/complex_autocorrelation.awk limit=1000 > inter_autocorr.txt
This sounds like a case where the right tool for the job is probably not a shell script. Python has a tarfile module which can operate in streaming mode, letting you make only a single pass through the large archive and process its files, while still being able to distinguish the individual files (which the tar --to-stdout approach will not).
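A minimal sketch of that idea, assuming each archive member should be fed to the processing program on stdin exactly as the shell loop does. The names iter_member_bytes, run_filter and main are made up for illustration. The "r|gz" mode is what asks tarfile for a forward-only stream, so nothing is written to temporary space. Note that members come out in archive order, which may differ from the sorted order that the posns/* glob gives.

```python
import subprocess
import sys
import tarfile


def iter_member_bytes(archive_path):
    """Yield (name, data) for each regular file, in archive order, in one pass."""
    # "r|gz" reads the gzip-compressed archive as a forward-only stream:
    # nothing is extracted to disk and no seeking is needed.
    with tarfile.open(archive_path, mode="r|gz") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories, links, and other non-files
            # In stream mode a member has to be read before moving on.
            yield member.name, archive.extractfile(member).read()


def run_filter(command, data):
    """Feed data to the command's stdin and return its stdout as bytes."""
    result = subprocess.run(
        command, input=data, stdout=subprocess.PIPE, check=True
    )
    return result.stdout


def main(argv):
    # Hypothetical command line: ARCHIVE.tar.gz PROGRAM [ARGS...]
    archive_path, command = argv[0], argv[1:]
    for _name, data in iter_member_bytes(archive_path):
        # One chunk of output per file, concatenated like the shell loop.
        sys.stdout.buffer.write(run_filter(command, data))


if __name__ == "__main__" and len(sys.argv) >= 3:
    main(sys.argv[1:])
```

Saved as, say, stream_tar.py (a hypothetical name, as is posns.tar.gz), it would slot into the existing pipeline in place of the for loop:

python stream_tar.py posns.tar.gz ~/data_analysis/intermediate_scattering_function | ~/data_analysis/complex_autocorrelation.awk limit=1000 > inter_autocorr.txt

This still starts one process per file, as the original loop does. It only removes the need to extract the 100,000 files to disk first.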