I have a reasonably large data set (10K files, each with 20K lines). I need to swap files and lines, giving me 20K files, each with 10K lines.
I had a solution that combined it all into one massive table and then extracted the columns with cut, but cut was taking too long (scanning through a 4GB file 10K times isn’t exactly fast, even if the file is sitting in cache).
So I wrote a (surprisingly simple) once-through in awk:
awk '{ print >> "times/"FNR".txt" }' posns/*
This does the job, but is also rather slow (about 10s per input file). My guess is that it is doing field separation, despite the fact that I don’t need that at all. Is there a way to disable that feature to speed it up, or am I going to have to write up a solution in yet another language?
If it helps, while I’d prefer a general solution, each line in each file is of the form %d %lf %lf, so lines will be at most 21 bytes in this case (the floats are all less than 100, and the integer is 0 or 1).
Eventually I gave up on the pretty shell method, and wrote another version in C. It’s sad, it’s not pretty, but it’s more than three orders of magnitude faster (at a total run time of 43 seconds, compared to an estimated 28 hours for the awk method, given pre-cached data). It requires changing ulimit to allow enough open files, and if your lines are longer than LINE_LENGTH, it will not work correctly.
Still, it runs 2300 times faster than the next best solution.
If someone stumbles upon this looking to do this task, this will do it. Just be careful and check that it actually worked.
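A minimal sketch of that approach, not the original program. It assumes every input file has the same number of lines, that no line (newline included) is longer than LINE_LENGTH - 1 bytes, and that the output directory already exists. The names transpose_lines and raise_fd_limit are made up for illustration, and the setrlimit call stands in for running ulimit -n by hand; it can only raise the soft limit up to the hard limit.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>

/* Assumed upper bound on a line, including '\n' and the trailing '\0'.
 * Longer lines get split across output files, so the result is wrong. */
#define LINE_LENGTH 64

/* Try to make sure at least `needed` files can be open at once.
 * Returns 0 on success, -1 if the limit cannot be raised far enough. */
static int raise_fd_limit(rlim_t needed)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return -1;
    if (rl.rlim_cur >= needed)
        return 0;                       /* already high enough */
    if (rl.rlim_max != RLIM_INFINITY && rl.rlim_max < needed)
        return -1;                      /* hard limit is too low */
    rl.rlim_cur = needed;
    return setrlimit(RLIMIT_NOFILE, &rl);
}

/* Write line i of every input file, in input order, to out_dir/<i>.txt
 * (1-based, like FNR in awk). Output files are truncated, not appended to.
 * Returns 0 on success, -1 on any error. */
int transpose_lines(const char *out_dir, char *const inputs[],
                    int n_inputs, long n_lines)
{
    FILE **out;
    char path[4096], line[LINE_LENGTH];
    long i;
    int f, status = 0;

    if (n_lines <= 0)
        return -1;
    /* One descriptor per output file, plus a few for stdio and the input. */
    if (raise_fd_limit((rlim_t)n_lines + 16) != 0)
        return -1;
    out = calloc((size_t)n_lines, sizeof *out);
    if (out == NULL)
        return -1;

    /* Open every output file once and keep it open for the whole run. */
    for (i = 0; i < n_lines; i++) {
        snprintf(path, sizeof path, "%s/%ld.txt", out_dir, i + 1);
        out[i] = fopen(path, "w");
        if (out[i] == NULL) {
            status = -1;
            goto done;
        }
    }

    /* Single pass over the inputs: line i goes straight to output i. */
    for (f = 0; f < n_inputs && status == 0; f++) {
        FILE *in = fopen(inputs[f], "r");
        if (in == NULL) {
            status = -1;
            break;
        }
        for (i = 0; i < n_lines && fgets(line, sizeof line, in); i++) {
            if (fputs(line, out[i]) == EOF) {
                status = -1;
                break;
            }
        }
        fclose(in);
    }

done:
    for (i = 0; i < n_lines; i++)
        if (out[i] != NULL && fclose(out[i]) != 0)
            status = -1;
    free(out);
    return status;
}
```

Calling transpose_lines("times", argv + 1, argc - 1, 20000) from main with posns/* on the command line would mirror the awk one-liner, except that existing output files are overwritten rather than appended to. As above, check afterwards that every output file has the expected number of lines.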