
The Archive Base


Editorial Team
Asked: June 15, 2026 (2026-06-15T17:01:32+00:00)


I have a reasonably large data set (10K files, each with 20K lines). I need to swap files and lines, giving me 20K files, each with 10K lines: output file N should hold line N of every input file.

I had a solution that combined it all into one massive table and then extracted the columns with cut, but cut was taking too long (scanning through a 4GB file 10K times isn’t exactly fast, even if the file is sitting in cache).

So I wrote a (surprisingly simple) once-through in awk:

awk '{ print >> "times/"FNR".txt" }' posns/*

This does the job, but is also rather slow (about 10s per input file). My guess is that it is doing field separation, despite the fact that I don’t need that at all. Is there a way to disable that feature to speed it up, or am I going to have to write up a solution in yet another language?

If it helps, while I’d prefer a general solution, each line in each file is of the form %d %lf %lf, so lines will be at most 21 bytes in this case (the floats are all less than 100, and the integer is 0 or 1).


1 Answer

  1. Editorial Team
    Added an answer on June 15, 2026 at 5:01 pm

    Eventually I gave up on the pretty shell method and wrote another version in C. It’s sad, it’s not pretty, but it’s more than three orders of magnitude faster (a total run time of 43 seconds, compared to an estimated 28 hours for the awk method, given pre-cached data). It keeps every output file open at once, so it requires raising the open-file limit with ulimit, and if your lines are longer than LINE_LENGTH, it will not work correctly.

    Still, it runs 2300 times faster than the next best solution.
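The open-file limit mentioned above can also be raised from inside the program instead of with ulimit in the shell. The following is a minimal sketch, assuming a POSIX system; `raise_nofile_limit` is a hypothetical helper name that is not part of the original program, and the call can still fail if the hard limit is lower than the number of output files.

```c
#include <stdio.h>
#include <sys/resource.h>

/* Hypothetical helper: try to make the soft limit on open file
 * descriptors at least `wanted`. Returns 0 on success, -1 on failure. */
int raise_nofile_limit(rlim_t wanted) {
        struct rlimit rl;

        if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
                perror("getrlimit");
                return -1;
        }
        /* Nothing to do if the current soft limit is already high enough. */
        if (rl.rlim_cur >= wanted)
                return 0;
        /* The soft limit can never exceed the hard limit. */
        if (rl.rlim_max != RLIM_INFINITY && rl.rlim_max < wanted) {
                fprintf(stderr, "hard limit on open files is too low\n");
                return -1;
        }
        rl.rlim_cur = wanted;
        if (setrlimit(RLIMIT_NOFILE, &rl) != 0) {
                perror("setrlimit");
                return -1;
        }
        return 0;
}
```

For the data set in the question, something like `raise_nofile_limit(20000 + 16)` before the first output file is opened would leave room for the 20K output files plus the standard streams and the current input file. The extra 16 is an arbitrary margin, not a measured value.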

    If someone stumbles upon this looking to do this task, this will do it. Just be careful and check that it actually worked.
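One way to check that it actually worked is to count the lines in each output file. If every input file has the same number of lines, each output file should end up with exactly one line per input file. The sketch below assumes the `times/%09d.txt` naming used by the program; `count_lines` and `count_bad_outputs` are hypothetical helper names introduced only for this illustration.

```c
#include <stdio.h>

/* Count newline characters in a file; returns -1 if it cannot be opened. */
long count_lines(const char *path) {
        FILE *f = fopen(path, "r");
        long n = 0;
        int c;

        if (f == NULL)
                return -1;
        while ((c = getc(f)) != EOF)
                if (c == '\n')
                        n++;
        fclose(f);
        return n;
}

/* Return how many of the files dir/000000000.txt .. dir/<n_outputs-1>.txt
 * are missing or do not contain exactly `expected` lines. */
int count_bad_outputs(const char *dir, int n_outputs, long expected) {
        char fname[512];
        int bad = 0;
        int i;

        for (i = 0; i < n_outputs; i++) {
                snprintf(fname, sizeof fname, "%s/%09d.txt", dir, i);
                if (count_lines(fname) != expected)
                        bad++;
        }
        return bad;
}
```

With the numbers from the question, `count_bad_outputs("times", 20000, 10000)` should return 0 after a successful run.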

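        #include <stdio.h>
        #include <stdlib.h>

        /* Lines that do not fit in this buffer (newline included) get split. */
        #define LINE_LENGTH 1024

        int main(int argc, char* argv[]) {
                int fn;
                int ln;
                char line[LINE_LENGTH];

                int fmax=10;    /* current capacity of the files array */
                int ftot=0;     /* number of output files actually opened */
                FILE** files=malloc(fmax*sizeof(FILE*));
                FILE** grown;
                FILE* in;
                char fname[255];

                if(argc<2) {
                        fprintf(stderr,"usage: transpose input-file...\n");
                        return 1;
                }
                if(files==NULL) {
                        fprintf(stderr,"Out of memory\n");
                        return 1;
                }
                printf("%d arguments\n", argc);

                /* First input file: create one output file per line. */
                printf("opening %s\n",argv[1]);
                in=fopen(argv[1],"r");
                if(in==NULL) {
                        perror(argv[1]);
                        return 1;
                }

                for(ln=0;fgets(line,LINE_LENGTH,in); ln++) {
                        if(ln==fmax) {
                                printf("%d has reached %d; reallocing\n",ln,fmax);
                                fmax*=2;
                                grown=realloc(files,fmax*sizeof(FILE*));
                                if(grown==NULL) {
                                        fprintf(stderr,"Out of memory\n");
                                        return 1;
                                }
                                files=grown;
                        }
                        /* The times/ directory must already exist. */
                        snprintf(fname, sizeof fname, "times/%09d.txt",ln);
                        files[ln]=fopen(fname,"w");
                        if(files[ln]==NULL) {
                                fprintf(stderr,"Failed at opening file number %d\n",ln);
                                return 1;
                        }
                        fputs(line,files[ln]);
                }
                ftot=ln;
                fclose(in);

                /* Remaining input files: append line ln to output file ln. */
                for(fn=2;fn<argc;fn++) {
                        printf("working on file %d\n",fn);
                        in=fopen(argv[fn],"r");
                        if(in==NULL) {
                                perror(argv[fn]);
                                return 1;
                        }
                        /* Stop at ftot so a longer input cannot index past the array. */
                        for(ln=0;ln<ftot && fgets(line,LINE_LENGTH,in); ln++) {
                                fputs(line,files[ln]);
                        }
                        fclose(in);
                }
                for(ln=0;ln<ftot;ln++) {
                        fclose(files[ln]);
                }
                free(files);
                return 0;
        }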
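The LINE_LENGTH caveat can at least be detected rather than silently producing wrong output. A rough sketch, assuming lines are read with fgets as in the program: a buffer that comes back without a trailing newline while more input remains means the line did not fit. `read_full_line` is a hypothetical name, not something from the original code.

```c
#include <stdio.h>
#include <string.h>

/* Read one line into buf (capacity `size`).
 * Returns 1 for a complete line, 0 at end of file, and -1 if the line
 * did not fit in the buffer; in that case the rest of it is skipped. */
int read_full_line(char *buf, size_t size, FILE *in) {
        size_t len;
        int c;

        if (fgets(buf, (int)size, in) == NULL)
                return 0;
        len = strlen(buf);
        if (len > 0 && buf[len - 1] == '\n')
                return 1;
        /* No newline stored: either this is a final line without a
         * trailing newline, or the line was longer than the buffer. */
        c = getc(in);
        if (c == EOF)
                return 1;
        while (c != '\n' && c != EOF)
                c = getc(in);
        return -1;
}
```

Replacing the `fgets` calls with this and aborting on -1 would turn the silent failure into an explicit error, at the cost of one extra check per line.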
    
