Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 7787441
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: June 1, 20262026-06-01T20:40:16+00:00 2026-06-01T20:40:16+00:00

The input is a single file of around 70GB whose every single line contains

  • 0

The input is a single file of around 70GB whose every single line contains client information. A program reads this file and makes one file for each client. There are 8000 clients but we have to provision for 40000 clients. Currently the UNIX sort command is used to sort the file by client and then client files are written. This way the program has only a single file handler open for creating the files.
We do not want to use the sort command as it consumes around 1.5 hours. This would however mean that 8000 file handlers open will need to remain open. Kernel parameters may need to be modified.
Is it possible to open so many files without changing the kernel parameters. I tried going through the libevent website but am not sure if that is the correct solution.

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-06-01T20:40:19+00:00Added an answer on June 1, 2026 at 8:40 pm

    You don’t necessarily need 8000 file handles open at the same time, nor do you need to sort the data. Sorting is a waste unless you need each clients lines sorted as well.

    Presumably, you can identify the client by some item on the line. Let’s say (for example) it’s the first 8 characters on each line, then your pseudo-code looks like this:

    delete all files matching "*_out.dat"
    for each line in file:
        key = left (line, 8)
        open file key + "_out.dat" for append
        write line to file
        close file
    

    That’s it. Simple. Only one file open at a time and no time wasted sorting.

    Now there are further improvements that could be made to that, amongst them being:

    1. Don’t close the file for the previous line unless the next line has a different key. This will catch situations where you have a hundred records in a row for the same key and will hold the file open in that case.

    2. Maintain a cache of open file handles like a most recently used list (say 16 different keys). Again, this will prevent closure until a file handle has to be reused but it will laso handle the situation where clusters are more efficient (such as customer 1,2,3,7,1,2,3,2,2,3,7,4,…).

    But the basic theory remains the same: don’t try to open 8000 (or 40000) files at once when you can get by with less.


    Alternatively, just process the data, stashing it all into a database and using queries to then create each file with a series of queries. Whether that’s faster than the solution above should be tested, as in fact should every suggestion given here. Measure, don’t guess!


    Now, since I’ve invoked that optimisation mantra, let’s do some timings, keeping in mind that this is specific to my hardware and may be different on yours.

    Start with the following script, which generates a one-million-line file where the first eight characters of each line are a random number between 10000000 and 10032767 inclusive. We’ll use characters 5 through 8 inclusive to give us the customer number, ten thousand customers at roughly a hundred lines per customer:

    #!/bin/bash
    line='the quick brown fox jumps over the lazy dog'
    for p0 in 1 2 3 4 5 6 7 8 9 0 ; do
     for p1 in 1 2 3 4 5 6 7 8 9 0 ; do
      for p2 in 1 2 3 4 5 6 7 8 9 0 ; do
       for p3 in 1 2 3 4 5 6 7 8 9 0 ; do
        for p4 in 1 2 3 4 5 6 7 8 9 0 ; do
         for p5 in 1 2 3 4 5 6 7 8 9 0 ; do
          ((x = 10000000 + $RANDOM))
          echo "$x$line"
         done
        done
       done
      done
     done
    done
    

    The file size produced is about 50M. We can scale it up to 100M by simply concatenating 2 copies of it to another file, and this gives us about two hundred lines per customer.


    Now, examine the following program:

    #include <stdio.h>
    #include <string.h>
    #include <errno.h>
    
    #define FOUT_STR "1234_out.dat"
    
    int main (void) {
        FILE *fIn, *fOut;
        char outFile[sizeof (FOUT_STR)];
        char buff[1000];
    
        if ((fIn = fopen ("data.dat", "r")) == NULL) {
            printf ("Error %d opening 'data.dat'\n", errno);
            return 1;
        }
    
        memcpy (outFile, FOUT_STR, sizeof (FOUT_STR));
        if ((fOut = fopen (outFile, "w")) == NULL) {
            printf ("Error %d opening '%s'\n", errno, outFile);
            fclose(fIn);
            return 1;
        }
    
        while (fgets (buff, sizeof (buff), fIn) != NULL) {
            fputs (buff, fOut);
        }
    
        fclose (fOut);
        fclose (fIn);
        return 0;
    }
    

    This gives a baseline figure for just writing all entries to a single file, and takes under a second to run.


    Now let’s have one which opens a new file every two hundred lines – this is the behaviour you’d see if the file was already sorted by customer:

    #include <stdio.h>
    #include <string.h>
    #include <errno.h>
    
    #define FOUT_STR "1234_out.dat"
    
    int main (void) {
        FILE *fIn, *fOut;
        char outFile[sizeof (FOUT_STR)];
        char buff[1000];
        char custNum[5];
        int i = -1;
    
        if ((fIn = fopen ("data.dat", "r")) == NULL) {
            printf ("Error %d opening 'data.dat'\n", errno);
            return 1;
        }
    
        fOut = NULL;
        while (fgets (buff, sizeof (buff), fIn) != NULL) {
            i++;
            if ((i % 200) == 0) {
                if (fOut != NULL)
                    fclose (fOut);
                sprintf (custNum, "%04d", i / 200);
                memcpy (outFile, FOUT_STR, sizeof (FOUT_STR));
                memcpy (outFile, custNum, 4);
                if ((fOut = fopen (outFile, "w")) == NULL) {
                    printf ("Error %d opening '%s'\n", errno, outFile);
                    break;
                }
            }
            fputs (buff, fOut);
        }
        if (fOut != NULL)
            fclose (fOut);
    
        fclose (fIn);
        return 0;
    }
    

    This takes about 2s (0:00:02) for the 100M file, and testing it with a 200M and 400M file indicates that it scales linearly. That means that with a sorted 70G file, you’re looking at about 1400s or 0:23:20. Note that this would be on top of your sort cost which you have as 1.5h (1:30:00), giving you a total cost of 1:53:20.


    Now let’s implement out simplistic program which simply opens each file for append for every line:

    #include <stdio.h>
    #include <string.h>
    #include <errno.h>
    
    #define FOUT_STR "1234_out.dat"
    
    int main (void) {
        FILE *fIn, *fOut;
        char outFile[sizeof (FOUT_STR)];
        char buff[1000];
    
        if ((fIn = fopen ("data.dat", "r")) == NULL) {
            printf ("Error %d opening 'data.dat'\n", errno);
            return 1;
        }
    
        while (fgets (buff, sizeof (buff), fIn) != NULL) {
            memcpy (outFile, FOUT_STR, sizeof (FOUT_STR));
            memcpy (outFile, &(buff[4]), 4);
            if ((fOut = fopen (outFile, "a")) == NULL) {
                printf ("Error %d opening '%s'\n", errno, outFile);
                break;
            }
            fputs (buff, fOut);
            fclose (fOut);
        }
    
        fclose (fIn);
        return 0;
    }
    

    When we run this with the 100M file, it takes 244s (0:04:04). And again, testing with a 200M and 400M file indicates linear scaling. So, for the 70G file, that’s going to be 47:26:40, not really what you would call an improvement over your sub-two-hour sort-and-process option.


    However, let’s try a different tack, with the following program, which maintains a hundred file handles each time through the input file (done a hundred times):

    #include <stdio.h>
    #include <string.h>
    #include <errno.h>
    
    #define FOUT_STR "1234_out.dat"
    
    int main (void) {
        FILE *fIn, *fOut[100];
        char outFile[sizeof (FOUT_STR)];
        char buff[1000];
        int seg, cust;
        char segNum[3], custNum[3];
    
        for (seg = 0; seg < 100; seg++) {
            sprintf (segNum, "%02d", seg);
    
            if ((fIn = fopen ("data.dat", "r")) == NULL) {
                printf ("Error %d opening 'data.dat'\n", errno);
                return 1;
            }
    
            for (cust = 0; cust < 100; cust++) {
                sprintf (custNum, "%02d", cust);
    
                memcpy (outFile, FOUT_STR, sizeof (FOUT_STR));
                memcpy (outFile+0, segNum, 2);
                memcpy (outFile+2, custNum, 2);
                if ((fOut[cust] = fopen (outFile, "w")) == NULL) {
                    printf ("Error %d opening '%s'\n", errno, outFile);
                    return 1;
                }
            }
    
            while (fgets (buff, sizeof (buff), fIn) != NULL) {
                if (memcmp (&(buff[4]), segNum, 2) == 0) {
                    cust = (buff[6] - '0') * 10 + buff[7] - '0';
                    fputs (buff, fOut[cust]);
                }
            }
    
            for (cust = 0; cust < 100; cust++) {
                fclose (fOut[cust]);
            }
    
            fclose (fIn);
        }
    
        return 0;
    }
    

    This is a slight variation which actually processes the input file a hundred times, each time only processing the lines meant for a hundred individual output files.

    When this is run on the 100M file, it takes about 28s (0:00:28). Once again, that seems to scale linearly for a 200M and 400M file, so the 70G file should take 5:26:40.

    Still not even close to the sub-two-hour figure.


    So what happens when we open a thousand output files at a time:

    #include <stdio.h>
    #include <string.h>
    #include <errno.h>
    
    #define FOUT_STR "1234_out.dat"
    
    int main (void) {
        FILE *fIn, *fOut[1000];
        char outFile[sizeof (FOUT_STR)];
        char buff[1000];
        int seg, cust;
        char segNum[2], custNum[4];
    
        for (seg = 0; seg < 10; seg++) {
            sprintf (segNum, "%01d", seg);
    
            if ((fIn = fopen ("data.dat", "r")) == NULL) {
                printf ("Error %d opening 'data.dat'\n", errno);
                return 1;
            }
    
            for (cust = 0; cust < 1000; cust++) {
                sprintf (custNum, "%03d", cust);
    
                memcpy (outFile, FOUT_STR, sizeof (FOUT_STR));
                memcpy (outFile+0, segNum, 1);
                memcpy (outFile+1, custNum, 3);
                if ((fOut[cust] = fopen (outFile, "w")) == NULL) {
                    printf ("Error %d opening '%s'\n", errno, outFile);
                    return 1;
                }
            }
    
            while (fgets (buff, sizeof (buff), fIn) != NULL) {
                if (memcmp (&(buff[4]), segNum, 1) == 0) {
                    cust = (buff[5] - '0') * 100 + (buff[6] - '0') * 10 + buff[7] - '0';
                    fputs (buff, fOut[cust]);
                }
            }
    
            for (cust = 0; cust < 1000; cust++) {
                fclose (fOut[cust]);
            }
    
            fclose (fIn);
        }
    
        return 0;
    }
    

    That takes about 12s for the 100M file, and would give us 2:20:00, coming close to the sort but not quite there.


    Unfortuntely, when we go the next logical step, trying to opening up the entire 10000 files in one hit, we see:

    Error 24 opening '1020_out.dat'
    

    meaning that we’ve finally come to the limit (standard input, standard output, standard error and about 1019 other file handles), which indicates that 1024 handles is about all we are allowed.

    So maybe the sort-and-process method is the best way to go.

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

Is there a simple way to upload a single file through HTML input type
I'm trying to combine multiple files in multiple input directories into a single file,
I'm working on a program that takes a redirected file as input. For example,
I've learned how to store a single input of data and return a value
I'm having trouble getting the jQuery validation plugin to validate single input elements on
We're letting users search a database from a single text input and I'm having
Assume I have a set of numbers like 1,2,3,4,5,6,7 input as a single String
I need a quick text input dialog box (MessageBox with a single text box
I have in input, which could be a single primitive or a list or
I'm writing a bash script in which I read single characters from the input.

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.