I have some processing I want to do on thousands of files simultaneously. Grab the first byte of all the files and do something, go to the next byte, etc. The files could be any size, so loading them all into memory could be prohibitive.
I’m concerned that naively opening thousands of files at once will run into the operating system’s limit on open file descriptors.
But cycling through and opening/closing files would be rather inefficient, I imagine.
Is there some efficient mechanism to handle what I’m trying to do?
NOTE: this function may be distributed to user machines that I have no control over, so I can’t just go changing settings on the OS.
Are these files small enough that you could read them all into memory at once? If so, read the files in one at a time, then process all of them a byte at a time.
You might run into file descriptor limits. The only way to find out is to try.
Yes, cycling through and opening/closing files would be inefficient. But if you can’t read all the files into memory, and your operating system can’t open thousands of files at a time, then this is your last resort.
What you can do is find out the limit of simultaneous open files that your system can handle. Let’s just say for the sake of discussion that your system can open 100 files at a time, and you have 2,500 files to process.
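On Unix-like systems, one way to find that limit from inside the program, without changing any OS settings, is to ask for the per-process descriptor limit. This is only a sketch: the `resource` module is not available on Windows, and the `reserve` margin and the fallback cap of 1024 are assumptions of mine, not values from your system.

```python
import resource  # standard library, Unix-only


def max_open_files(reserve: int = 16) -> int:
    """Estimate how many files this process can hold open at once.

    `reserve` is an assumed safety margin for stdin/stdout/stderr and
    anything else the process already has open.
    """
    soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft_limit == resource.RLIM_INFINITY:
        return 1024  # arbitrary cap chosen for this sketch
    return max(1, soft_limit - reserve)


print(max_open_files())
```

The number this reports is only an upper bound. Other factors (memory per buffer, antivirus, network file systems) may make a smaller batch size more practical.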
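Then your process would look something like this: open 100 files, process them a byte at a time, write the partial result to an intermediate file, close everything, and repeat with the next 100 files.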
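A minimal sketch of that first pass follows. Since you didn't say what "do something" is, I've assumed it is an XOR of the bytes at each position, with shorter files treated as zero-padded. That only stands in for your real operation, and the batching trick works only if your operation can be split into partial results that are combined later. The names `combine_batch`, `first_pass`, and the `intermediate_NNNNN.bin` file naming are all hypothetical.

```python
import os
from contextlib import ExitStack


def combine_batch(paths, out_path, chunk_size=64 * 1024):
    """XOR the given files position by position and write the result.

    XOR is a placeholder for the real per-byte operation.
    """
    with ExitStack() as stack:  # closes every file, even on error
        inputs = [stack.enter_context(open(p, "rb")) for p in paths]
        out = stack.enter_context(open(out_path, "wb"))
        while True:
            # Read the same offset range from every file in the batch.
            chunks = [f.read(chunk_size) for f in inputs]
            longest = max((len(c) for c in chunks), default=0)
            if longest == 0:
                break  # every file is exhausted
            acc = bytearray(longest)  # zeros: the identity for XOR
            for chunk in chunks:
                for i, b in enumerate(chunk):
                    acc[i] ^= b
            out.write(acc)


def first_pass(paths, batch_size, work_dir):
    """Process the files in batches; return the intermediate file paths.

    Keep batch_size at least one below the open-file limit, because the
    intermediate output file needs a descriptor too.
    """
    intermediates = []
    for n, start in enumerate(range(0, len(paths), batch_size)):
        out_path = os.path.join(work_dir, f"intermediate_{n:05d}.bin")
        combine_batch(paths[start:start + batch_size], out_path)
        intermediates.append(out_path)
    return intermediates
```

Reading in chunks rather than one byte per call keeps the logic byte-by-byte while avoiding thousands of tiny reads.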
Now, after running this process through all your files, you’ll have 25 intermediate files.
Then your second process would look something like this: open the 25 intermediate files, process them a byte at a time, and write the final result.
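A sketch of the second pass, under the same assumption that the real operation can be stood in for by an XOR at each byte position (shorter files count as zero-padded). The function name `second_pass` is hypothetical, and deleting the intermediate files at the end is my choice, not a requirement.

```python
import os
from contextlib import ExitStack


def second_pass(intermediate_paths, final_path, chunk_size=64 * 1024):
    """Combine the intermediate files into one final result file.

    XOR is a placeholder for the real per-byte operation.
    """
    with ExitStack() as stack:  # closes every file, even on error
        inputs = [stack.enter_context(open(p, "rb"))
                  for p in intermediate_paths]
        out = stack.enter_context(open(final_path, "wb"))
        while True:
            chunks = [f.read(chunk_size) for f in inputs]
            longest = max((len(c) for c in chunks), default=0)
            if longest == 0:
                break  # every intermediate file is exhausted
            acc = 0
            for chunk in chunks:
                # Pad short chunks with zero bytes so offsets line up,
                # then XOR the whole chunk at once as a big integer.
                acc ^= int.from_bytes(chunk.ljust(longest, b"\0"), "big")
            out.write(acc.to_bytes(longest, "big"))
    # The intermediate files are no longer needed once the result exists.
    for p in intermediate_paths:
        os.remove(p)
```

If the number of intermediate files ever exceeds the open-file limit as well, the same batching step can be applied again to the intermediates before this final pass.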
You would determine the actual numbers (simultaneous files open, number of intermediate files) through experimentation or research on your operating system.