I have a process that’s going to initially generate 3-4 million PDF files, and continue at the rate of 80K/day. They’ll be pretty small (50K) each, but what I’m worried about is how to manage the total mass of files I’m generating for easy lookup. Some details:
- I’ll have some other steps to run once a file has been generated, and there will be a few servers participating, so I’ll need to watch for files as they’re generated.
- Once generated, the files will be available through a lookup process I’ve written. Essentially, I’ll need to pull them based on an order number, which is unique per file.
- At any time, an existing order number may be resubmitted, and the generated file will need to overwrite the original copy.
Originally, I had planned to write these files all to a single directory on a NAS, but I realize this might not be a good idea, since there are millions of them and Windows might not handle a million-file-lookup very gracefully. I’m looking for some advice:
- Is a single folder okay? The files will never be listed – they’ll only be retrieved using a System.IO.File with a filename I’ve already determined.
- If I do use a single folder, can I watch for new files with a System.IO.FileSystemWatcher, or will it become sluggish with that many files?
- Should they be stored as BLOBs in a SQL Server database instead? Since I’ll need to retrieve them by a reference value, maybe this makes more sense.
Thank you for your thoughts!
I’d group the files in specific subfolders, and try to organize them (the subfolders) in some business-logic way. Perhaps all files made during a given day? During a six-hour period of each day? Or start a new subfolder every N files, I’d say a few thousand per folder at most. (There’s probably an ideal number out there, hopefully someone will post it.)
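Since lookups are by order number, one way to get subfolders of roughly that size is to derive the folder from the order number itself, so the path never has to be stored or searched for. Here is a rough sketch of that idea in Python (the same approach should carry over to .NET). The names `shard_path` and `store_pdf`, the fan-out of 4096 folders, and the `.pdf` naming are all my own assumptions, not anything from the original setup. Writing to a temp file and then renaming is meant to cover the "resubmitted order overwrites the original" case without a reader ever seeing a half-written file.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def shard_path(root: Path, order_number: str, fanout: int = 4096) -> Path:
    """Map an order number to root/<bucket>/<order_number>.pdf.

    The bucket comes from a hash of the order number, so the same order
    number always lands in the same subfolder. The fanout is an assumed
    value: about 4 million files over 4096 folders is roughly 1,000 each.
    """
    digest = hashlib.sha256(order_number.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % fanout
    return root / f"{bucket:04d}" / f"{order_number}.pdf"


def store_pdf(root: Path, order_number: str, data: bytes) -> Path:
    """Write (or overwrite) the PDF for an order number.

    The data goes to a temp file in the target folder first and is then
    renamed over the final name, replacing any earlier copy.
    """
    target = shard_path(root, order_number)
    target.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, suffix=".tmp")
    with os.fdopen(fd, "wb") as tmp_file:
        tmp_file.write(data)
    os.replace(tmp_name, target)  # overwrites an existing file if present
    return target
```

The lookup side then just calls `shard_path(root, order_number)` and opens the result. Note that the folders keep growing at 80K files a day, so the fan-out would need to be chosen with the expected total in mind.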
Do the files ever age out and get deleted? If so, sort and file by deletable chunk. If not, can I be your hardware vendor?
There are arguments on both sides of storing files in a database.
A last point to worry about is keeping the data “aligned”. If the DB stores the info on the file along with the path/name to the file, and the file gets moved, you could get totally hosed.
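If you do go with paths in the DB, a periodic consistency check can at least tell you when things have drifted. A minimal sketch, assuming a table called `pdf_files` with `order_number` and `rel_path` columns (both names made up for illustration) and using Python's built-in `sqlite3` as a stand-in for SQL Server. Storing the path relative to a root is also my assumption: it means moving the whole tree only requires changing the root, not every row.

```python
import sqlite3
from pathlib import Path


def find_missing(conn: sqlite3.Connection, root: Path) -> list:
    """Return the order numbers whose recorded file is not found under root.

    Assumes a hypothetical table pdf_files(order_number, rel_path), where
    rel_path is stored relative to root.
    """
    rows = conn.execute("SELECT order_number, rel_path FROM pdf_files")
    return [order for order, rel_path in rows
            if not (root / rel_path).is_file()]
```

Anything this returns is a row pointing at a file that has been moved or deleted, which you can then regenerate or fix by hand.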