I am interesting in making an app where users can upload large files (~2MB) that are converted into html documents. This application will not have a database. Instead, these html files are stored in a particular writable directory outside of the document source tree. Thus this directory will grow larger and larger as more files are added to it. Users should be able to view these html files by visiting the appropriate url. All security concerns aside, what do I need to be worried about if this directory continues to grow? Will accessing the files inside take longer when there are more of them? Will it potentially crash because of this? Should I create a new directory every 100 files or so to prevent this?
It it is important, I want to make this app using pyramid and python
You might want to partition the directories by user, app or similar so that it’s easy to manage anyway – like if a user stops using the service you could just delete their directory. Also I presume you’ll be zipping them up. If you keep it well decoupled then you’ll be able to change your mind later.
I’d be interested to see how using something like SQLite would work for you, as you could have a sqlite db per partitioned directory.
I presume HTML files are larger than the file they uploaded, so why store the big HTML file.
Things like Mongodb etc are out of the question? as is your app scales with multiple servers you’ve the issue of accessing other files on a different server, unless you pick the right server in the first place using some technique. Then it’s possible you’ve got servers sitting idle as no one wants there documents.
Why the limitation on just storing files in a directory, is it a POC?
EDIT
I find value in reading things like http://blog.fogcreek.com/the-trello-tech-stack/ and I’d advise you find a site already doing what you do and read about their tech. stack.
As someone already commented why not use Amazon S3 or similar.
Ask yourself realistically how many users do you imagine and really do you want to spend a lot of energy worrying about being the next facebook and trying to do the ultimate tech stack for the backend when you could get your stuff out there being used.
Years ago I worked on a system that stored insurance certificates on the filesystem, we use to run out of inodes.!
Dare I say it’s a case of suck it and see what works for you and your app.
EDIT
HAProxy I believe are meant to handle all that load balancing concerns.
As I imagine as a user I wants to http://docs.yourdomain.com/myname/document.doc
although I presume there are security concerns of it being so obvious a name.