At work, I have started working on a program that can potentially generate hundreds of thousands of mostly small files an hour. My predecessors have found out that working with many small files can become very slow, so they have resorted to some (in my opinion) crude methods to alleviate the problem.
So I asked my boss why we don't use a database instead, and he gave me his oh-so-famous I-know-better-than-you look and told me that obviously a database that big won't perform well.
My question is, is it really so? It seems to me that a database engine should be able to handle such data much better than the file system. Here are the conditions we have:
- The program mostly writes data. Queries are much less frequent and their performance is not very important.
- Millions of files could be generated every day. Most of these are small (a few kilobytes) but some can be huge.
If you think we should opt for the database solution, what open source database system do you think will work best? (If I decide that a database will certainly work better, I'm going to push for a change whatever the boss says!)
This is another one of those “it depends” type questions.
If you are just writing data (write once, read hardly ever), then just use the file system. Consider a hash-directory approach that spreads the files across lots of sub-directories, since things tend to slow down with many files in a single directory.
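To make the hash-directory idea concrete, here is a minimal Python sketch. The function names, the choice of SHA-1, and the two-level layout with two hex characters per level are all my own assumptions for illustration; tune them to your file counts.

```python
import hashlib
from pathlib import Path


def sharded_path(root, name, levels=2, width=2):
    """Map a file name to root/ab/cd/name using a hash of the name.

    `levels` and `width` are illustrative defaults: two levels of two
    hex characters give 256 * 256 = 65536 leaf directories.
    """
    digest = hashlib.sha1(name.encode("utf-8")).hexdigest()
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return Path(root).joinpath(*parts, name)


def write_sharded(root, name, data):
    """Write `data` (bytes) under its sharded directory, creating it if needed."""
    path = sharded_path(root, name)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(data)
    return path
```

Because the directory is derived from the name alone, a reader can find any file again by calling `sharded_path` with the same arguments, without a separate index.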
If you are writing hundreds of thousands of events for later querying (e.g. find everything with X > 10 and Y < 11) then a database sounds like a great idea.
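As a rough sketch of the database route, the following uses SQLite through Python's standard `sqlite3` module. SQLite is only a stand-in I picked because it needs no server; the table layout, column names, and function names are hypothetical. Batching many inserts into one transaction is what usually matters for a write-heavy load.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the event store; schema is an illustrative assumption."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events ("
        "id INTEGER PRIMARY KEY, x REAL NOT NULL, y REAL NOT NULL, payload BLOB)"
    )
    # Index the columns the occasional queries filter on.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_events_x_y ON events (x, y)")
    return conn


def insert_batch(conn, rows):
    """Insert many (x, y, payload) rows in a single transaction."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO events (x, y, payload) VALUES (?, ?, ?)", rows
        )


def find_events(conn, x_min, y_max):
    """Return (id, x, y) for every event with x > x_min and y < y_max."""
    cur = conn.execute(
        "SELECT id, x, y FROM events WHERE x > ? AND y < ? ORDER BY id",
        (x_min, y_max),
    )
    return cur.fetchall()
```

With this in place, the example query from above becomes `find_events(conn, 10, 11)`.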
If you are writing hundreds of thousands of bits of non-relational data (e.g. simple key-value pairs) then it might be worth investigating a NoSQL approach.
The best approach is probably to prototype all the ideas you can think of, measure and compare!
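A prototype comparison can be as small as the sketch below, which times writing the same payloads as individual files and as rows in a SQLite database. Everything here (function names, payload size, counts, SQLite as the database) is an assumption for illustration, and the numbers only mean something when run on your real hardware, file system, and data sizes.

```python
import sqlite3
import tempfile
import time
from pathlib import Path


def time_files(root, payloads):
    """Seconds taken to write each payload to its own file under `root`."""
    start = time.perf_counter()
    for i, data in enumerate(payloads):
        (Path(root) / f"{i}.bin").write_bytes(data)
    return time.perf_counter() - start


def time_sqlite(db_path, payloads):
    """Seconds taken to insert all payloads as BLOB rows in one transaction."""
    start = time.perf_counter()
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE blobs (id INTEGER PRIMARY KEY, data BLOB)")
    with conn:
        conn.executemany(
            "INSERT INTO blobs (data) VALUES (?)", ((d,) for d in payloads)
        )
    conn.close()
    return time.perf_counter() - start


def compare(count=1000, size=4096):
    """Run both writers on `count` payloads of `size` bytes in a temp directory."""
    payloads = [bytes(size)] * count
    with tempfile.TemporaryDirectory() as tmp:
        files_dir = Path(tmp) / "files"
        files_dir.mkdir()
        return {
            "files": time_files(files_dir, payloads),
            "sqlite": time_sqlite(str(Path(tmp) / "test.db"), payloads),
        }


if __name__ == "__main__":
    for name, seconds in compare().items():
        print(f"{name}: {seconds:.3f} s")
```

Run it several times and with file counts close to your real workload, since caching and directory size can change the picture considerably.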