I have a system that receives log files from different places over HTTP (>10k producers, 10 logs per day, ~100 lines of text each).
I would like to store them so that I can compute miscellaneous statistics over them nightly, export them (ordered by date of arrival or by first-line content), and so on.
My question is: what's the best way to store them?
- Flat text files (with proper locking), one file per uploaded file, one directory per day/producer
- Flat text files, one (big) file per day for all producers (the problem here will be indexing and locking)
- Database table with text (MySQL is preferred for internal reasons) (problem with DB purge, as DELETE can be very slow!)
- Database table with one record per line of text
- Database with sharding (one table per day), allowing simple data purge (this is really partitioning; however, the version of MySQL I have access to, i.e. the one supported internally, does not support it)
- Document-based DB à la CouchDB or MongoDB (possible problems with indexing / maturity / speed of ingestion)
Any advice ?
I’d pick the very first solution.
I don't see why you would need a DB at all. It seems like all you need is to scan through the data. Keep the logs in their most "raw" state, process them, and then create a tarball for each day.
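To make that concrete, here's a minimal sketch of the nightly step: scan a day's raw log files and bundle them into one compressed tarball. The function name `archive_day` and the `*.log` naming convention are my assumptions, not anything from your system.

```python
import tarfile
from pathlib import Path

def archive_day(day_dir: str, out_tar: str) -> int:
    """Bundle all raw log files under day_dir into one gzipped tarball.

    Returns the number of files archived. Assumes logs end in ".log"
    (adjust the glob to match your actual upload naming scheme).
    """
    files = sorted(Path(day_dir).rglob("*.log"))
    with tarfile.open(out_tar, "w:gz") as tar:
        for f in files:
            # Store paths relative to the day directory so the
            # tarball extracts cleanly anywhere.
            tar.add(f, arcname=str(f.relative_to(day_dir)))
    return len(files)
```

Your statistics pass can run over the raw files just before (or instead of) archiving; the point is that a plain directory scan is all the "query engine" you need here.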
The only reason to aggregate would be to reduce the number of files. On some file systems, performance degrades rapidly once a directory holds more than N files. Check your filesystem, and if that's the case, organize a simple two-level hierarchy, say, using the first two digits of the producer ID as the first-level directory name.
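The two-level layout above can be sketched like this; the function name `log_path` and the exact ordering of path components (shard / producer / day) are illustrative assumptions:

```python
import os

def log_path(base_dir: str, producer_id: int, day: str, filename: str) -> str:
    """Build (and create) a two-level path for an uploaded log file.

    The first two digits of the producer ID form the top-level shard
    directory, which keeps any single directory's entry count bounded.
    """
    shard = str(producer_id).zfill(2)[:2]  # first 2 digits of producer ID
    directory = os.path.join(base_dir, shard, str(producer_id), day)
    os.makedirs(directory, exist_ok=True)  # idempotent, safe under concurrency
    return os.path.join(directory, filename)
```

With >10k producers this spreads the tree across at most 100 top-level directories, so no single directory ever accumulates enough entries to hit the filesystem's slow path.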