I’m implementing a logging system for a new Hadoop cluster I’ve set up. In the setups I’ve seen before, logs were split by day, with individual files rolled over at around 10x the HDFS block size. I haven’t had any problems with this approach when I’ve needed to use it, but after a discussion with a coworker who wanted to store logs in one long file, I realized I wasn’t really sure why the 10x convention is used. The reasons I could think of are:
- MapReduce jobs will run significantly faster when we’re only interested in a couple of days.
- files can be zipped/tar’d/lzo’d up to save space.
Are there others? I couldn’t figure out why people shard a single day’s files at around 10x the HDFS block size. I’d like to understand the reasoning behind storing logs at different file sizes.
The bigger your files, the better a job the JobTracker will do scheduling your jobs. Very small files mean lots of tasks, which hurts performance. However, huge files don’t let you query just part of your dataset. You need to find a balance between how much data you produce per day and how big your files will be. If you produce about 10x the block size per day, then use one file per day; that way it’ll be easy to query only 5 days’ worth. If you produce less than that, consider creating an ETL job to concatenate files together.
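To make the trade-off concrete, here is a rough Python sketch. It assumes one map task per HDFS block for each non-empty, splittable file, which is only an approximation of how splits are actually computed. The 64 MB block size and the function names `estimated_map_tasks` and `days_per_file` are my own assumptions for illustration, not anything from Hadoop itself.

```python
# Assumed block size for illustration; use your cluster's configured value.
BLOCK_SIZE = 64 * 1024 * 1024


def estimated_map_tasks(file_sizes, block_size=BLOCK_SIZE):
    """Rough task count: one task per block of each non-empty file."""
    # Integer ceiling division, so a 1-byte file still costs one task.
    return sum((size + block_size - 1) // block_size
               for size in file_sizes if size > 0)


def days_per_file(bytes_per_day, block_size=BLOCK_SIZE, target_blocks=10):
    """How many days of logs to combine to reach about target_blocks blocks."""
    if bytes_per_day <= 0:
        raise ValueError("bytes_per_day must be positive")
    target = target_blocks * block_size
    if bytes_per_day >= target:
        return 1  # one file per day is already big enough
    return (target + bytes_per_day - 1) // bytes_per_day


MB = 1024 * 1024
many_small = estimated_map_tasks([1 * MB] * 640)   # 640 files of 1 MB each
one_large = estimated_map_tasks([640 * MB])        # same data in a single file
print(many_small, one_large)
print(days_per_file(100 * MB))
```

Under these assumptions, the same 640 MB of logs costs 640 tasks as 1 MB files but only 10 tasks as a single file, and a cluster producing 100 MB per day would concatenate about 7 days per file to reach the 10x mark. Non-splittable compression (gzip, for example) changes the picture, since each such file is read by a single task regardless of size.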