My specific problem is that I have a set of Apache access logs, and I want to extract from them a “rolled up” count of requests, grouped into time windows of a specified size.
Example of my data:
127.0.0.1 - - [01/Dec/2011:00:00:11 -0500] "GET / HTTP/1.0" 304 266 "-" "Sosospider+(+http://help.soso.com/webspider.htm)"
127.0.0.1 - - [01/Dec/2011:00:00:24 -0500] "GET /feed/rss2/ HTTP/1.0" 301 447 "-" "Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; 1 subscribers; feed-id=12878631678486589417)"
127.0.0.1 - - [01/Dec/2011:00:00:25 -0500] "GET /feed/ HTTP/1.0" 304 189 "-" "Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; 1 subscribers; feed-id=12878631678486589417)"
127.0.0.1 - - [01/Dec/2011:00:00:30 -0500] "GET /robots.txt HTTP/1.0" 200 333 "-" "Mozilla/5.0 (compatible; ScoutJet; +http://www.scoutjet.com/)"
127.0.0.1 - - [01/Dec/2011:00:00:30 -0500] "GET / HTTP/1.0" 200 10011 "-" "Mozilla/5.0 (compatible; ScoutJet; +http://www.scoutjet.com/)"
As you can see, each line represents an event (in this case, an HTTP request) and contains a timestamp.
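For illustration, here is a minimal Python sketch of pulling the timestamp out of one such line. The names `TIMESTAMP_RE` and `parse_timestamp` are hypothetical, and the sketch assumes the bracketed timestamp layout shown in the sample lines and an English month abbreviation (`%b` depends on the locale).

```python
import re
from datetime import datetime

# Matches the first bracketed field, which holds the timestamp in the sample data.
TIMESTAMP_RE = re.compile(r"\[([^\]]+)\]")

def parse_timestamp(line):
    """Return the request time as a timezone-aware datetime, or None if absent."""
    match = TIMESTAMP_RE.search(line)
    if match is None:
        return None
    try:
        # Example field: 01/Dec/2011:00:00:11 -0500
        return datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")
    except ValueError:
        # The brackets held something other than a timestamp.
        return None
```

Because the result carries its UTC offset, it can be compared with or converted to UTC without further bookkeeping.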
Assuming my data covers 3 days, and I specify a time window size of 1 day, I’d like to generate something like this:
Start             End               Count
2011-12-01 05:00  2011-12-02 05:00  2822
2011-12-02 05:00  2011-12-03 05:00  2572
2011-12-03 05:00  2011-12-04 05:00  604
But I need to be able to vary the size of the window — I might want to analyze a given dataset using windows of 5 minutes, 10 minutes, 1 hour, 1 day, or 1 week, etc.
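To make the variable window size concrete, here is a rough Python sketch of the grouping step. It is not the method of any existing tool; `rollup` and its `origin` parameter are hypothetical names. It assumes the timestamps are already parsed into timezone-aware `datetime` objects, and that windows should be aligned to midnight of the first event's day unless an origin is given.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

def rollup(timestamps, window, origin=None):
    """Count events per fixed-size window; returns (start, end, count) rows.

    `window` is a timedelta (5 minutes, 1 hour, 1 day, 1 week, ...).
    Windows containing no events are omitted from the result.
    """
    counts = Counter()
    for ts in timestamps:
        if origin is None:
            # Assumption: align windows to midnight of the first event's day,
            # in that event's own UTC offset.
            origin = ts.replace(hour=0, minute=0, second=0, microsecond=0)
        # timedelta // timedelta gives the integer index of the window.
        counts[(ts - origin) // window] += 1
    return [
        (origin + i * window, origin + (i + 1) * window, counts[i])
        for i in sorted(counts)
    ]
```

Calling `rollup(events, timedelta(days=1))` and `rollup(events, timedelta(minutes=5))` on the same events changes only the window size. Since events are keyed by window index, the input does not need to be sorted.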
I also need the library/tool to be capable of analyzing a dataset (a series of lines) of hundreds or even thousands of megabytes in size.
A prebuilt tool which can accept the data via standard input would be great, but a library would be totally fine, as I could just build the tool around the library. Any language would be fine; if I don’t know it I can learn it.
I’d prefer to do this by piping the access log data directly into a tool/library with minimal dependencies — I’m not looking for suggestions to store the data in a database and then query the database to do the analysis. If I need to, I can figure that out myself.
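As a sketch of what the piped, database-free approach could look like, the Python generator below reads lines one at a time and emits each window as soon as it closes, so memory use does not grow with the input. `stream_rollup` is a hypothetical name, not an existing tool. The sketch assumes the lines arrive in chronological order (a late line is simply counted in the current window) and that windows are aligned to midnight of the first event's day.

```python
import re
from datetime import datetime, timedelta

# Bracketed timestamp as it appears in the sample data, e.g. [01/Dec/2011:00:00:11 -0500]
TIMESTAMP_RE = re.compile(r"\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} [+-]\d{4})\]")

def stream_rollup(lines, window):
    """Yield (start, end, count) for each non-empty window, in order."""
    start = None  # start of the window currently being counted
    count = 0
    for line in lines:
        match = TIMESTAMP_RE.search(line)
        if match is None:
            continue  # skip lines without a recognizable timestamp
        try:
            ts = datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")
        except ValueError:
            continue  # skip malformed timestamps
        if start is None:
            start = ts.replace(hour=0, minute=0, second=0, microsecond=0)
        if ts >= start + window:
            # The current window is finished: emit it, then jump to the
            # window that contains this event.
            if count:
                yield start, start + window, count
            start += ((ts - start) // window) * window
            count = 0
        count += 1
    if count:
        yield start, start + window, count
```

Passing `sys.stdin` as `lines` and printing each yielded row would turn this into a filter that accepts the log on standard input.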
I tried Splunk and found it way too heavyweight and complex for my case. It’s not just a tool, it’s a whole system with its own datastore, complex indexing and querying abilities, etc.
My question is: does such a library and/or tool exist?
Full disclosure
I must admit, I actually tried and failed to find something like this a few months ago, so I wrote my own. For some reason I didn’t think to post this question at that time. I will share the lib/tool I wrote in an answer shortly. But I really am curious if something like this does exist; maybe I just missed it when I was searching a few months ago.
As mentioned in the question, I actually attempted a few months ago, unsuccessfully, to find something like this, so I wrote my own. (For some reason I didn’t think to post this question at that time.)
I took this as an opportunity to learn functional programming (FP) and to shore up my proficiency with CoffeeScript. So I wrote Rollups as a CoffeeScript tool which runs on Node. I’ve since added Scala and Clojure versions, as part of my further exploration of FP.
All the versions are intended to be usable as both a tool and a library, although they’re all only part of the way towards that — I think currently only the Clojure version is truly safe to use as a library, and I haven’t tested it that way.
The tools work as I described in my question. Given a file or set of files containing Apache access logs, I invoke them like so:
(or rollup.coffee, rollup.scala) and the output is exactly like the example in the question.
This tool solved my problem, and I’m no longer actively using it on a day-to-day basis. But I’d love to improve it further for others’ use, if I knew that others were using it. So feedback would be welcome!