I have an extensive set of Apache logs that I’m looking to parse. Specifically, a PHP script on the site passes arguments to a database to filter the results shown to the public. This script, called “searchbox.php”, passes three arguments (in its URL) whose values I’m interested in:
- engine
- query
- subengine
The rest of the information is not useful to me at this time. Here is the format of a single log entry:
sub.domain.com 123.456.789.456 - - [28/Jun/2012:00:04:00 -0500] "GET /sitescripts/search-box/searchbox.php?engine=catalog-vs-worldcat&query=law+enforcement+articles&x=0&y=0&subengine=iiikw HTTP/1.1" 302 20 "http://sub.domain.com/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:12.0) Gecko/20100101 Firefox/12.0" - 0
The information I need is in the GET request. I just need a clean way of pulling those three pieces of information out of these large log files and dumping them into either a CSV or a tab-delimited file.
I imagine this will be done in PHP, but I’m open to Python as well.
You could use regexes…
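As a rough illustration of the regex idea, here is a minimal Python sketch. It assumes every entry looks like the sample line in the question, with the request quoted as "GET <url> HTTP/x.y". The names `REQUEST_RE`, `extract_params`, and `logs_to_csv` are made up for this example. A regex is only used to isolate the request URL; the standard library's `urllib.parse` then splits and decodes the query string, so `+` and percent-escapes in the search terms come out as plain text.

```python
import csv
import re
from urllib.parse import parse_qs, urlsplit

# Hypothetical pattern: captures the URL from a quoted request line such as
# "GET /sitescripts/search-box/searchbox.php?engine=... HTTP/1.1"
REQUEST_RE = re.compile(r'"(?:GET|POST) (\S*searchbox\.php\?\S*) HTTP/[\d.]+"')

# The three arguments named in the question, in output column order.
FIELDS = ("engine", "query", "subengine")


def extract_params(line):
    """Return [engine, query, subengine] for a searchbox.php hit, else None."""
    match = REQUEST_RE.search(line)
    if match is None:
        return None
    # parse_qs decodes '+' and %XX escapes and maps each name to a list of values.
    params = parse_qs(urlsplit(match.group(1)).query, keep_blank_values=True)
    # Take the first value of each field; use an empty string if it is absent.
    return [params.get(name, [""])[0] for name in FIELDS]


def logs_to_csv(lines, out):
    """Write a header plus one CSV row per matching log line to `out`."""
    writer = csv.writer(out)
    writer.writerow(FIELDS)
    for line in lines:
        row = extract_params(line)
        if row is not None:
            writer.writerow(row)
```

To run it over a real log, something like `logs_to_csv(open("access.log"), open("searches.csv", "w", newline=""))` should work, where both file names are placeholders. For tab-delimited output, pass `delimiter="\t"` to `csv.writer`. Because the input is read line by line, large log files should not need to fit in memory.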