I recently got access to a huge amount of server log data (at my new job). I have some experience in machine learning from college. The log data includes server logs, database access logs, etc. I was wondering what kind of learning can be done from such data.
One small thing I tried was predicting the number of requests in a given hour of the day based on the past week's data. It seemed to work OK, but it is kind of trivial. So,
- What kind of learning can be done from such data?
- Maybe predicting the probability of an IP doing spam clicks on ads (yes, the company is into that) based on the usage patterns of previous spammers?
- Maybe predicting at what time the traffic may shoot up?
- Are there any existing tools/projects which specifically leverage such data?
- Any interesting resources/papers which talk about similar stuff?
- Also, there is data on process activity on the server over time. Can this be of any use for learning?
Have a look at
Wei Xu et al. (2010), "Experience on Mining Google’s Production Console Logs"
and the work they cite. In short they:
You probably cannot do 1, but you may be able to extract the variables by writing your own "parser".
There has also been a DARPA challenge to discover an attack in such data, but that was nearly 15 years ago.
There are some tools like Splunk, but apart from a nice interface they do not offer much beyond simple searching and filtering. UPDATE: There is an anomaly detection plugin by Prelert.
I am not aware of much more. Please let me know if you find anything else.
So what I would do:
You probably do not have access to the source code that generated the messages, as Xu had, but I assume that a large portion of the logs can be covered by a small number of patterns (e.g. all the firewall logs will have the same pattern). You can write regex parsers that extract features from those logs (e.g. a connection was refused at a certain time).
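
A minimal sketch of such a regex parser in Python might look like the following. The log line format, the field names, and the sample lines are all made up for illustration; your real firewall logs will almost certainly need a different pattern. It turns each matching line into a small feature record and then counts refused connections per hour, which is the kind of numeric feature you could feed into a model.

```python
import re
from collections import Counter
from datetime import datetime

# Hypothetical firewall-style line format, invented for this example.
REFUSED = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) "
    r"connection refused from (?P<ip>\d{1,3}(?:\.\d{1,3}){3}) port (?P<port>\d+)$"
)


def parse_line(line):
    """Return a feature dict for a matching line, or None otherwise."""
    m = REFUSED.match(line.strip())
    if m is None:
        return None
    return {
        "event": "connection_refused",
        "time": datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S"),
        "host": m.group("host"),
        "ip": m.group("ip"),
        "port": int(m.group("port")),
    }


def refused_per_hour(lines):
    """Count refused connections per 'YYYY-MM-DD HH' bucket."""
    counts = Counter()
    for line in lines:
        feat = parse_line(line)
        if feat is not None:
            counts[feat["time"].strftime("%Y-%m-%d %H")] += 1
    return counts


# Made-up sample lines; the third one does not match the pattern.
sample = [
    "2013-05-01 10:15:02 fw01 connection refused from 10.0.0.7 port 22",
    "2013-05-01 10:47:30 fw01 connection refused from 10.0.0.9 port 3306",
    "2013-05-01 11:02:11 fw01 session opened for user deploy",
    "2013-05-01 11:05:44 fw01 connection refused from 10.0.0.7 port 22",
]
hourly = refused_per_hour(sample)
print(dict(hourly))
```

In practice you would write one such pattern per message type and keep track of the lines that match none of them, since those tell you which patterns are still missing.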