I’m combing a webapp’s log file for statements that stand out. Most of the

Question

Asked: May 11, 2026 (2026-05-11T16:06:34+00:00)

I’m combing a webapp’s log file for statements that stand out.

Most of the lines are similar and uninteresting. I’d pass them through Unix uniq, but that filters nothing, because all the lines are slightly different: each has a different timestamp, similar statements might print a different user ID, etc.

What’s a way and/or tool to get just the lines that are notably different from any other? (But, again, not precise duplicates)

I was thinking about playing with Python’s difflib but that seems geared toward diffing two files, rather than all pairs of lines in the same file.

[EDIT]

I assumed the solution would give a uniqueness score for each line. So by “notably different” I meant that I would choose a threshold, and a line is included in the output only if its uniqueness score exceeds that threshold.

Based on that, if there are other viable ways to define it, please discuss. Also, the method doesn’t have to have 100% accuracy and recall.

[/EDIT]
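One way to make the uniqueness score concrete, as a rough sketch only: difflib's SequenceMatcher compares any two strings, not just two files, so each line can be scored as 1 minus its best similarity ratio against every other line. The function name uniqueness_scores is made up for illustration, and the all-pairs loop is quadratic in the number of lines, so this is only realistic for smallish files or a sample.

```python
from difflib import SequenceMatcher


def uniqueness_scores(lines):
    """Return one score per line: 1 minus the best similarity to any other line.

    A score near 0 means some other line is almost identical;
    a score near 1 means nothing else in the input resembles it.
    """
    scores = []
    for i, line in enumerate(lines):
        # autojunk=False turns off difflib's "popular element" heuristic,
        # which can distort ratios on longer strings.
        matcher = SequenceMatcher(None, autojunk=False)
        matcher.set_seq2(line)  # SequenceMatcher caches its analysis of seq2
        best = 0.0
        for j, other in enumerate(lines):
            if i == j:
                continue
            matcher.set_seq1(other)
            best = max(best, matcher.ratio())
        scores.append(1.0 - best)
    return scores


if __name__ == "__main__":
    import sys

    threshold = 0.3  # arbitrary cut-off, tune for your data
    log_lines = [l.rstrip("\n") for l in sys.stdin]
    for text, score in zip(log_lines, uniqueness_scores(log_lines)):
        if score > threshold:
            print(f"{score:.2f} {text}")
```

Note that a line type occurring twice scores low on both occurrences, so this flags lines with no near neighbour at all rather than returning one representative per type.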

Examples:

I’d prefer answers that are as general purpose as possible. I know I can strip away the timestamp at the beginning. Stripping the end is more challenging, as its language may be absolutely unlike anything else in the file. These sorts of details are why I shied from concrete examples before, but because some people asked…

Similar 1:

2009-04-20 00:03:57 INFO  com.foo.Bar - URL:/graph?id=1234
2009-04-20 00:04:02 INFO  com.foo.Bar - URL:/graph?id=asdfghjk

Similar 2:

2009-04-20 00:05:59 INFO  com.baz.abc.Accessor - Cache /path/to/some/dir hits: 3466 / 16534, 0.102818% misses
2009-04-20 00:06:00 INFO  com.baz.abc.Accessor - Cache /path/to/some/different/dir hits: 4352685 / 271315, 0.004423% misses

Different 1:

2009-04-20 00:03:57 INFO  com.foo.Bar - URL:/graph?id=1234
2009-04-20 00:05:59 INFO  com.baz.abc.Accessor - Cache /path/to/some/dir hits: 3466 / 16534, 0.102818% misses

In the Different 1 case, I’d like both lines returned but not other lines like them. In other words, those 2 lines are distinct types (then I can later ask for only statistically rare line types). The edit distance is much bigger between those two, for one thing.
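A cheaper way to get "distinct line types" is to mask the parts that vary and count what is left. The sketch below is tuned to the sample lines above and is not general purpose: the regular expressions and the names line_type and summarize_types are assumptions for illustration, and free-form text at the end of a line would need further masks or a fuzzy comparison.

```python
import re
from collections import Counter

# Masks for the variable parts seen in the sample lines (assumptions).
TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\s+")
VALUE = re.compile(r"(?<==)[^\s&]+")      # whatever follows '=' in a query string
PATH = re.compile(r"(?:/[\w.-]+){2,}")    # paths with at least two segments
NUMBER = re.compile(r"\d+(?:\.\d+)?")     # integers and decimals


def line_type(line):
    """Reduce a log line to a template by masking its variable parts."""
    line = TIMESTAMP.sub("", line)
    line = VALUE.sub("<VAL>", line)
    line = PATH.sub("<PATH>", line)
    line = NUMBER.sub("<NUM>", line)
    return line


def summarize_types(lines):
    """Map each line type to (first example line, number of occurrences)."""
    counts = Counter()
    first_example = {}
    for line in lines:
        template = line_type(line)
        counts[template] += 1
        first_example.setdefault(template, line)
    return {t: (first_example[t], counts[t]) for t in first_example}
```

Printing one example per type gives the "both lines, but not others like them" behaviour, and sorting the result by count ascending surfaces the statistically rare types.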


1 Answer

Editorial Team · Answer 1 · 2026-05-11T16:06:35+00:00

I don’t know a tool for you but if I were going to roll my own, I’d approach it like this:

Presumably the log lines have a well-defined structure, no? So:

1. Parse the lines on that structure.
2. Write a number of very basic relevance filters (functions that just return a simple number from the parsed structure).
3. Run the parsed lines through a set of filters, and cut on the basis of the total score.
4. Possibly sort the remaining lines into various bins by the results of more filters.
5. Generate reports, dump bins to files, or other output.
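To make the first three steps concrete, here is a minimal sketch. It assumes the line layout from the question's samples ("date time LEVEL logger - message"). The Entry type, the two filters, their weights, and the cut-off are all invented for illustration and would need replacing with whatever matters in your domain.

```python
import re
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    timestamp: str
    level: str
    logger: str
    message: str


# Assumed layout: "<date> <time> <LEVEL> <logger> - <message>"
LINE = re.compile(r"^(\S+ \S+)\s+(\w+)\s+(\S+) - (.*)$")


def parse(line: str) -> Optional[Entry]:
    """Split a line into fields, or return None if it does not fit the layout."""
    m = LINE.match(line)
    return Entry(*m.groups()) if m else None


# Each filter looks at a parsed entry and returns a simple number.
def level_filter(entry: Entry) -> int:
    return {"WARN": 2, "ERROR": 5}.get(entry.level, 0)  # hypothetical weights


def length_filter(entry: Entry) -> int:
    return 3 if len(entry.message) > 200 else 0  # very long requests look odd


FILTERS = [level_filter, length_filter]


def total_score(entry: Entry) -> int:
    return sum(f(entry) for f in FILTERS)


def interesting(lines, cutoff=3):
    """Keep lines whose total score reaches the cut-off.

    Lines that do not parse at all are kept too, on the assumption
    that an unexpected format is itself worth a look.
    """
    kept = []
    for line in lines:
        entry = parse(line.rstrip("\n"))
        if entry is None or total_score(entry) >= cutoff:
            kept.append(line)
    return kept
```

The filters ignore the timestamp entirely, which is the point: scoring works on the parsed content rather than on the raw text.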

If you are familiar with the Unix tool procmail, I’m suggesting a similar treatment customized for your data.

As zacherates notes in the comments, your filters will typically ignore timestamps (and possibly IP addresses) and just concentrate on the content: for example, really long HTTP requests might represent an attack…or whatever applies to your domain.

Your binning filters might be as simple as a hash on a few selected fields, or you might try to do something with Charlie Martin’s suggestion and use edit-distance measures.
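For the fuzzier kind of binning, a greedy pass is often enough: put each message into the first bin whose representative it resembles, otherwise start a new bin. This sketch uses difflib's similarity ratio as a stand-in for a true edit distance (it is related but not the same thing as Levenshtein distance). The name bin_by_similarity and the 0.6 threshold are assumptions to tune.

```python
from difflib import SequenceMatcher


def bin_by_similarity(messages, threshold=0.6):
    """Greedily group messages; the first member of each bin is its representative."""
    bins = []
    for msg in messages:
        for members in bins:
            ratio = SequenceMatcher(None, members[0], msg, autojunk=False).ratio()
            if ratio >= threshold:
                members.append(msg)
                break
        else:
            # No existing bin was close enough, so this message starts a new one.
            bins.append([msg])
    return bins
```

Feed it messages with the timestamp already stripped. The cost is roughly the number of lines times the number of bins, and the result depends on input order, since only the first member of each bin is compared. The exact alternative is a dict keyed on a tuple of selected fields, such as (level, logger).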
