I’m currently parsing a log file that has the following structure:
1) timestamp, preceded by # character and followed by \n
2) arbitrary # of events that happened after that timestamp and all followed by \n
3) repeat..
Here is an example:
#100
04!
03!
02!
#1299
0L
0K
0J
0E
#1335
06!
0X#
0[#
b1010 Z$
b1x [$
...
Please forgive the seemingly cryptic values, they are encodings representing certain “events”.
Note: Event encodings may also use the # character.
What I am trying to do is to count the number of events that happen at a certain time.
In other words, at time 100, 3 events happened.
I am trying to match all text between two timestamps – and count the number of events by simply counting the number of newlines enclosed in the matched text.
I’m using Python’s regex engine, and I’m using the following expression:
pattern = re.compile('(#[0-9]{2,}.*)(?!#[0-9]+)')
Note: The {2,} is because I want timestamps with at least two digits.
I match a timestamp, continue matching any other characters until hitting another timestamp – ending the matching.
What this returns is:
#100
#1299
#1335
So, I get the timestamps – but none of the events data – what I really care about!
I’m thinking the reason for this is that the negative lookahead is “greedy” – but I’m not completely sure.
There may be an entirely different regex that makes this much simpler – open to any suggestions!
Any help is much appreciated!
-k
If you insist on a regex-based solution, I propose this:
Explanation:
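You seem to have misunderstood what negative lookahead does. When it follows .*, the regex engine first tries to consume as many characters as possible and only then checks the lookahead pattern. If the lookahead assertion fails, the engine backtracks character by character until it succeeds. In your pattern, . does not match a newline, so .* stops at the end of the timestamp line; the next character is \n rather than a # followed by digits, so the negative lookahead succeeds immediately and the match ends right there. That is why you only get the timestamps.
You could, however, use positive lookahead together with the non-greedy .*?. Here the .*? will consume characters until the lookahead sees either a # at start of a line, or the end of the whole string: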
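A minimal sketch of that idea, not necessarily the exact pattern originally posted. It assumes a timestamp is a whole line consisting of # followed by at least two digits, and that no event encoding has that same shape. The names log and count_events are made up for illustration; the sample data is taken from the question.

```python
import re

# Sample data from the question (the trailing "..." line is left out).
log = """#100
04!
03!
02!
#1299
0L
0K
0J
0E
#1335
06!
0X#
0[#
b1010 Z$
b1x [$
"""

# ^#(\d{2,})\n   a timestamp line: '#', two or more digits, newline
# (.*?)          lazily consume the events that follow (DOTALL lets . match \n)
# (?=...)        stop just before the next timestamp line or at the end of the string
pattern = re.compile(
    r'^#(\d{2,})\n(.*?)(?=^#\d{2,}$|\Z)',
    re.MULTILINE | re.DOTALL,
)

def count_events(text):
    """Map each timestamp to the number of event lines that follow it."""
    counts = {}
    for m in pattern.finditer(text):
        # Every event is terminated by '\n', so counting newlines counts events.
        counts[int(m.group(1))] = m.group(2).count("\n")
    return counts

print(count_events(log))  # {100: 3, 1299: 4, 1335: 5}
```

Because the lookahead only accepts a # at the start of a line followed by nothing but digits, events such as 0X# that merely contain a # do not end the match early.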