I’m looking for a way to improve this regular expression:
^(?:([^.]+).?){6}_tid
This extracts the 6th field of a point.separated.string.of.arbitrary.lengths, up to “_tid”.
So if it looks like this:
mc11_7tev.138345.dgnol_tb6_m12u_140_140_110_2l_jimmy_susy.evgen.log.e825_tid431423_0
it should return
e825
Funnily enough, if I remove the _tid part of the regex, leaving ^(?:([^.]+).?){6}, I get the performance I was looking for: 1 to 2 seconds to check a million strings.
With the _tid, it takes up to 5 minutes.
Is there a better way to do this?
EDIT:
Ah, I forgot to mention, this is in Apache Pig, so everything should be in the regex clause.
You forgot to escape the dot. Try this:
^(?:([^.]+)\.?){6}_tid
This way your regex has far fewer possibilities to try. An unescaped “.” matches any character (except line break characters), not just a literal dot.
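To make the difference concrete, here is a minimal sketch in Python rather than Pig (an assumption made for illustration; both patterns use only syntax that Python's re module and Java regex share). The helper name sixth_field is hypothetical. It checks that the escaped pattern still returns the expected field for the sample string from the question.

```python
import re

# Sample input taken from the question.
SAMPLE = ("mc11_7tev.138345.dgnol_tb6_m12u_140_140_110_2l_jimmy_susy"
          ".evgen.log.e825_tid431423_0")

# Original pattern: the unescaped "." can consume any character.
UNESCAPED = re.compile(r"^(?:([^.]+).?){6}_tid")
# Escaped variant: "\." matches only a literal dot.
ESCAPED = re.compile(r"^(?:([^.]+)\.?){6}_tid")


def sixth_field(pattern, text):
    """Return capture group 1 if the pattern matches at the start, else None."""
    match = pattern.match(text)
    return match.group(1) if match else None


unescaped_result = sixth_field(UNESCAPED, SAMPLE)
escaped_result = sixth_field(ESCAPED, SAMPLE)
print(unescaped_result, escaped_result)
```

Both patterns give the same answer on this input; what changes is how many alternatives the engine may have to try before it matches or gives up. In a Pig script the backslash will probably need to be doubled inside the quoted string ('\\.'), so check that against your Pig version.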
The other possibility I see is to get rid of the optional dot, so that each of the first five fields has to end in a literal dot.
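One way this could look, as a sketch and not necessarily the exact expression originally posted: require five fields that each end in a literal dot, then capture the sixth field up to _tid. The example is again in Python for illustration, and the names MANDATORY_DOT and extract are hypothetical.

```python
import re

# Sample input taken from the question.
SAMPLE = ("mc11_7tev.138345.dgnol_tb6_m12u_140_140_110_2l_jimmy_susy"
          ".evgen.log.e825_tid431423_0")

# Five fields, each followed by a mandatory literal dot, then the captured
# sixth field, then "_tid". With no optional dot, a field cannot be split
# across several iterations of the repeated group.
MANDATORY_DOT = re.compile(r"^(?:[^.]+\.){5}([^.]+)_tid")


def extract(text):
    """Return the sixth field before "_tid", or None if the shape differs."""
    match = MANDATORY_DOT.match(text)
    return match.group(1) if match else None


result = extract(SAMPLE)
too_few_fields = extract("a.b.c_tid1")
print(result, too_few_fields)
```

Because the captured group is greedy, a sixth field that contained _tid more than once would be cut at the last occurrence; writing ([^.]+?) would cut at the first one instead.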