How can I fix this RegEx to optionally capture a file extension?
I am trying to match a string with an optional component, but something appears to be wrong. (The strings being matched are from a printer log.)
My RegEx (.NET Flavor) is as follows:
.*(header_\d{10,11}_).*(_.*_\d{8}).*(\.\w{3,4}).*
-------------------------------------------
.*                  # Ignore some garbage in the front
(header_            # Match the start of the file name,
\d{10,11}_)         # including the ID (10 - 11 digits)
.*                  # Ignore the type code in the middle
(_.*_\d{8})         # Match some random characters, then an 8-digit date
.*                  # Ignore anything between this and the file extension
(\.\w{3,4})         # Match the file extension, 3 or 4 characters long
.*                  # Ignore the rest of the string
I expect this to match strings like:
str1 = 'header_0000000602_t_mc2e1nrobr1a3s55niyrrqvy_20081212[1].doc [Compatibility Mode]'
str2 = 'Microsoft PowerPoint - header_00000000076_d_al41zguyvgqfj2454jki5l55_20071203[1].txt'
str3 = 'header_00000000076_d_al41zguyvgqfj2454jki5l55_20071203[1]'
Where the capture groups return something like:
$1 = header_0000000602_
$2 = _mc2e1nrobr1a3s55niyrrqvy_20081212
$3 = .doc
Where $3 can be empty if no file extension is found. $3 is the optional part, as you can see in str3 above.
If I add ‘?’ to the end of the third capture group, ‘(\.\w{3,4})?’, the RegEx no longer captures $3 for any string. If I add ‘+’ instead, ‘(\.\w{3,4})+’, the RegEx no longer matches str3 at all, which is to be expected.
I feel that using ‘?’ at the end of the third capture group is the appropriate thing to do, but it doesn’t work as I expect. I am probably being too naive with the ‘.*’ sections that I use to ignore parts of the string.
Doesn’t Work As Expected:
.*(header_\d*_).*(_.*_.{8}).*(\.\w{3,4})?.*
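The behaviour described above can be reproduced with a short script. This is only a sketch: it uses Python's `re` module as a stand-in for .NET (an assumption; the constructs involved here, greedy ‘.*’, `\d`, `\w` and an optional group, should behave the same way in both flavors), and the names `REQUIRED`, `OPTIONAL` and `ext` are made up for illustration. It suggests why the ‘?’ version never fills $3: the greedy ‘.*’ just before the group consumes the rest of the string first, and the optional group is then satisfied by matching nothing, so the engine never needs to backtrack.

```python
import re

# Sample strings copied from the question.
STR1 = "header_0000000602_t_mc2e1nrobr1a3s55niyrrqvy_20081212[1].doc [Compatibility Mode]"
STR2 = "Microsoft PowerPoint - header_00000000076_d_al41zguyvgqfj2454jki5l55_20071203[1].txt"
STR3 = "header_00000000076_d_al41zguyvgqfj2454jki5l55_20071203[1]"

# Original pattern: the extension group is mandatory.
REQUIRED = re.compile(r".*(header_\d{10,11}_).*(_.*_\d{8}).*(\.\w{3,4}).*")
# Pattern from "Doesn't Work As Expected": the extension group is optional.
OPTIONAL = re.compile(r".*(header_\d*_).*(_.*_.{8}).*(\.\w{3,4})?.*")


def ext(pattern, text):
    """Return group 3 (the extension), or None if nothing was captured."""
    m = pattern.match(text)
    return m.group(3) if m else None


for s in (STR1, STR2, STR3):
    # First column: mandatory extension group; second: optional group.
    print(repr(ext(REQUIRED, s)), repr(ext(OPTIONAL, s)))
```

With the mandatory group, str1 and str2 yield `.doc` and `.txt` and str3 does not match at all. With the optional group, all three strings match, but group 3 is never captured.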
One possibility is that the second-to-last ‘.*’ is being greedy. You might try changing it to:

That wasn’t correct. This one will match the input you supplied, but it assumes that the first ‘.’ it encounters is the start of a file extension:

Edit: Removed the escaping I had in the second regex.
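To make the idea concrete, here is one possible pattern that behaves as described, i.e. it treats the first ‘.’ after the date as the start of the extension. It is a sketch, not necessarily the exact regex meant above: the names `FIXED` and `LAZY` are hypothetical, and Python's `re` module is used as a stand-in for .NET on the assumption that these constructs behave the same in both. The ‘.*’ before the extension is replaced with ‘[^.]*’, which cannot run past a dot, so the optional group gets a chance to match. A merely lazy ‘.*?’ is included for comparison, since on its own it does not appear to help.

```python
import re

# Sample strings copied from the question.
STR1 = "header_0000000602_t_mc2e1nrobr1a3s55niyrrqvy_20081212[1].doc [Compatibility Mode]"
STR2 = "Microsoft PowerPoint - header_00000000076_d_al41zguyvgqfj2454jki5l55_20071203[1].txt"
STR3 = "header_00000000076_d_al41zguyvgqfj2454jki5l55_20071203[1]"

# Hypothetical fix: '[^.]*' stops at the first dot after the date, so the
# optional extension group is tried exactly there.
FIXED = re.compile(r".*(header_\d{10,11}_).*(_.*_\d{8})[^.]*(\.\w{3,4})?.*")

# For comparison: only making the '.*' lazy. The lazy '.*?' first matches
# nothing, the optional group fails at that position and is skipped, and
# the trailing '.*' swallows the rest, so group 3 stays empty.
LAZY = re.compile(r".*(header_\d{10,11}_).*(_.*_\d{8}).*?(\.\w{3,4})?.*")

for s in (STR1, STR2, STR3):
    fixed = FIXED.match(s)
    lazy = LAZY.match(s)
    # Group 3 is None when the optional extension was not captured.
    print(fixed.groups(), "| lazy $3:", lazy.group(3))
```

The caveat from the answer still applies: any dot that appears between the date and the real extension would be taken as the start of the extension, or would leave $3 empty if fewer than three word characters follow it.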