I have a string, where
text='<tr align="right"><td>12</td><td>John</td>
and I would like to extract the tuple ('12', 'John'). It works fine when I use
m=re.findall(r'align.{13}(\d+).*([A-Z]\w+).*([A-Z]\w+)', text)
print m
but I get ('2', 'John') when I use
m=re.findall(r'align.+(\d+).*([A-Z]\w+).*([A-Z]\w+)', text)
print m
Why is it going wrong? That is, why does .{13} work fine while .+ fails in my regex?
Thank you!
I can’t actually test this with the sample text and regexps you provided, because as written they clearly should find no matches, and in fact do find no matches in both 2.7 and 3.3.
But I’m guessing that you want a non-greedy match, and changing .+ to .+? will fix whatever your problem is.
As Jon Clements points out in his answer, you really shouldn’t be using regular expressions here. Regexps cannot actually parse non-regular languages like XML. Of course, despite what the purists say, regexps can still be a useful hack for non-regular languages in quick & dirty cases. But as soon as you run into something that isn’t working, the first thing you ought to do is consider that maybe this isn’t one of those quick & dirty cases, and you should look for a real parser. Even if you’d never used the ElementTree API before, or XPath, they’re pretty easy to learn, and the time spent learning is definitely not wasted, as it will come in handy many times in the future.
But anyway… let’s reduce your sample to something that works as you describe, and see what this does:
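The reduced sample itself is not preserved here, so what follows is only a reconstruction. It assumes the question's text with its closing quote added, and it drops the third capture group, since the sample contains only one capitalized word and the original three-group pattern therefore cannot match.

```python
import re

# Assumed reduction of the question's sample (closing quote added).
text = '<tr align="right"><td>12</td><td>John</td>'

# Fixed-width skip: exactly the 13 characters '="right"><td>' after 'align'.
fixed = re.findall(r'align.{13}(\d+).*([A-Z]\w+)', text)

# Greedy skip: .+ consumes as much as it can, leaving only one digit for \d+.
greedy = re.findall(r'align.+(\d+).*([A-Z]\w+)', text)

print(fixed)   # [('12', 'John')]
print(greedy)  # [('2', 'John')]
```

The single-argument print calls behave the same way in Python 2.7 and 3.x.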
I think this is what you were complaining about. Well, .+ is not “not working properly”; it’s doing exactly what you asked it to: match at least one character, and as many as possible, up to the point where the rest of the expression still has something to match. Which includes matching the 1, because the rest of the expression still matches.
If you want it to instead stop matching as soon as the rest of the expression can take over, that’s a non-greedy match, not a greedy match, so you want +? rather than +. Let’s try it:
Tada.
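The non-greedy version can be sketched as follows, under the same assumptions as before: the question's text with a closing quote added, and a two-group pattern instead of the original three-group one. The two extra searches are only there to show what each quantifier actually swallows; the variable names are illustrative.

```python
import re

# Assumed reduction of the question's sample (closing quote added).
text = '<tr align="right"><td>12</td><td>John</td>'

# Non-greedy skip: .+? stops as soon as the rest of the pattern can match.
lazy = re.findall(r'align.+?(\d+).*([A-Z]\w+)', text)
print(lazy)  # [('12', 'John')]

# What each quantifier consumes before the digits:
greedy_skip = re.search(r'align(.+)(\d+)', text).group(1)
lazy_skip = re.search(r'align(.+?)(\d+)', text).group(1)
print(greedy_skip)  # ="right"><td>1
print(lazy_skip)    # ="right"><td>
```

The greedy form eats the 1 because a single remaining digit is still enough for \d+ to succeed; the lazy form hands over to \d+ at the first digit it meets.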