I have a string, where text='<tr align=right><td>12</td><td>John</td> and I would like to extract the

Question


Asked: June 15, 2026 (2026-06-15T18:28:19+00:00)

I have a string, where

text='<tr align="right"><td>12</td><td>John</td>

and I would like to extract the tuple ('12', 'John'). It is working fine when I am using

m=re.findall(r'align.{13}(\d+).*([A-Z]\w+).*([A-Z]\w+)', text)

print m

but I am getting ('2', 'John'), when I am using

m=re.findall(r'align.+(\d+).*([A-Z]\w+).*([A-Z]\w+)', text)
print m

Why is it going wrong? I mean, why does .{13} work fine while .+ fails in my regex?
Thank you!


1 Answer

Editorial Team · Answer 1 · 2026-06-15T18:28:20+00:00

I can’t actually test this with the sample text and regexps you provided, because as written they clearly should find no matches, and in fact do find no matches in both 2.7 and 3.3.

But I’m guessing that you want a non-greedy match, and changing .+ to .+? will fix whatever your problem is.

As Jon Clements points out in his answer, you really shouldn’t be using regular expressions here. Regexps cannot actually parse non-regular languages like XML. Of course, despite what the purists say, regexps can still be a useful hack for non-regular languages in quick&dirty cases. But as soon as you run into something that isn’t working, the first thing you ought to do is consider that maybe this isn’t one of those quick&dirty cases, and you should look for a real parser. Even if you’ve never used the ElementTree API or XPath before, they’re pretty easy to learn, and the time spent learning is definitely not wasted, as it will come in handy many times in the future.
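For comparison, here is a minimal sketch of the parser route using the standard library's ElementTree. It assumes the row is well-formed XML, so a closing </tr> has been added to the sample; the names row_text, row and cells are made up for illustration. Real-world HTML is often not well-formed, in which case a tolerant HTML parser would be the better fit.

```python
import xml.etree.ElementTree as ET

# Assumption: the row is well-formed XML, so a closing </tr> is added here.
row_text = '<tr align="right"><td>12</td><td>John</td></tr>'

# Parse the row and collect the text of each direct <td> child.
row = ET.fromstring(row_text)
cells = tuple(td.text for td in row.findall('td'))

print(cells)  # ('12', 'John')
```

No quantifiers are involved, so there is nothing to get greedy.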

But anyway… let’s reduce your sample to something that works as you describe, and see what this does:

>>> import re
>>> text='<tr align="right"><td>12</td><td>John</td>
SyntaxError: EOL while scanning string literal
>>> text='<tr align="right"><td>12</td><td>John</td>'
>>> re.findall(r'align.{13}(\d+).*([A-Z]\w+).*([A-Z]\w+)', text)
[]
>>> re.findall(r'align.{13}(\d+).*([A-Z]\w+)', text)
[('12', 'John')]
>>> re.findall(r'align.+(\d+).*([A-Z]\w+).*([A-Z]\w+)', text)
[]
>>> re.findall(r'align.+(\d+).*([A-Z]\w+)', text)
[('2', 'John')]

I think this is what you were complaining about. Well, .+ is not “not working properly”; it’s doing exactly what you asked it to: match at least one character, and as many as possible, as long as the rest of the expression can still match. That includes swallowing the 1, because (\d+) needs only one digit and the 2 that is left over satisfies it.
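One way to see this for yourself is to wrap the .+ in its own capturing group and look at what it consumed. This is only an illustrative sketch in Python 3; the extra group and the variable name greedy are not part of the original pattern.

```python
import re

text = '<tr align="right"><td>12</td><td>John</td>'

# Same pattern as above, but with .+ wrapped in a group so its match is visible.
greedy = re.search(r'align(.+)(\d+).*([A-Z]\w+)', text)

# The greedy .+ backs off only as far as it must, leaving a single digit.
print(greedy.groups())  # ('="right"><td>1', '2', 'John')
```

The first group ends in 1: the greedy .+ gave back just enough for (\d+) to match one digit.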

If you want it to instead stop matching as soon as the rest of the expression can take over, that’s a non-greedy match, not a greedy match, so you want +? rather than +. Let’s try it:

>>> re.findall(r'align.+?(\d+).*([A-Z]\w+)', text)
[('12', 'John')]

Tada.
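This also suggests why .{13} happened to work on this sample. As a quick sketch (Python 3, with the non-greedy part wrapped in a group purely for inspection, and lazy and skipped as made-up names), you can check how many characters .+? actually skips:

```python
import re

text = '<tr align="right"><td>12</td><td>John</td>'

# Non-greedy .+? grows one character at a time until (\d+) can start matching.
lazy = re.search(r'align(.+?)(\d+).*([A-Z]\w+)', text)
skipped = lazy.group(1)

print(repr(skipped), len(skipped))  # '="right"><td>' 13
print(lazy.group(2), lazy.group(3))  # 12 John
```

The skipped stretch is exactly 13 characters long in this string, so .{13} lands on the same spot, but only for input with that exact layout.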
