I could really use some help with a Python regular expression problem. You’d expect the result of
import re
re.sub("s (.*?) s", "no", "this is a string")
to be “this is no string”, right? But in reality it’s “thinotring”. The sub function uses the entire pattern as the group to replace, instead of just the group I actually want to replace.
All re.sub examples deal with simple word replacement, but what if you want to change something depending on the rest of the string? Like in my example…
Any help would be greatly appreciated.
Edit:
The look-behind and look-ahead tricks won’t work in my case, as those need to be fixed width. Here is my actual expression:
re.sub(r"<a.*?href=['\"]((?!http).*?)['\"].*?>", 'test', string)
I want to use it to find all links in a string that don’t begin with http, so I can put a prefix in front of those links (to make them absolute rather than relative).
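Your regex matches everything from the “s ” at the end of “this” up to and including the “ s” at the start of “string” (that is, “s is a s”). re.sub replaces the whole match, not just the group, so if you replace the match with “no”, you get “thinotring”.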
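To see this for yourself, you can inspect the match object. This is only a small sketch using the pattern and string from the question; the variable names are mine.

```python
import re

pattern = "s (.*?) s"
text = "this is a string"

m = re.search(pattern, text)
whole_match = m.group(0)  # the full text that re.sub will replace
captured = m.group(1)     # only what the parentheses matched
result = re.sub(pattern, "no", text)

print(repr(whole_match))  # 's is a s'
print(repr(captured))     # 'is a'
print(result)             # thinotring
```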
The parentheses don’t limit the match; they capture the text matched by whatever is inside them in a special variable called a backreference. In your example, backreference number 1 would contain “is a”. You can refer to a backreference later in the same regex using a backslash and the number of the backreference: \1.
What you probably want is lookaround:
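A minimal sketch of the lookaround version, reusing the string from the question. Note that the text between the two assertions is “is a”, so the result is “this no string” rather than the “this is no string” the question expected.

```python
import re

text = "this is a string"

# (?<=s ) requires "s " immediately before the match,
# (?= s) requires " s" immediately after it.
# Neither assertion consumes characters, so only the text in between is replaced.
result = re.sub(r"(?<=s )(.*?)(?= s)", "no", text)

print(result)  # this no string
```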
(?<=s ) means: Assert that it is possible to match “s ” before the current position in the string, but don’t make it part of the match.
Same for (?= s), but it asserts that the string will continue with “ s” after the current position.
Be advised that lookbehind in Python is limited to strings of fixed length. So if that is a problem, you can sort of work around this using… backreferences!
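Something along these lines: capture the context on both sides and put it back in the replacement with \1 and \3. This is a sketch, not necessarily the only way to do it; the second string with extra spaces is my own example, added to show context of variable width, which a lookbehind would reject.

```python
import re

# Fixed-width context: same result as the lookaround version.
result = re.sub(r"(s )(.*?)( s)", r"\1no\3", "this is a string")
print(result)  # this no string

# Variable-width context (one or more whitespace characters on each side).
spaced = re.sub(r"(s\s+)(.*?)(\s+s)", r"\1no\3", "this   is a   string")
print(spaced)  # this   no   string

# The equivalent variable-width lookbehind does not compile in Python's re module.
try:
    re.compile(r"(?<=s\s+)x")
    lookbehind_ok = True
except re.error:
    lookbehind_ok = False
print(lookbehind_ok)  # False
```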
OK, this is a contrived example, but it shows what you can do. From your edit, it’s becoming apparent that you’re trying to parse HTML with regex. Now that is not such a good idea. Search SO for “regex html” and you’ll see why.
If you still want to do it, capturing the part of the tag that precedes the URL and putting it back with a backreference might work. But this is extremely brittle.
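For the link-prefixing case in the question, a rough sketch using the backreference approach could look like the following. The prefix, the sample markup and the variable names are made up for illustration. I also swapped the question's .*? before href for [^>]*? so the match cannot run past the end of the tag. It will still break on many real-world pages.

```python
import re

# Hypothetical prefix and sample markup, for illustration only.
# PREFIX must not contain backslashes, since it becomes part of the replacement template.
PREFIX = "http://example.com/"
html = (
    '<a href="about.html">About</a> '
    "<a class=\"x\" href='http://other.org/'>Other</a> "
    '<a href="docs/index.html">Docs</a>'
)

# Group 1 captures everything up to and including the opening quote of href.
# The negative lookahead (?!http) skips URLs that already start with "http".
pattern = r"""(<a\b[^>]*?href=['"])(?!http)"""

# Put group 1 back and insert the prefix right after the opening quote.
result = re.sub(pattern, r"\1" + PREFIX, html)
print(result)
```

Only the two relative links get the prefix; the link that already starts with http is left alone.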