I am on Google App Engine with Python 2.7 and here is the code snippet:
# -*- coding: utf-8 -*-
import re
KEYWORD = u"英語"
URL = u"http://www.google.com/"
content = u"和製英語(わせいえいご)とは、日本で作られた英語風の日本語語彙のことである。"
p=re.compile(u'('+ KEYWORD +u')(?!(([^<>]*?)>)|([^>]*?</a>))',re.UNICODE)
output=p.sub(u'<a href="'+ URL +'">\1</a>',content)
The regular expression and p.sub work correctly, but the backreference \1 won't work! Instead of the matched text, \1 comes out as something like this: ន
I tried modifying the code with encode('utf-8'), but the result is the same:
p=re.compile(u'('+ KEYWORD +u')(?!(([^<>]*?)>)|([^>]*?</a>))'.encode('utf-8'),re.UNICODE)
output=p.sub(u'<a href="'+ URL +'">\1</a>'.encode('utf-8'),content.encode('utf-8'))
Can anyone tell me what is wrong?
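Turn the string containing \1 into a raw string by adding an r immediately before it (in Python 2.7 a unicode literal takes the prefix ur, e.g. ur'">\1</a>'). This prevents \1 from being interpreted as an octal escape for the character U+0001, so the regex engine receives it as a backreference to group 1.
Proof: run p.sub once with the original replacement string and once with the raw-string version.
Only the latter works (英語 is within the google link).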
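A minimal sketch of that comparison, reusing the names from the question. The variables broken and fixed are introduced here only for illustration, and the raw part of the replacement is written as a separate r'...' literal so the snippet is not tied to the Python 2-only ur prefix:

```python
# -*- coding: utf-8 -*-
import re

KEYWORD = u"英語"
URL = u"http://www.google.com/"
content = u"和製英語(わせいえいご)とは、日本で作られた英語風の日本語語彙のことである。"

p = re.compile(u'(' + KEYWORD + u')(?!(([^<>]*?)>)|([^>]*?</a>))', re.UNICODE)

# Non-raw literal: Python turns '\1' into the single character U+0001
# before re.sub ever sees it, so no backreference is left in the string.
broken = p.sub(u'<a href="' + URL + u'">\1</a>', content)

# Raw literal: the backslash survives, so re.sub reads \1 as group 1.
fixed = p.sub(u'<a href="' + URL + r'">\1</a>', content)

print(u'\x01' in broken)   # True: the control character was inserted
print(KEYWORD in fixed)    # True: the matched keyword is inside the link
```

Doubling the backslash (u'">\\1</a>') should behave the same as the raw literal, if you would rather not mix string prefixes.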