Here is the pattern
import re

pattern_strings = ['\\xc2d', '\\xa0', '\\xe7', '\\xc3\\ufffdd', '\\xc2\\xa0', '\\xc3\\xa7', '\\xa0\\xa0', '\\xc2', '\\xe9']
join_pattern = '[' + '|'.join(pattern_strings) + ']'
pattern = re.compile(join_pattern)
Here is the function
import logging

def find_pattern(path):
    with open(path, 'r') as f:
        for line in f:
            # print line
            found = pattern.search(line)
            if found:
                print dir(found)
                logging.info('found in line - ' + line)
                logging.info('found - ' + str(found.group(0)))
Here is the input
\xc2d
d\xa0
\xe7
\xc3\ufffdd
\xc3\ufffdd
\xc2\xa0
\xc3\xa7
\xa0\xa0
'619d813\xa03697'
When I run this, I get the following output:
INFO:root:found in line - \xc2d
INFO:root:found - d
INFO:root:found in line - d\xa0
INFO:root:found - d
INFO:root:found in line - \xc3\ufffdd
INFO:root:found - u
INFO:root:found in line - \xc3\ufffdd
INFO:root:found - u
INFO:root:found in line - '619d813\xa03697'
INFO:root:found - d
Question
- Why doesn't it report the entire pattern, like \xc2d? Am I doing something incorrect here?
- What do I need to do to get the whole pattern matched, like \xc2d, instead of just d?
UPDATE
Changing to join_pattern = '(' + '|'.join(pattern_strings) + ')' doesn't match anything.
UPDATE 1
pattern_strings = ['\\xc2d', '\\xa0', '\\xe7', '\\xc3\\ufffdd', '\\xc2\\xa0', '\\xc3\\xa7', '\\xa0\\xa0', '\\xc2', '\\xe9']
join_pattern = '|'.join(pattern_strings)
pattern = re.compile(join_pattern)
This doesn't match anything in the input either 🙁
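Square brackets in re denote a character set, so join_pattern = '[' + '|'.join(pattern_strings) + ']' causes the regex to match any single character from that set. Because re reads an escape such as \xc2 as one character, the set here is roughly { \xc2 d | \xa0 \xe7 \xc3 u f \xe9 }, which is why the match comes back as d or u rather than the whole sequence. This is probably not the behavior you want. For the expression you want, just join the alternatives with | and leave out the brackets. No need for parentheses, unless you are trying to specify capture/non-capture groups.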
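A minimal sketch of one possible fix. It assumes the input file holds the literal text \xc2d (a backslash followed by x, c, 2, d) rather than the byte 0xC2, which the logged output suggests. In that case a plain alternation still finds nothing, because re reads \xc2 as a hex escape for a single character. Passing each string through re.escape makes it match literally. The helper name find_literal and the longest-first ordering are illustrative choices, not part of the original code.

```python
import re

# The literal sequences to look for. Each one starts with a real backslash,
# exactly as it is assumed to appear in the input file.
pattern_strings = ['\\xc2d', '\\xa0', '\\xe7', '\\xc3\\ufffdd', '\\xc2\\xa0',
                   '\\xc3\\xa7', '\\xa0\\xa0', '\\xc2', '\\xe9']

# Alternation tries the alternatives left to right, so put the longest
# first; otherwise '\xa0' would win over '\xa0\xa0'.
ordered = sorted(pattern_strings, key=len, reverse=True)

# re.escape turns each string into a regex that matches it character
# for character, so '\x..' is no longer treated as a hex escape.
pattern = re.compile('|'.join(re.escape(s) for s in ordered))


def find_literal(line):
    """Return the first literal sequence found in line, or None."""
    found = pattern.search(line)
    return found.group(0) if found else None


for line in ['\\xc2d', 'd\\xa0', '\\xa0\\xa0', "'619d813\\xa03697'"]:
    print('%r -> %r' % (line, find_literal(line)))
```

If the file instead contains the actual bytes (0xC2, 0xA0 and so on), the strings should be written with single backslashes so they hold those characters, and the same escape-and-join approach applies.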