I have a CP1252-encoded text file in which I match patterns using Python regular expressions. For example, the following text is matched by the regex '1\s*(\w*)\s*(<.*$)':
1 kAMpleksa <fs af='kAMpleksa,unk,,,,,,'>
But when the text contains special characters, like the accented ‘Ù’ in the following text, the regex fails to match:
1 aBiyukÙwa <fs af='aBiyuk,unk,,,,,,'>
I am reading the text from the file with Python’s codecs module, like this:
codecs.open('/home/abcl/TokenOutput.wx', 'r', 'cp1252')
Any ideas on how to go about this?
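This works on both my machines, but I’m copying and pasting the text in, so there might be some invisible translation happening. Have you tried setting the unicode flag? You can either pass re.UNICODE as the flags argument, or put the inline form (?u) at the start of the pattern. Without that flag, Python 2 only treats [a-zA-Z0-9_] as \w, so the match stops at the ‘Ù’.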
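The two ways of setting the unicode flag might look something like the sketch below. It is only an illustration, not the exact code from the original answer: the sample line is hard-coded (with the ‘Ù’ written as the escape \u00d9) instead of being read from the file, and the variable names are made up for the example. On Python 3 the flag is already the default for text patterns, so passing it there is harmless but redundant.

```python
import re

# Sample line from the question, as a unicode string.
# \u00d9 is the accented 'Ù'; in the real program this text would
# come from codecs.open(..., 'r', 'cp1252'), which yields unicode.
line = u"1 aBiyuk\u00d9wa <fs af='aBiyuk,unk,,,,,,'>"

# Same pattern as in the question (backslashes doubled because this
# is not a raw string).
pattern = u"1\\s*(\\w*)\\s*(<.*$)"

# Option 1: pass the flag explicitly.
with_flag = re.match(pattern, line, re.UNICODE)

# Option 2: use the inline form at the very start of the pattern.
with_inline = re.match(u"(?u)" + pattern, line)

word = with_flag.group(1)   # the token, including the accented letter
tag = with_flag.group(2)    # everything from '<' to the end of the line
```

If the match still fails with the flag set, it is worth printing repr(line) to check that the file really decoded to the character you expect.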