I have a CP1252 encoded text file, where I match patterns using python regex.

Asked: May 23, 2026 (2026-05-23T15:53:24+00:00)

I have a CP1252-encoded text file in which I match patterns using Python regular expressions. For example, the following text is matched by the regex string '1\s*(\w*)\s*(<.*$)'

1   kAMpleksa       <fs af='kAMpleksa,unk,,,,,,'>

But when the text contains special characters like the accented ‘U’ in the following text, the regex fails to match.

1   aBiyukÙwa       <fs af='aBiyuk,unk,,,,,,'>

I am reading the text from the file with Python’s codecs module, as follows:

codecs.open('/home/abcl/TokenOutput.wx', 'r', 'cp1252')

Any ideas on how to go about it?

1 Answer

Editorial Team · Answer 1 · 2026-05-23T15:53:25+00:00

This works on both my machines, but I’m copying and pasting the text in, so some invisible translation might be happening. Have you tried setting the Unicode flag, so that \w also matches non-ASCII letters such as Ù? As in

'(?u)1\s*(\w*)\s*(<.*$)'

or

re.match(r, t, flags=re.U).group()
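
To check the flag's effect end to end, here is a minimal sketch. It assumes the line has already been decoded from CP1252 into a Unicode string (which is what codecs.open with 'cp1252' gives you); the helper name match_line is made up for illustration, and the file read is simulated with an encode/decode round trip rather than the real TokenOutput.wx.

```python
import re

# Pattern from the question, written as a raw string. re.UNICODE makes \w
# match non-ASCII letters such as U+00D9; for str patterns in Python 3 this
# is already the default, so the flag is harmless there.
PATTERN = re.compile(r"1\s*(\w*)\s*(<.*$)", re.UNICODE)


def match_line(line):
    """Hypothetical helper: return (token, tag) if the line matches, else None."""
    m = PATTERN.match(line)
    return m.groups() if m else None


# Simulate reading one line of a CP1252 file: encode the sample to bytes,
# then decode it the way codecs.open(..., 'r', 'cp1252') would.
raw = u"1   aBiyuk\u00d9wa       <fs af='aBiyuk,unk,,,,,,'>".encode("cp1252")
line = raw.decode("cp1252")

result = match_line(line)
```

If result is None on your machine, the string reaching the regex is probably not the decoded Unicode text you expect (for example, raw bytes, or a line read with a different encoding).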
