I need to replace every non-ASCII character in a string with ‘_’.
For example,
Tannh‰user -> Tannh_user
- How can I do this with a regular expression in Python?
- Is there a better way to do this without using regular expressions?
Updated for Python 3:
First we create a byte string using encode(), which uses the UTF-8 codec by default. If you already have a byte string, skip this encode step. Then we convert it back to a “normal” string using the ascii codec.
This relies on the property of UTF-8 that every non-ASCII character is encoded as a sequence of bytes with values >= 0x80.
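A minimal Python 3 sketch of the encode/decode round trip described above. The function name replace_non_ascii and its placeholder parameter are illustrative, not from the original answer. The sketch assumes the ascii decode uses the 'replace' error handler, which emits U+FFFD for every byte it cannot decode.

```python
def replace_non_ascii(text: str, placeholder: str = "_") -> str:
    """Replace every non-ASCII byte of the UTF-8 form of `text` with `placeholder`."""
    raw = text.encode()                       # str -> bytes, UTF-8 by default
    decoded = raw.decode("ascii", "replace")  # each byte >= 0x80 becomes U+FFFD
    return decoded.replace("\ufffd", placeholder)


# U+2030 (the per-mille sign from the question) is three bytes in UTF-8,
# so it yields three placeholders rather than one.
print(replace_non_ascii("Tannh\u2030user"))  # Tannh___user
```

Note that this produces one placeholder per byte, not per character, so a single non-ASCII character can turn into two to four underscores.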
Original answer – for Python 2:
This can be done with the built-in str.decode method. (You get a unicode string back, so convert it to str if you need one.)
You can also convert unicode to str, so that each non-ASCII character is replaced by an ASCII one. The problem is that unicode.encode with replace translates non-ASCII characters into '?', so you cannot tell whether a question mark was already there before; see the solution from Ignacio Vazquez-Abrams.
Another way is to use ord() and check whether each character's value fits in the ASCII range (0-127). This works for unicode strings and for str in UTF-8, Latin and some other encodings.
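The ord() comparison above can be sketched in Python 3 as follows. The function names are hypothetical. Since the question also asks about regular expressions, a regex variant is included for comparison; it is an illustration under the same assumptions, not taken from the answer text.

```python
import re


def replace_non_ascii_chars(text: str, placeholder: str = "_") -> str:
    """Keep characters with code points 0-127; swap the rest for `placeholder`."""
    return "".join(ch if ord(ch) < 128 else placeholder for ch in text)


# Character class matching anything outside the ASCII range 0x00-0x7F.
_NON_ASCII = re.compile(r"[^\x00-\x7f]")


def replace_non_ascii_re(text: str, placeholder: str = "_") -> str:
    """Regex variant; the lambda avoids backslash processing in `placeholder`."""
    return _NON_ASCII.sub(lambda match: placeholder, text)


print(replace_non_ascii_chars("Tannh\u2030user"))  # Tannh_user
print(replace_non_ascii_re("Tannh\u2030user"))     # Tannh_user
```

On a Python 3 str both variants work per character, so each non-ASCII character becomes exactly one placeholder, and an existing '?' in the input is left untouched.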