I am looking to replace from a large document all high unicode characters, such

Question

Asked: 2026-05-14T22:51:27+00:00

I am looking to replace all high Unicode characters in a large document, such as accented Es, left and right quotes, etc., with “normal” counterparts in the low range, such as a regular ‘E’ and straight quotes. I need to perform this on a very large document rather often. I see an example of this in what I think might be Perl here: http://www.designmeme.com/mtplugins/lowdown.txt

Is there a fast way of doing this in Python without using s.replace(…).replace(…).replace(…)…? I’ve tried this on just a few characters to replace and the document stripping became really slow.

EDIT, my version of unutbu’s code that doesn’t seem to work:

# -*- coding: iso-8859-15 -*-
import unidecode
def ascii_map():
    data={}
    for num in range(256):
        h=num
        filename='x{num:02x}'.format(num=num)
        try:
            mod = __import__('unidecode.'+filename,
                             fromlist=True)
        except ImportError:
            pass
        else:
            for l,val in enumerate(mod.data):
                i=h<<8
                i+=l
                if i >= 0x80:
                    data[i]=unicode(val)
    return data

if __name__=='__main__':
    s = u'“fancy“fancy2'
    print(s.translate(ascii_map()))

1 Answer

Editorial Team · Answer 1 · 2026-05-14T22:51:28+00:00

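# -*- encoding: utf-8 -*-
import unicodedata

def shoehorn_unicode_into_ascii(s):
    # NFKD splits an accented letter into a base letter plus combining marks.
    # Encoding to ASCII with 'ignore' then drops the marks, along with any
    # other character that has no ASCII form. The result is a byte string.
    return unicodedata.normalize('NFKD', s).encode('ascii','ignore')

if __name__=='__main__':
    s = u"éèêàùçÇ"
    print(shoehorn_unicode_into_ascii(s))
    # eeeaucC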

Note, as @Mark Tolonen kindly points out, the method above removes some characters, such as ß‘’“”, because they have no decomposition into ASCII. If the above code drops characters that you want translated, you may have to use the string’s translate method to fix these cases manually. Another option is to use unidecode (see J.F. Sebastian’s answer).
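A rough sketch of that manual fix, assuming Python 3. The `EXTRA` table and the name `to_ascii` are made up for illustration and are not part of any library. Characters that NFKD cannot decompose are translated first, and the rest is normalized afterwards:

```python
import unicodedata

# Hypothetical hand-written table for characters that NFKD leaves untouched.
# Keys are code points, values are the ASCII replacements.
EXTRA = {
    0x00DF: "ss",  # sharp s
    0x2018: "'",   # left single quote
    0x2019: "'",   # right single quote
    0x201C: '"',   # left double quote
    0x201D: '"',   # right double quote
}

def to_ascii(text):
    # Apply the manual table first, then strip accents via NFKD.
    text = text.translate(EXTRA)
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(to_ascii("\u201cStra\u00dfe\u201d caf\u00e9"))
```

Anything missing from both the table and the NFKD decompositions is still dropped silently, so the table would need an entry for every such character you care about.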

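When you have a large unicode string, using its translate method will be much, much faster than chaining calls to the replace method, since translate processes the string in a single pass while every replace call scans it again.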
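To check this on your own data, a small comparison along the following lines may help. It is only a sketch, assuming Python 3; `TABLE`, `with_replace` and `with_translate` are illustrative names, and the five-entry table is far smaller than a realistic mapping. Timings depend on the interpreter version and the size of the table, and with only a handful of replacements the gap may be small or even reversed, so no expected numbers are shown:

```python
import timeit

# Illustrative replacements only; a real table would be much larger.
TABLE = {0x2018: "'", 0x2019: "'", 0x201C: '"', 0x201D: '"', 0x00C9: "E"}

def with_replace(text):
    # One full scan of the string per entry in the table.
    for codepoint, ascii_text in TABLE.items():
        text = text.replace(chr(codepoint), ascii_text)
    return text

def with_translate(text):
    # A single scan, whatever the size of the table.
    return text.translate(TABLE)

sample = "\u201cfancy\u201d \u2018quotes\u2019 and \u00c9 " * 1000

# Both approaches should produce the same text.
assert with_replace(sample) == with_translate(sample)

for func in (with_replace, with_translate):
    seconds = timeit.timeit(lambda: func(sample), number=20)
    print(func.__name__, seconds)
```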

Edit: unidecode has a more complete mapping of unicode codepoints to ascii.
However, unidecode.unidecode loops through the string character-by-character (in a Python loop), which is slower than using the translate method.

The following helper function uses unidecode's data files and the translate method to attain better speed, especially for long strings.

In my tests on 1-6 MB text files, using ascii_map is about 4-6 times faster than unidecode.unidecode.

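# -*- coding: utf-8 -*-
import unidecode
def ascii_map():
    # Build a {code point: ASCII replacement} dict for unicode.translate.
    # Build it once and reuse it when translating many strings.
    data={}
    for num in range(256):
        h=num
        # One unidecode data module per block of 256 code points: x00 ... xff
        filename='x{num:02x}'.format(num=num)
        try:
            mod = __import__('unidecode.'+filename,
                             fromlist=True)
        except ImportError:
            # No data module for this block; its characters stay unmapped.
            pass
        else:
            for l,val in enumerate(mod.data):
                # Code point = (block number << 8) + offset within the block.
                i=h<<8
                i+=l
                # Leave plain ASCII (below 0x80) untouched.
                if i >= 0x80:
                    data[i]=unicode(val)
    return data

if __name__=='__main__':
    s = u"éèêàùçÇ"
    print(s.translate(ascii_map()))
    # eeeaucC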

Edit2: Rhubarb, if # -*- encoding: utf-8 -*- is causing a SyntaxError, try
# -*- encoding: cp1252 -*-. What encoding to declare depends on what encoding your text editor uses to save the file. Linux tends to use utf-8, and (it seems) Windows tends to use cp1252.
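One way to sidestep the question of which encoding to declare, offered here only as a sketch, is to keep the source file pure ASCII and write the non-ASCII characters as \u escapes. The variable names and the three-entry table below are illustrative:

```python
# The source contains only ASCII characters, so the result does not depend
# on the encoding the editor used to save the file.
s = u"\u201cfancy\u201d \u00e9"  # curly-quoted word, then e with acute accent

# Illustrative table: code point -> ASCII replacement.
table = {0x201C: u'"', 0x201D: u'"', 0x00E9: u"e"}

print(s.translate(table))
```

The trade-off is readability: the escapes are harder to scan than the literal characters, so a comment naming each one helps.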
