I always work on Arabic text files and to avoid problems with encoding I

Question

Asked: June 13, 2026 (2026-06-13T01:56:44+00:00)

I always work on Arabic text files, and to avoid encoding problems I transliterate the Arabic characters into Latin characters according to Buckwalter’s scheme (http://www.qamus.org/transliteration.htm).

Here is my code to do so, but it is very slow, even with small files of around 400 KB. Any ideas for making it faster?

Thanks

    def transliterate(file):
        # Read the Arabic text, decoding it explicitly as UTF-8
        with open(file, encoding="utf-8") as f:
            data = f.read()
        # Buckwalter symbol -> Arabic character
        buckArab = {"'":"ء", "|":"آ", ">":"أ", "&":"ؤ", "<":"إ", "}":"ئ", "A":"ا", "b":"ب", "p":"ة", "t":"ت", "v":"ث", "g":"ج", "H":"ح", "x":"خ", "d":"د", "*":"ذ", "r":"ر", "z":"ز", "s":"س", "$":"ش", "S":"ص", "D":"ض", "T":"ط", "Z":"ظ", "E":"ع", "G":"غ", "_":"ـ", "f":"ف", "q":"ق", "k":"ك", "l":"ل", "m":"م", "n":"ن", "h":"ه", "w":"و", "Y":"ى", "y":"ي", "F":"ً", "N":"ٌ", "K":"ٍ", "~":"ّ", "o":"ْ", "u":"ُ", "a":"َ", "i":"ِ"}
        # The outer loop reruns the whole set of replacements once per character of the input
        for char in data:
            for k, v in buckArab.items():
                # Replace each Arabic character (v) with its Buckwalter symbol (k)
                data = data.replace(v, k)
        return data

1 Answer

Editorial Team · Answer 1 · 2026-06-13T01:56:45+00:00

Edit Oct 2021

A Python package that does this (and a lot more) was recently released, so anyone reading this post now should ignore all the other answers and just use Camel Tools. (Nizar Habash and his team at NYU Abu Dhabi are awesome for developing this and making it so accessible!)

::python
from camel_tools.utils.charmap import CharMapper
sentence = "ذهبت إلى المكتبة."
print(sentence)

ar2bw = CharMapper.builtin_mapper('ar2bw')

sent_bw = ar2bw(sentence)
print(sent_bw)

Output:

ذهبت إلى المكتبة.
*hbt <lY Almktbp.

You can find installation instructions and tutorials here: https://github.com/CAMeL-Lab/camel_tools

Old answer
Incidentally, someone already wrote a script that does this, so you might want to check that out before spending too much time on your own:
buckwalter2unicode.py

It probably does more than what you need, but you don’t have to use all of it: I copied just the two dictionaries and the transliterateString function (with a few tweaks, I think), and use that on my site.

Edit:
The script above is what I had been using, but I just discovered that it is much slower than using replace, especially for a large corpus. This is the code I finally ended up with, which seems to be simpler and faster (it references a dictionary buck2uni that maps Buckwalter symbols to Unicode characters):

def transString(string, reverse=0):
    '''Given a Unicode string, transliterate into Buckwalter. To go from
    Buckwalter back to Unicode, set reverse=1'''

    for k, v in buck2uni.items():
        if not reverse:
            string = string.replace(v, k)
        else:
            string = string.replace(k, v)

    return string
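
One possible further step, which is not part of the original answer: every Buckwalter symbol and every Arabic character in the mapping is a single character, so Python's built-in str.translate can apply all the substitutions in one pass over the text instead of one replace call per dictionary entry. Below is a minimal sketch under that assumption. The buck2uni dictionary here is only a partial stand-in for the full table, and the names BUCK2UNI_TABLE, UNI2BUCK_TABLE and trans_string_fast are hypothetical.

```python
# Partial Buckwalter -> Arabic table, for illustration only.
# A real table would carry every entry of the scheme.
buck2uni = {
    "A": "ا", "b": "ب", "p": "ة", "t": "ت", "*": "ذ", "k": "ك",
    "l": "ل", "m": "م", "h": "ه", "Y": "ى", "<": "إ",
}

# Build both translation tables once, rather than on every call.
BUCK2UNI_TABLE = str.maketrans(buck2uni)
UNI2BUCK_TABLE = str.maketrans({v: k for k, v in buck2uni.items()})


def trans_string_fast(string, reverse=False):
    """Arabic -> Buckwalter by default; Buckwalter -> Arabic if reverse is true."""
    table = BUCK2UNI_TABLE if reverse else UNI2BUCK_TABLE
    # Characters missing from the table (spaces, punctuation) pass through unchanged.
    return string.translate(table)


print(trans_string_fast("ذهبت إلى المكتبة."))  # *hbt <lY Almktbp.
```

This is likely to help most on large inputs, but it is worth timing both versions on your own corpus before switching.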
