I’m working with some text that has a mix of languages, which I’ve already

Question

Asked: June 17, 2026

I’m working with some text that has a mix of languages. I’ve already done some processing on it, and it is now in the form of a list of single characters (called “letters”). I can tell which language each character belongs to by simply testing whether it has case (with a small function called “test_lang”). I want to insert a space between characters of different types, so that no word is a mix of character types. At the same time, I want to insert a space between words and punctuation (which I defined in a list called “punc”). I wrote a script that does this in a straightforward way that made sense to me (below), but it is apparently the wrong way to do it, because it is incredibly slow.

Can anyone tell me what the better way to do this is?

# Add a space between Arabic/foreign mixes, and between words and punc
cleaned = ""
i = 0
while i <= len(letters)-2: #range excludes last letter to avoid Out of Range error for i+1
    cleaned += letters[i]
    # words that have case are Latin; otherwise Arabic
    if test_lang(letters[i]) != test_lang(letters[i+1]):
        cleaned += " "
    if letters[i] in punc or letters[i+1] in punc:
        cleaned += " "
    i += 1
cleaned += letters[len(letters)-1] # add in last letter

1 Answer

Editorial Team · Answer 1 · 2026-06-17T19:48:19+00:00

There are a few things going on here:

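You call test_lang() twice on almost every letter in the list (once as letters[i+1] and again as letters[i] on the next iteration). This is probably the main reason the loop is slow.
Building a string with repeated += in a loop isn’t very efficient in Python, because each concatenation may copy everything built so far. Collect the pieces in a list or generator instead and combine them once with str.join() (most likely ''.join()).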

Here is the approach I would take, using itertools.groupby():

from itertools import groupby

def keyfunc(letter):
    # letters with the same (language, is-punctuation) pair end up in the same group
    return (test_lang(letter), letter in punc)

cleaned = ' '.join(''.join(g) for k, g in groupby(letters, keyfunc))

groupby() collects consecutive letters that share the same key, which here is the pair (language, whether the letter is punctuation). ''.join(g) converts each group back into a string, and ' '.join() then combines those strings with a single space between them. Each letter is passed to test_lang() exactly once.

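Also, as noted in comments by DSM, make sure that punc is a set rather than a list. The test letter in punc scans a list element by element, while a set lookup is a hash lookup that takes roughly constant time.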
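A minimal sketch of that conversion, assuming punc starts out as a list (the contents below are a placeholder taken from string.punctuation, not the asker's actual list):

```python
import string

punc_list = list(string.punctuation)  # assumed stand-in for the asker's list
punc = frozenset(punc_list)           # build once, before the loop or groupby call

# Membership tests look the same but no longer scan the whole collection
print("!" in punc)
print("a" in punc)
```

A frozenset is used here only because the collection never changes; a plain set() behaves the same for lookups.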
