What are common approaches for translating certain words (or expressions) inside a given text when the text must be reconstructed afterwards (with punctuation and everything)?
The translation comes from a lookup table and covers words, collocations, slang like L33t and CUL8R, and emoticons like :-).
Simple string search-and-replace is not enough, since it can replace part of a longer word (with cat > dog, caterpillar would wrongly become dogerpillar).
Assume the following input:
s = "dogbert, started a dilbert dilbertion proces cat-bert :-)"
After translation, I should receive something like:
result = "anna, started a george dilbertion process cat-bert smiley"
I can't simply tokenize, since I lose punctuation and word positions.
Regular expressions work for normal words, but don't catch special expressions like the smiley :-).
re.sub(r'\bword\b','translation',s) ==> translation
re.sub(r'\b:-\)\b','smiley',s) ==> :-)
For now I'm using the regex mentioned above, plus a simple replace for the non-alphanumeric words, but it's far from bulletproof.
(P.S. I'm using Python.)
I had a similar problem with standard emoticons that had to be replaced with values. Here is a list of emoticons. I kept them in a plain text file (so that I can add to or delete from it as and when required), with each emoticon and its value separated by a tab.
Then read it into a dictionary
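The answer's own code did not survive here, so the following is only a minimal sketch of that step, assuming one `emoticon<TAB>value` pair per line. The function name `read_table` and the sample entries are hypothetical.

```python
import io

def read_table(lines):
    """Build a dict from lines of the form 'key<TAB>value'."""
    table = {}
    for line in lines:
        line = line.rstrip("\n")
        if "\t" not in line:
            continue  # skip blank or malformed lines
        key, value = line.split("\t", 1)
        table[key] = value
    return table

# Hypothetical sample standing in for the emoticons file.
sample = io.StringIO(":-)\tsmiley\n:-(\tsad\n;-)\twink\n")
emoticons = read_table(sample)
```

With a real file, the same function could be called on an open file object, for example `with open(path, encoding="utf-8") as f: emoticons = read_table(f)`.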
Then a lookup function
Call the function with
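The lookup function and its call are missing from the text, so this is a guess at the general shape rather than the original code. It assumes emoticons are separated from other text by whitespace; the name `lookup` and the small table are hypothetical.

```python
import re

# Hypothetical table; in practice it would be loaded from the emoticons file.
emoticons = {":-)": "smiley", ":-(": "sad"}

def lookup(text, table):
    """Replace whitespace-delimited tokens found in table, keeping the spacing."""
    # Splitting on a captured group keeps the whitespace runs in the result list.
    parts = re.split(r"(\s+)", text)
    return "".join(table.get(part, part) for part in parts)

result = lookup("started a proces cat-bert :-)", emoticons)
```

Because the whitespace runs are kept and joined back, the original spacing is preserved. An emoticon glued to other characters (such as `hi:-)`) would not be replaced by this approach.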
As for L33t-speak I have a separate file slangs.txt, which looks like
A similar function reads it into a dictionary, slangs{}, and a similar function replaces the slang terms.
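From the Python standard library, re.escape() escapes the special characters in a string, so an emoticon such as :-) can be used literally inside a pattern.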
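One way to put re.escape() to work on the example from the question is sketched below. The pattern `\b:-\)\b` never matches because `\b` only matches between a word character and a non-word character, and `:-)` contains no word characters. The sketch swaps `\b` for the lookarounds `(?<!\w)` and `(?!\w)`. The table contents and the names `build_pattern` and `translate` are assumptions made for illustration.

```python
import re

# Hypothetical lookup table mirroring the example in the question.
# "proces" -> "process" is assumed because the expected output shows it corrected.
table = {
    "dogbert": "anna",
    "dilbert": "george",
    "proces": "process",
    ":-)": "smiley",
}

def build_pattern(keys):
    # Longest keys first, so a longer entry wins over a shorter one it starts with.
    ordered = sorted(keys, key=len, reverse=True)
    alternatives = "|".join(re.escape(k) for k in ordered)
    # (?<!\w) and (?!\w) require that no word character touches the match.
    return re.compile(r"(?<!\w)(?:" + alternatives + r")(?!\w)")

def translate(text, table):
    pattern = build_pattern(table)
    return pattern.sub(lambda m: table[m.group(0)], text)

s = "dogbert, started a dilbert dilbertion proces cat-bert :-)"
result = translate(s, table)
```

A single compiled alternation makes one pass over the text, so a replacement is never re-translated by a later rule, and multi-word collocations can be added as keys without changing the code.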
Based on your needs, you might want to use re.findall().
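As a rough illustration of the re.findall() route, the text can be split into tokens that together cover every character, so joining them back reconstructs the text exactly. The token pattern, the table, and the name `translate_tokens` are assumptions, not something given in the thread.

```python
import re

# Hypothetical lookup table.
table = {"dogbert": "anna", ":-)": "smiley"}

# Each token is a run of word characters, a run of whitespace,
# or a run of other symbols, so no character is dropped.
TOKEN = r"\w+|\s+|[^\w\s]+"

def translate_tokens(text, table):
    tokens = re.findall(TOKEN, text)
    return "".join(table.get(tok, tok) for tok in tokens)

text = "dogbert, started :-)"
result = translate_tokens(text, table)
```

This keeps punctuation and positions, but it has limits: a hyphenated word such as `cat-bert` is split into three tokens, an emoticon directly followed by other punctuation forms one longer token that is not in the table, and multi-word collocations are not matched.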