import http.client, urllib.request, urllib.parse, urllib.error
def translate(IN, OUT, text):
    text = urllib.parse.quote(text)
    conn = http.client.HTTPConnection("translate.google.com.tr")
    conn.request("GET", "/translate_a/t?client=t&text="+text+"&hl="+IN+"&tl="+OUT)
    res = conn.getresponse().read().decode("cp1254",'replace')
    print(res)
    b1 = res.split("],[")
    b2 = b1[0].strip('[]')
    b3 = b2.strip('","')
    b4 = b3.split('","')
    return b4[0]
string = input("Turkish >>> English: ")
result = translate("tr","en",string)
print(string,">>>",result)
I'm trying to write a script that translates Turkish into English. The script works well as long as I don't type any Turkish-specific characters. For example, these Turkish words are translated successfully: kalemlik, deneme, bilgisayar, okyanus. But if the word I type contains a non-ASCII character, the translation fails. These are the Turkish characters in question: ıİğĞüÜşŞöÖçÇ, and these are some Turkish words that contain them: programcı, şarkı, çalışma, örnek, İnsan, dağ, üs. By the way, cp1254 is a valid encoding for Turkish characters.
What can I do to solve this problem? Note that it isn't specific to Turkish.
Examples:
Turkish >>> English: okyanus
[[["ocean","okyanus","",""]],[["isim",["ocean","brine","the deep","main","drink"],[["ocean",["okyanus","derya"]],["brine",["tuzlu su","salamura","deniz","okyanus"]],["the deep",["deniz","okyanus","enginler"]],["main",["ana boru","deniz","kuvvet","zor","okyanus","horoz dövüşü"]],["drink",["içmek","içki","içecek","içki içmek","deniz","okyanus"]]]],["sıfat",["oceanic"],[["oceanic",["okyanus","okyanusta bulunan","okyanus gibi"]]]]],"tr",,[["ocean",[5],1,0,999,0,1,0]],[["okyanus",4,,,""],["okyanus",5,[["ocean",999,1,0],["oceanic",0,1,0],["the ocean",0,1,0],["oceans",0,1,0]],[[0,7]],"okyanus"]],,,[["tr"]],2]
okyanus >>> ocean
That was successful.
Turkish >>> English: dağ
[[["daÄ\u0178","daÄ\u0178","",""]],,"tr",,[["daÄ\u0178",[5],1,0,1000,0,1,0]],[["daÄ\u0178",5,[["daÄ\u0178",1000,1,0]],[[0,4]],"daÄ\u0178"]],,,[["tr"]],8]
dağ >>> daÄ\u0178
Fail!
Looking more closely at this, you have a bunch of errors and incorrect assumptions. Like the claim that "cp1254 is valid encoding for Turkish characters".
Yes, that's true, but there are others, like ISO 8859-9, which is an actual international standard rather than something used only by Microsoft. And of course UTF-8/16/32.
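As a quick illustration of the point about encodings (a sketch using only Python's standard codecs, not part of the original script), the same character ğ comes out as a single byte in CP1254 and ISO 8859-9 but as two bytes in UTF-8:

```python
# Encode the same Turkish character with three different codecs.
ch = "ğ"
encoded = {name: ch.encode(name) for name in ("cp1254", "iso8859_9", "utf-8")}

for name, raw in encoded.items():
    # Print the codec name, the raw bytes, and how many bytes were produced.
    print(name, raw, len(raw))
```

So "valid for Turkish" does not tell you which encoding is actually on the wire; the byte sequences differ.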
Also, not only are you using CP1254 without checking whether that's really the encoding Google uses (it is not), you don't send the word in the right encoding either. I missed that on my first read-through, because your question is focused on what you get back. It wasn't until the second read-through that I realized your main problem is actually that the translation FAILS when you have a non-ASCII character.
You are also sending one character (ğ) and getting two back, which is why I assumed UTF-8 was the problem, and it is, but not in the way I first thought.
Since you send it through an HTTP GET, you have to encode the text in the URL, and that means you basically have to use UTF-8. But your GET doesn't say that. There's nothing in your request that says you are using UTF-8. Now, you should really set some header to do this, but that's complicated, and Google Translate allows you to cheat. You can pass in the ie parameter, saying what input encoding you have. If you don't do that, it will likely fall back to ISO-8859-1, which is standard in these cases. That will take the two bytes you send for ğ and assume they are two different characters, which is why you get two characters back.
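The two-characters-for-one effect can be reproduced locally without any network access. This is only a sketch: the choice of cp1254 for the misreading is an assumption made for the demo (it happens to reproduce the Ä\u0178 seen in the failed output; a strict ISO-8859-1 decode would give a control character for the second byte instead).

```python
import urllib.parse

word = "dağ"

# quote() percent-encodes the UTF-8 bytes of the text by default.
quoted = urllib.parse.quote(word)

# Recover the raw bytes that actually travel in the URL.
raw = urllib.parse.unquote_to_bytes(quoted)

# Reading those bytes with a single-byte code page splits ğ into two characters.
misread = raw.decode("cp1254")

# Reading them as UTF-8 gives back the original word.
roundtrip = raw.decode("utf-8")

print(quoted, ascii(misread), roundtrip)
```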
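Then lastly, you should look at the headers to see what encoding Google uses for the response. But here you can also cheat, and tell Google what encoding to use, with the oe parameter.
So if you change the conn.request line so that the query string also carries the ie and oe parameters, both set to UTF-8 (building the URL in its own variable first, because seriously, you don't have to stick everything into one long line), and change the decode call so that it decodes the response as UTF-8 instead of cp1254, it will work.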
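A minimal sketch of the two changes, assuming ie=UTF-8 and oe=UTF-8 as the parameter values. The host and path are the ones from the question; this is an undocumented endpoint and may not respond this way anymore. The helper name build_path is hypothetical, and falling back to the charset declared in the response headers is an addition, not something the original script did.

```python
import http.client
import urllib.parse


def build_path(src, dst, text):
    """Build the request path, declaring input (ie) and output (oe) encodings."""
    # quote() percent-encodes the UTF-8 bytes of the text.
    return "/translate_a/t?client=t&ie=UTF-8&oe=UTF-8&text={}&hl={}&tl={}".format(
        urllib.parse.quote(text), src, dst
    )


def translate(src, dst, text):
    conn = http.client.HTTPConnection("translate.google.com.tr")
    conn.request("GET", build_path(src, dst, text))
    response = conn.getresponse()
    # Prefer the charset the server declares; fall back to UTF-8, which is
    # what the request asked for via oe.
    charset = response.headers.get_content_charset() or "utf-8"
    res = response.read().decode(charset, "replace")
    conn.close()
    # Same ad hoc parsing as the question's script: take the first quoted string.
    first_group = res.split("],[")[0].strip("[]")
    return first_group.strip('","').split('","')[0]

# Usage (needs network access and a working endpoint):
#     print(translate("tr", "en", "dağ"))
```

Keeping the URL construction in its own function also makes it easy to print or test the path without sending a request.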