More specifically, I’m trying to check if a given string (a sentence) is in Turkish.
I can check whether the string contains Turkish-specific characters such as Ç, Ş, Ü, Ö, Ğ, etc. However, that isn’t very reliable, since those characters might have been converted to C, S, U, O, G before I receive the string.
Another method is to take the 100 most commonly used Turkish words and check whether the sentence includes any of them. I can combine these two methods and use a point system.
What do you think is the most efficient way to solve my problem in Python?
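Here is a rough sketch of the combined scoring idea I have in mind (the character set is real Turkish orthography, but the word list and weights are just placeholders; a real version would use the 100 most common words):

```python
# Sketch: score a sentence as "probably Turkish" using two weak signals:
# Turkish-specific characters and a small list of common Turkish words.
TURKISH_CHARS = set("çşüöğıÇŞÜÖĞİ")
COMMON_WORDS = {"bir", "ve", "bu", "için", "ne", "ben", "çok", "ama"}  # placeholder list

def turkish_score(sentence):
    words = sentence.lower().split()
    char_hits = sum(1 for ch in sentence if ch in TURKISH_CHARS)
    word_hits = sum(1 for w in words if w.strip(".,!?") in COMMON_WORDS)
    # Weight word matches higher, since the special characters
    # may have been stripped before the string reaches us.
    return char_hits + 3 * word_hits

def looks_turkish(sentence, threshold=3):
    return turkish_score(sentence) >= threshold
```

The threshold would need tuning against real input; word matches carry most of the weight precisely because the diacritics are unreliable.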
Related question: (human) Language of a document (Perl, Google Translation API)
One option would be to use a Bayesian Classifier such as Reverend. The Reverend homepage gives this suggestion for a naive language detector:
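Reverend exposes a `Bayes` class (in `reverend.thomas`) with `train(pool_name, text)` and `guess(text)` methods: you train one pool per language on sample text, then `guess` ranks the pools for a new string. Since the homepage snippet isn't reproduced here, below is a self-contained sketch of the same idea with no external dependency; `TinyBayes` is my own minimal stand-in with a Reverend-style interface, not Reverend's implementation:

```python
import math
from collections import Counter, defaultdict

class TinyBayes:
    """Minimal word-level naive Bayes, in the spirit of Reverend's Bayes class."""
    def __init__(self):
        self.counts = defaultdict(Counter)   # language -> per-word counts
        self.totals = Counter()              # language -> total word count

    def train(self, language, text):
        words = text.lower().split()
        self.counts[language].update(words)
        self.totals[language] += len(words)

    def guess(self, text):
        words = text.lower().split()
        scores = {}
        for lang in self.counts:
            total = self.totals[lang]
            vocab = len(self.counts[lang])
            # Sum of log P(word | lang) with add-one smoothing.
            scores[lang] = sum(
                math.log((self.counts[lang][w] + 1) / (total + vocab + 1))
                for w in words
            )
        # Return (language, score) pairs, best first, like Reverend's guess().
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

guesser = TinyBayes()
guesser.train("turkish", "bir ve bu için ne ben de o da çok ama")
guesser.train("english", "the a and of to in is it you that he was for")
best_language = guesser.guess("bu çok güzel ama")[0][0]
```

With only a handful of training words this is fragile, but the structure is the same as the Reverend approach: whichever language pool assigns the sentence the highest score wins.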
Training with more complex token sets would strengthen the results. For more information on Bayesian classification, see here and here.
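For instance, training on overlapping character trigrams instead of whole words captures spelling patterns (such as Turkish suffixes like "ler", "lar", "dır") and copes better with words not seen during training. A sketch of such a tokenizer (the function name is mine):

```python
def char_ngrams(text, n=3):
    """Split text into overlapping character n-grams, padding word boundaries."""
    padded = " " + " ".join(text.lower().split()) + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

# These tokens would replace whole words as the classifier's features.
```

The leading/trailing spaces mean word-initial and word-final grams (e.g. `" ev"`, `"er "`) are distinct features, which is where suffix-heavy languages like Turkish stand out.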