I’m interested in writing a PHP script (language-agnostic suggestions are welcome) that would transliterate a word or sentence written phonetically in English letters into the script of another language. Since the input is written phonetically (i.e. by ear), I’d have to deal with variant spellings of the same word.
Assume that no standard romanization exists for the language (unlike Chinese, for instance, which has systems such as Simplified Wade).
Does anyone have any advice on where I could start?
EDIT: I’m doing this purely for educational purposes. I was initially under the impression that, in order to figure out the connection between variant spellings (which could be found in a corpus of IM messages or Facebook posts written in the romanized form of the language), you’d need some sort of machine learning tool. I’d like to know whether I was on the right track, and I’d like some help figuring out what I should look into next to get this working (for instance: which machine learning tool should I look into?).
I know with Japanese at least, you have a set number of letter combinations.
So, you could create a matching array that maps each romanized letter combination to its kana.
Of course, you would continue on through the rest of the syllables, making sure you don’t match ‘su’ when it should be ‘tsu’.
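A minimal sketch of that idea, written in Python rather than PHP since language-agnostic suggestions were welcome. The table `KANA` and the function `to_hiragana` are hypothetical names, and the table is deliberately incomplete (only a few rows of hiragana). Trying the longest key first at each position is one way to keep ‘su’ from matching inside ‘tsu’:

```python
# Hypothetical, deliberately incomplete romaji -> hiragana table.
KANA = {
    "a": "あ", "i": "い", "u": "う", "e": "え", "o": "お",
    "ka": "か", "ki": "き", "ku": "く", "ke": "け", "ko": "こ",
    "sa": "さ", "shi": "し", "su": "す", "se": "せ", "so": "そ",
    "ta": "た", "chi": "ち", "tsu": "つ", "te": "て", "to": "と",
}
MAX_KEY_LEN = max(len(key) for key in KANA)


def to_hiragana(romaji):
    """Greedy longest-match transliteration; unknown characters pass through."""
    text = romaji.lower()
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate first so 'tsu' wins over 'su'.
        for size in range(min(MAX_KEY_LEN, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in KANA:
                out.append(KANA[chunk])
                i += size
                break
        else:
            # No key matched at this position: keep the character as-is.
            out.append(text[i])
            i += 1
    return "".join(out)


print(to_hiragana("tsukue"))  # つくえ
```

The same structure carries over to a PHP associative array with a loop over `substr` slices. A complete table would also need rules for ‘n’, long vowels, and doubled consonants, which this sketch ignores.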
This would only be a starting point, of course.
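For the variant spellings mentioned in the question, a simple baseline to try before any machine learning is fuzzy matching against a list of spellings you already know. The sketch below uses `difflib.get_close_matches` from the Python standard library; the function name `canonical_spelling`, the word list, and the `0.75` cutoff are assumptions chosen for illustration, not tuned values:

```python
import difflib


def canonical_spelling(word, known_words, cutoff=0.75):
    """Return the closest known spelling, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(
        word.lower(), known_words, n=1, cutoff=cutoff
    )
    return matches[0] if matches else None


# Hypothetical list of spellings the transliteration table can handle.
known = ["arigatou", "sayonara", "konnichiwa"]

print(canonical_spelling("arigato", known))    # arigatou
print(canonical_spelling("konichiwa", known))  # konnichiwa
print(canonical_spelling("xyz", known))        # None
```

This only measures string similarity, so it will not catch variants that sound alike but are spelled very differently. That gap is where a model trained on a corpus of romanized text might help.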
Machine learning is probably most practical with Chinese…but here’s a rough start to hiragana: https://gist.github.com/1154969