I’m working on a system which allows imported files to be localized into other languages.
This is mostly a private project to get the hang of MVC3, EntityFramework, LINQ, and so on, so I like doing some crazy things to spice up the end result. One of those things would be the recognition of similar strings.
Imagine you have the following list of strings – borrowed from a game I’ve worked with in the past:
- Megabeth: Holy Roller Uniform – Includes Head, Torso, and Legs
- Megabeth: Holy Roller Uniform Head
- Megabeth: Holy Roller Uniform Legs
- Megabeth: Holy Roller Uniform Torso
- Megabeth: PAX East 2012 Uniform – Includes Head, Torso, and Legs
- Megabeth: PAX East 2012 Uniform Head
- Megabeth: PAX East 2012 Uniform Legs
- Megabeth: PAX East 2012 Uniform Torso
As you can see, the last 4 strings share a lot of fragments with the first 4, so once users have translated the first 4, much of that work could be reused. In this case the shared fragments are:
- Megabeth
- Uniform
- Includes Head, Torso, and Legs
- Head
- Legs
- Torso
Assume the first 4 strings are indeed already translated. When a user selects the 5th string from the list, what kind of algorithm or technique can I use to show the user the 1st string (and potentially others) under a sub-header of “Similar strings”?
Edit – A little comment on the Levenshtein Distance:
I’m currently targeting 10k strings in the database. Levenshtein Distance compares one pair of strings at a time, so in this case there are 10k x (10k - 1) possible combinations. How would I approach this in a feasible way? Is there a better solution than this particular algorithm?
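You could look into the Levenshtein Distance, which counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. Two identical strings have a distance of zero, and pairs whose distance falls below a threshold you choose can be considered similar.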
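As a rough illustration of the idea, here is a minimal sketch in Python (chosen only for brevity; the same logic ports to C#). The function names `levenshtein` and `similar_strings`, the `max_ratio` parameter, and the default threshold of 0.5 are assumptions made for this example, not part of any library. The distance is divided by the length of the longer string so that one threshold works for both short and long strings.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b, using two-row dynamic programming."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def similar_strings(selected: str, candidates: list[str], max_ratio: float = 0.5) -> list[str]:
    """Return candidates close to `selected`, closest first.

    `max_ratio` is a hypothetical tuning knob: the largest allowed value of
    distance / length-of-longer-string. It needs tuning on real data.
    """
    scored = []
    for candidate in candidates:
        if candidate == selected:
            continue  # do not suggest the string itself
        longest = max(len(selected), len(candidate)) or 1
        ratio = levenshtein(selected, candidate) / longest
        if ratio <= max_ratio:
            scored.append((ratio, candidate))
    scored.sort()
    return [candidate for _, candidate in scored]
```

Calling `similar_strings("Megabeth: PAX East 2012 Uniform Head", translated_strings)` only compares the selected string against each stored string once, so with 10k strings that is roughly 10k comparisons per lookup rather than every pair up front.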
There’s a C# implementation, amongst other languages, on Rosetta Code.