The Archive Base

Editorial Team
Asked: May 25, 2026

I’m looking for an efficient data structure/algorithm for storing and searching transliteration-based word lookup


I’m looking for an efficient data structure/algorithm for storing and searching words for transliteration-based lookup (as Google does: http://www.google.com/transliterate/, although I’m not trying to use the Google transliteration API). Unfortunately, the natural language I’m working with doesn’t have a Soundex implementation, so I’m on my own.

For an open source project, I’m currently using plain arrays to store the word list and dynamically generating a regular expression (based on user input) to match against it. It works fine, but regular expressions are more powerful, and more resource-intensive, than I need. For example, I’m afraid this solution will drain too much battery if I port it to handheld devices, since searching thousands of words with a regular expression is too costly.

There must be a better way to accomplish this for complex languages. How does the Pinyin input method work, for example? Any suggestions on where to start?

Thanks in advance.


Edit: If I understand correctly, this is what @Dialecticus suggested:

I want to transliterate from Language1, which has 3 characters (a, b, c), to Language2, which has 6 characters (p, q, r, x, y, z). Because the two languages differ in their number of characters and in their phones, it is often not possible to define a one-to-one mapping.

Let’s assume that, phonetically, this is our associative array/transliteration table:

a -> p, q
b -> r
c -> x, y, z

We also have a list of valid words for Language2, stored in a plain array:

...
px
qy
...

If the user types ac, the possible combinations after transliteration (step 1) are px, py, pz, qx, qy, qz. In step 2 we have to search the valid word list and eliminate every one of them except px and qy.
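The two steps described above could be sketched in Java roughly as follows. This is only an illustration under assumptions: the class name `CandidateFilter`, the method `lookup`, and the choice to store each table entry as a string of target characters are all made up for the example, and the valid word list is held in a `HashSet` instead of a plain array so that step 2 is a constant-time membership test.

```java
import java.util.*;

class CandidateFilter {
    // Step 1: expand every input character into its possible target characters.
    // Step 2: keep only the candidates that appear in the valid word set.
    // The table maps a source character to a string of target characters (an assumption).
    static List<String> lookup(String input, Map<Character, String> table, Set<String> validWords) {
        List<String> candidates = new ArrayList<>();
        candidates.add("");
        for (char c : input.toCharArray()) {
            String options = table.get(c);
            if (options == null) {
                return new ArrayList<>(); // input character not in the table
            }
            List<String> next = new ArrayList<>();
            for (String prefix : candidates) {
                for (char o : options.toCharArray()) {
                    next.add(prefix + o);
                }
            }
            candidates = next;
        }
        List<String> result = new ArrayList<>();
        for (String candidate : candidates) {
            if (validWords.contains(candidate)) {
                result.add(candidate);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<Character, String> table = new HashMap<>();
        table.put('a', "pq");
        table.put('b', "r");
        table.put('c', "xyz");
        Set<String> validWords = new HashSet<>(Arrays.asList("px", "qy"));
        System.out.println(lookup("ac", table, validWords)); // prints [px, qy]
    }
}
```

Note that the number of candidates grows multiplicatively with the input length, which is the main weakness of this approach for longer words.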


What I’m doing currently is not that different from the above approach. Instead of generating the possible combinations from the transliteration table, I’m building a regular expression [pq][xyz] and matching it against my valid word list, which gives the output px and qy.
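For reference, the regular expression approach could look something like this in Java. It is a sketch, not the project's actual code: the names `RegexLookup`, `buildPattern` and `match` are hypothetical, and it assumes the target characters in the table are plain letters with no special meaning inside a character class.

```java
import java.util.*;
import java.util.regex.Pattern;

class RegexLookup {
    // Builds one character class per input character, e.g. "ac" -> [pq][xyz].
    // Assumes the table values contain no regex metacharacters.
    static Pattern buildPattern(String input, Map<Character, String> table) {
        StringBuilder sb = new StringBuilder();
        for (char c : input.toCharArray()) {
            String options = table.get(c);
            if (options == null) {
                return null; // input character not in the table
            }
            sb.append('[').append(options).append(']');
        }
        return Pattern.compile(sb.toString());
    }

    // Scans the whole word list, so the cost grows with the number of valid words.
    static List<String> match(String input, Map<Character, String> table, List<String> validWords) {
        List<String> result = new ArrayList<>();
        Pattern pattern = buildPattern(input, table);
        if (pattern == null) {
            return result;
        }
        for (String word : validWords) {
            if (pattern.matcher(word).matches()) {
                result.add(word);
            }
        }
        return result;
    }
}
```

Every lookup here walks the full word list, which is the cost the question is trying to avoid.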

I’m eager to know if there is any better method than that.

1 Answer

  1. Editorial Team
    Added an answer on May 25, 2026 at 7:32 pm

    From what I understand, you have an input string S in an alphabet (let’s call it A1) and you want to convert it to the string S’ that is its equivalent in another alphabet A2. More precisely, if I understand correctly, you want to generate a list [S’1, S’2, …, S’n] of output strings that might be equivalent to S.

    One approach that comes to mind is, for each word in the list of valid words in A2, to generate the list of strings in A1 that map to it. Using the example in your edit, we have

    px->ac
    qy->ac
    pr->ab
    

    (I have added an extra valid word pr for clarity)
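One way to produce this reverse table is to invert the transliteration table and rewrite each valid word character by character. The sketch below is an assumption about how that might be done, not part of the original suggestion; `ReverseKeys`, `invert` and `sourceKey` are invented names. It also assumes every target character belongs to exactly one source character, as in the example. If a target character could come from several source characters, each word would yield several keys instead of one.

```java
import java.util.*;

class ReverseKeys {
    // Turns a source -> targets table (a -> "pq") into target -> source (p -> a, q -> a).
    static Map<Character, Character> invert(Map<Character, String> table) {
        Map<Character, Character> reverse = new HashMap<>();
        for (Map.Entry<Character, String> entry : table.entrySet()) {
            for (char target : entry.getValue().toCharArray()) {
                reverse.put(target, entry.getKey());
            }
        }
        return reverse;
    }

    // Rewrites a valid target word as the source string a user would type for it.
    // Returns null if the word contains a character that is not in the table.
    static String sourceKey(String word, Map<Character, Character> reverse) {
        StringBuilder sb = new StringBuilder();
        for (char target : word.toCharArray()) {
            Character source = reverse.get(target);
            if (source == null) {
                return null;
            }
            sb.append(source);
        }
        return sb.toString();
    }
}
```

With the example table, `sourceKey` maps px and qy to ac, and pr to ab.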

    Now that we know which sequences of input symbols map to a valid word, we can use this table to build a trie.

    Each node will hold a pointer to a list of valid words in A2 that map to the sequence of symbols in A1 that form the path from the root of the trie to the current node.

    Thus, for our example, the trie would look something like this:

                                      Root (empty)
                                        | a
                                        |
                                        V
                                  +---Node (empty)---+
                                  | b                | c
                                  |                  |
                                  V                  V
                               Node (pr)          Node (px,qy)
    

    Starting at the root node, each symbol consumed triggers a transition from the current node to its child marked with that symbol, until we have read the entire string. If at any point no transition is defined for a symbol, the entered string does not exist in our trie and thus does not map to a valid word in our target language. Otherwise, at the end of the process, the list of words associated with the current node is the list of valid words the input string maps to.

    Apart from the initial cost of building the trie (the trie can be shipped pre-built if the list of valid words never changes), finding the list of matching valid words takes O(n) time, where n is the length of the input.

    Using a trie also has the advantage that you can find all valid words that can be generated by adding more symbols to the end of the input, i.e. a prefix match. For example, given the input symbol ‘a’, we can use the trie to find all valid words whose input begins with ‘a’ (‘px’, ‘qy’, ‘pr’). But this is not as fast as finding an exact match, because every node below the prefix has to be visited.
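A prefix match of this kind could be sketched as below. This is a separate, self-contained illustration; the class `PrefixTrie` and the methods `findByPrefix` and `collect` are names chosen for the example. Like the simple trie in this answer, it assumes the input alphabet is the lowercase letters 'a'-'z'.

```java
import java.util.*;

class PrefixTrie {
    private static class Node {
        Node[] next = new Node[26];              // one slot per letter 'a'-'z'
        List<String> words = new ArrayList<>();  // valid words ending at this node
    }

    private final Node root = new Node();

    // Assumes source contains only 'a'-'z'; anything else would be out of range.
    void addWord(String source, String target) {
        Node cur = root;
        for (char c : source.toCharArray()) {
            int i = c - 'a';
            if (cur.next[i] == null) {
                cur.next[i] = new Node();
            }
            cur = cur.next[i];
        }
        cur.words.add(target);
    }

    // Walks down to the node for the prefix, then gathers every word below it.
    List<String> findByPrefix(String prefix) {
        List<String> out = new ArrayList<>();
        Node cur = root;
        for (char c : prefix.toCharArray()) {
            int i = c - 'a';
            if (i < 0 || i >= 26 || cur.next[i] == null) {
                return out; // no input string starts with this prefix
            }
            cur = cur.next[i];
        }
        collect(cur, out);
        return out;
    }

    // Depth-first traversal of the subtree; cost is proportional to its size.
    private void collect(Node node, List<String> out) {
        out.addAll(node.words);
        for (Node child : node.next) {
            if (child != null) {
                collect(child, out);
            }
        }
    }
}
```

After adding ac -> px, ac -> qy and ab -> pr, `findByPrefix("a")` returns the words in child order: pr, px, qy.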

    Here’s a quick hack at a solution (in Java):

    import java.util.*;
    
    class TrieNode{
        // child nodes - size of array depends on your alphabet size,
        // here we are only using the lowercase English characters 'a'-'z'
        TrieNode[] next=new TrieNode[26];
        List<String> words;
    
        public TrieNode(){
            words=new ArrayList<String>();
        }
    }
    
    class Trie{
        private TrieNode root=null;
    
        public void addWord(String sourceLanguage, String targetLanguage){
            root=add(root,sourceLanguage.toCharArray(),0,targetLanguage);
        }
    
        private static int convertToIndex(char c){ // you need to change this for your alphabet
            return (c-'a');
        }
    
        private TrieNode add(TrieNode cur, char[] s, int pos, String targ){
            if (cur==null){
                cur=new TrieNode();
            }
            if (s.length==pos){
                cur.words.add(targ);
            }
            else{
    
                cur.next[convertToIndex(s[pos])]=add(cur.next[convertToIndex(s[pos])],s,pos+1,targ);
            }
            return cur;
        }
    
        public List<String> findMatches(String text){
            return find(root,text.toCharArray(),0);
    
        }
    
        private List<String> find(TrieNode cur, char[] s, int pos){
            if (cur==null) return new ArrayList<String>();
            else if (pos==s.length){
                return cur.words;
            }
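            else{
                int idx=convertToIndex(s[pos]);
                // characters outside the supported alphabet cannot match anything
                if (idx<0 || idx>=cur.next.length) return new ArrayList<String>();
                return find(cur.next[idx],s,pos+1);
            }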
        }
    }
    
    class MyMiniTransliterator{
        public static void main(String args[]){
            Trie t=new Trie();
            t.addWord("ac","px");
            t.addWord("ac","qy");
            t.addWord("ab","pr");
    
            System.out.println(t.findMatches("ac")); // prints [px, qy]
            System.out.println(t.findMatches("ab")); // prints [pr]
            System.out.println(t.findMatches("ba")); // prints [] since this does not match anything
        }
    }
    

    This is a very simple trie with no compression or speedups, and it only works with lowercase English characters for the input language. But it can easily be modified for other character sets.
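One possible modification for other character sets, offered as a sketch, is to replace the fixed 26-slot array with a map from character to child node, so no index conversion is needed. `MapTrie` is a hypothetical name, and the version below is iterative instead of recursive. It works on Java `char` values (UTF-16 code units), so characters outside the Basic Multilingual Plane are simply stored as two consecutive steps in the trie.

```java
import java.util.*;

class MapTrie {
    private static class Node {
        Map<Character, Node> next = new HashMap<>(); // children keyed by input character
        List<String> words = new ArrayList<>();      // valid target words for this path
    }

    private final Node root = new Node();

    // Registers that typing `source` should offer `target` as a valid word.
    void addWord(String source, String target) {
        Node cur = root;
        for (char c : source.toCharArray()) {
            cur = cur.next.computeIfAbsent(c, k -> new Node());
        }
        cur.words.add(target);
    }

    // Exact match: time proportional to the length of the input.
    List<String> findMatches(String text) {
        Node cur = root;
        for (char c : text.toCharArray()) {
            cur = cur.next.get(c);
            if (cur == null) {
                return Collections.emptyList(); // no transition for this character
            }
        }
        return Collections.unmodifiableList(cur.words);
    }
}
```

The trade-off is memory and speed per node: a hash map costs more than an array lookup, but it avoids allocating a slot for every character of a large alphabet at every node.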

© 2021 The Archive Base. All Rights Reserved