Editorial Team
Asked: June 3, 2026

One-line summary: suggest optimal (lookup-speed/compactness) data structure(s) for a multi-lingual dictionary representing primarily Indo-European languages (list at bottom).

Say you want to build some data structure(s) to implement a multi-language dictionary for let’s say the top-N (N~40) European languages on the internet, ranking choice of language by number of webpages (rough list of languages given at bottom of this question).
The aim is to store the working vocabulary of each language (e.g. 25,000 words for English), proper nouns excluded. Not sure whether we store plurals, verb conjugations, prefixes etc., or add language-specific rules on how these are formed from noun singulars or verb stems.
Also your choice on how we encode and handle accents, diphthongs and language-specific special characters e.g. maybe where possible we transliterate things (e.g. Romanize German ß as ‘ss’, then add a rule to convert it). Obviously if you choose to use 40-100 characters and a trie, there are way too many branches and most of them are empty.

Task definition: Whatever data structure(s) you use, you must do both of the following:

  1. The main operation in lookup is to quickly get an indication ‘Yes this is a valid word in languages A,B and F but not C,D or E’. So, if N=40 languages, your structure quickly returns 40 Booleans.
  2. The secondary operation is to return some pointer/object for that word (and all its variants) for each language (or null if it was invalid). This pointer/object could be user-defined, e.g. the Part-of-Speech and dictionary definition/thesaurus similes/list of translations into the other languages/… It could be language-specific or language-independent (e.g. a shared definition of pizza).

And the main metric for efficiency is a tradeoff of a) compactness (across all N languages) and b) lookup speed. Insertion time is not important. The compactness constraint excludes memory-wasteful approaches like “keep a separate hash for each word” or “keep a separate hash for each language, and each word within that language”.

So:

  1. What are the possible data structures, and how do they rank on the lookup speed/compactness curve?
  2. Do you have a unified structure for all N languages, or do you partition e.g. the Germanic languages into one sub-structure, Slavic into another etc.? Or just N separate structures (which would allow you to Huffman-encode)?
  3. What representation do you use for characters, accents and language-specific special characters?
  4. Ideally, give a link to an algorithm or code, esp. Python or else C.

(I checked SO and there have been related questions but not this exact question. Certainly not looking for a SQL database. One 2000 paper which might be useful: “Estimation of English and non-English Language Use on the WWW” – Grefenstette & Nioche. And one list of multi-language dictionaries)
Resources: two online multi-language dictionaries are Interglot (en/ge/nl/fr/sp/se) and LookWayUp (en<->fr/ge/sp/nl/pt).


Languages to include:

Probably mainly Indo-European languages for simplicity: English, French, Spanish, German, Italian, Swedish + Albanian, Czech, Danish, Dutch, Estonian, Finnish, Hungarian, Icelandic, Latvian, Lithuanian, Norwegian, Polish, Portuguese, Romanian, Russian, Serbo Croat, Slovak, Slovenian + Breton, Catalan, Corsican, Esperanto, Gaelic, Welsh

Probably include Russian, Slavic, Turkish, exclude Arabic, Hebrew, Iranian, Indian etc. Maybe include Malay family too. Tell me what’s achievable.

1 Answer

  1. Editorial Team
    Added an answer on June 3, 2026 at 4:03 pm

    I will not win points here, but here are some thoughts.

    A multi-language dictionary is a large and time-consuming undertaking. You did not say in detail what exactly your dictionary is intended for: probably statistical use, not translation, not grammar, …. Different uses require different data to be collected, for instance classifying “went” as past tense.

    First formulate your initial requirements in a document, together with a prototype of the programming interface. I often see people ask about data structures before the algorithmic design is done, especially for complex business logic. One then starts out wrong and risks feature creep, or premature optimisation, like that romanisation, which might have no advantage and would rule out bidirectionality.

    Maybe you can start with some active projects like Reta Vortaro; its XML might not be efficient, but it can give you some ideas for organisation. There are several academic linguistic projects. The most relevant aspect might be stemming: recognising greet/greets/greeted/greeter/greeting/greetings (@smci) as belonging to the same (major) entry. You want to use existing stemmers; they are often well tested and already applied in electronic dictionaries. My advice would be to research those projects without losing too much energy and impetus to them; just enough to collect ideas and see where they might be used.

    The data structures one can think up are IMHO of secondary importance. I would first collect everything in a well-defined database, and then generate from it the data structures used by the software. You can then compare and measure alternatives. And for a developer it might be the most interesting part: creating a beautiful data structure and algorithm.
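To make the stemming idea concrete, here is a minimal sketch of how surface forms could be collapsed onto one entry. It is not taken from any of the projects mentioned; the class `NaiveStemIndex`, its suffix list and its three-letter minimum are assumptions made up for illustration, and the rule set only pretends to cover English.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical, deliberately naive English suffix stripper; not a real stemmer. */
final class NaiveStemIndex {
    // Tried in order, so longer suffixes come before their shorter tails.
    private static final String[] SUFFIXES = {"ings", "ing", "ers", "ed", "er", "s"};

    // stem -> all surface forms seen for that stem
    private final Map<String, Set<String>> entries = new HashMap<>();

    static String stem(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        for (String suffix : SUFFIXES) {
            // Keep at least three letters so that "sing" is not cut down to "s".
            if (w.endsWith(suffix) && w.length() - suffix.length() >= 3) {
                return w.substring(0, w.length() - suffix.length());
            }
        }
        return w;
    }

    void add(String word) {
        entries.computeIfAbsent(stem(word), k -> new TreeSet<>()).add(word);
    }

    Set<String> variants(String word) {
        return entries.getOrDefault(stem(word), new TreeSet<>());
    }
}
```

After adding the six greet forms, `variants("greeted")` returns all of them. The sketch over-stems easily ("string" becomes "str"), which is exactly why an existing, tested stemmer (the Snowball family, for example) is the better choice for real data.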


    An answer

    Requirement:

    Map of word to list of [language, definition reference].
    List of definitions.

    Several words can have the same definition, hence the need for a definition reference.
    The definition could consist of a language-bound definition (grammatical properties, declensions), and/or a language-independent definition (description of the notion).

    One word can have several definitions (book = (noun) reading material, = (verb) reserve use of location).

    Remarks

    As single words are handled, this does not exploit the fact that an occurring text is in general monolingual. As a text can be of mixed languages, and I see no special overhead in the O-complexity, that seems irrelevant.

    So an over-general abstract data structure would be:

    Map<String /*Word*/, List<DefinitionEntry>> wordDefinitions;
    Map<String /*Language/Locale/""*/, List<Definition>> definitions;
    
    class Definition {
        String content;
    }
    
    class DefinitionEntry {
        String language;
        Ref<Definition> definition;
    }
    

    The concrete data structure:

    The wordDefinitions are best served with an optimised hash map.
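For the first operation in the question (N Booleans per word), one possible sketch is to keep a single bit mask per word in such a hash map. This is not part of the structure above: the class `LanguageMask`, the language codes and the use of `java.util.HashMap` (a primitive-valued map would be more compact) are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: one bit per language, packed into a long (up to 64 languages). */
final class LanguageMask {
    private final List<String> languages;                     // bit i <-> languages.get(i)
    private final Map<String, Long> masks = new HashMap<>();  // word -> language bits

    LanguageMask(List<String> languages) {
        if (languages.size() > Long.SIZE) {
            throw new IllegalArgumentException("at most 64 languages fit in a long");
        }
        this.languages = new ArrayList<>(languages);
    }

    void add(String word, String language) {
        int bit = languages.indexOf(language);
        if (bit < 0) {
            throw new IllegalArgumentException("unknown language: " + language);
        }
        // OR the new bit into whatever mask the word already has.
        masks.merge(word, 1L << bit, (a, b) -> a | b);
    }

    /** Returns the flags for all languages at once; 0 means the word is unknown. */
    long lookup(String word) {
        return masks.getOrDefault(word, 0L);
    }

    boolean isValid(String word, String language) {
        int bit = languages.indexOf(language);
        return bit >= 0 && (lookup(word) & (1L << bit)) != 0;
    }
}
```

One hash lookup then answers "valid in A, B and F but not in C, D or E" for all languages together, and the same key could later lead to the definition entries.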


    Please let me add:

    I did come up with a concrete data structure at last. I started with the following.

    Guava’s Multimap is conceptually what we have here, but Trove’s collections with primitive types are what one needs when using a compact binary representation in memory.

    One would do something like:

    import gnu.trove.map.TObjectIntMap;
    import gnu.trove.map.hash.TObjectIntHashMap;
    import java.io.ByteArrayInputStream;
    import java.io.DataInputStream;
    import java.io.IOException;
    
    /**
     * Map of word to DefinitionEntry.
     * Key: word.
     * Value: offset in byte array wordDefinitionEntries,
     * 0 serves as null, which implies a dummy byte (data version?)
     * in the byte array at [0].
     */
    TObjectIntMap<String> wordDefinitions = new TObjectIntHashMap<String>();
    byte[] wordDefinitionEntries = new byte[...]; // Actually read from file.
    
    void walkEntries(String word) throws IOException {
        // get() returns the map's no-entry value (0 by default) for an unknown word.
        int value = wordDefinitions.get(word);
        if (value == 0)
            return;
        DataInputStream in = new DataInputStream(
            new ByteArrayInputStream(wordDefinitionEntries));
        in.skipBytes(value); // Jump to the first entry of this word.
        int entriesCount = in.readShort();
        for (int entryno = 0; entryno < entriesCount; ++entryno) {
            int language = in.readByte();
            // walkDefinition is not shown here; it consumes one definition from the stream.
            walkDefinition(in, language); // Index to readUTF8 or gzipped bytes.
        }
    }
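The byte array that walkEntries reads has to be produced somewhere. A minimal sketch of one possible writer and matching reader, using only the standard library, follows. The record layout (a short count, then per entry one language byte plus the definition stored inline with writeUTF), the class `PackedEntries` and the use of `HashMap<String, Integer>` in place of Trove's map are all assumptions for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical builder/reader for the packed entry array; the layout is an assumption. */
final class PackedEntries {
    private final Map<String, Integer> offsets = new HashMap<>(); // stand-in for TObjectIntMap
    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    private final DataOutputStream out = new DataOutputStream(buffer);
    private byte[] data = new byte[0]; // filled by freeze()

    PackedEntries() {
        try {
            out.writeByte(1); // dummy byte at [0], so offset 0 can mean "no entry"
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Appends one word; languages[i] is a small language id, definitions[i] its text. */
    void add(String word, int[] languages, String[] definitions) {
        try {
            offsets.put(word, out.size()); // bytes written so far = start of this record
            out.writeShort(languages.length);
            for (int i = 0; i < languages.length; i++) {
                out.writeByte(languages[i]);
                out.writeUTF(definitions[i]);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Takes a snapshot of everything written so far; call before lookup(). */
    void freeze() {
        data = buffer.toByteArray();
    }

    /** Returns "languageId:definition" strings, or an empty list for an unknown word. */
    List<String> lookup(String word) {
        List<String> result = new ArrayList<>();
        int offset = offsets.getOrDefault(word, 0);
        if (offset == 0) {
            return result;
        }
        try {
            DataInputStream in = new DataInputStream(
                    new ByteArrayInputStream(data, offset, data.length - offset));
            int count = in.readShort();
            for (int i = 0; i < count; i++) {
                int language = in.readByte();
                result.add(language + ":" + in.readUTF());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }
}
```

Opening the input stream directly at the offset avoids skipping from the start of the array on every lookup. Storing definitions inline duplicates text shared between words; an index into a separate definition table, as the abstract structure suggests, would be more compact.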
    