The Archive Base


Editorial Team
Asked: June 11, 2026

I have a text corpus which is already aligned at sentence level by construction

I have a text corpus which is already aligned at sentence level by construction: it is a list of pairs of English strings and their translations in another language. I have about 10,000 strings of 5 to 20 words each, with their translations. My goal is to build a metric of the quality of the translation, automatically of course, because I’m dealing with languages I know nothing about 🙂

I’d like to build a dictionary from this list of translations that would give me the (most probable) translation of each word in the source English strings into the other language. I know the dictionary will be far from perfect, but I’m hoping to get something good enough to flag when a word is not consistently translated. For example, if my dictionary says “Store” is to be translated into French as “Magasin”, then if I spot a place where “Store” is translated as “Boutique”, I can suspect that something is wrong.

So I’d need to:

  1. build a dictionary from my corpus
  2. align the words inside the string/translation pairs

Do you have good references on how to do this? Known algorithms? I found many links about text alignment but they seem to be more at the sentence level than at the word level…

Any other suggestion on how to automatically check whether a translation is consistent would be greatly appreciated!

Thanks in advance.

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Added an answer on June 11, 2026 at 3:12 am

    A freely available (specifically, GPL-licensed) tool for word alignment is GIZA++. It trains the well-known IBM models mentioned in other answers, as well as other statistical models.
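To give a feel for what the simplest of these models does, here is a minimal sketch of IBM Model 1 trained with EM in plain Python. It is not GIZA++'s implementation: the function names and the toy corpus are made up for illustration, and the sketch assumes whitespace-tokenized, lowercased input and leaves out the NULL word that the full model uses for untranslated words.

```python
from collections import defaultdict


def train_ibm_model1(pairs, iterations=10):
    """Estimate t[(f, e)] = P(target word f | source word e) with EM.

    pairs: list of (source_tokens, target_tokens) tuples.
    Simplification (assumption of this sketch): no NULL source word.
    """
    target_vocab = {f for _, tgt in pairs for f in tgt}
    uniform = 1.0 / len(target_vocab)
    t = defaultdict(lambda: uniform)  # start from a uniform table

    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (f, e) links
        total = defaultdict(float)  # expected count of links leaving e
        for src, tgt in pairs:
            for f in tgt:
                # E-step: share one unit of f among the source words
                norm = sum(t[(f, e)] for e in src)
                for e in src:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac
        # M-step: renormalize the expected counts per source word
        t = defaultdict(float, {(f, e): c / total[e]
                                for (f, e), c in count.items()})
    return t


def most_probable_translations(t):
    """Return {source_word: (best_target_word, probability)}."""
    best = {}
    for (f, e), p in t.items():
        if e not in best or p > best[e][1]:
            best[e] = (f, p)
    return best


# Toy corpus (hypothetical data, already tokenized and lowercased)
pairs = [
    ("the house".split(), "das haus".split()),
    ("the book".split(), "das buch".split()),
    ("a book".split(), "ein buch".split()),
]
table = train_ibm_model1(pairs, iterations=10)
dictionary = most_probable_translations(table)
for english, (foreign, prob) in sorted(dictionary.items()):
    print(english, "->", foreign, round(prob, 3))
```

On a corpus of about 10,000 short strings the resulting table will be noisy for rare words, so the probabilities are better treated as evidence than as a finished dictionary.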

    You can download GIZA++ from its site at Google Code, and the Apertium documentation on GIZA++ gives a brief introduction to its usage. It boils down to this procedure:

    1. Create your parallel corpus, sentence-aligned (you seem to have this already)
    2. Apply the plain2snt tool included in GIZA++ to extract word lists and sentence lists in GIZA++ format
    3. (Optional – only used for some models:) Generate word classes using the mkcls tool (also included)
    4. Run the actual word alignment tool GIZA++. There are various optional configuration settings you can use to determine the type of model generated.

    Before you can do this, you must build the tool from source code by running make. The code is written in C++ and compiles well with recent GCC versions.
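Step 1 of the procedure can be sketched in Python as follows. This assumes that plain2snt reads two plain-text files with one whitespace-tokenized sentence per line, where line i of one file is the translation of line i of the other. The function name, the lowercasing, and the file names are choices made for this sketch, not requirements of GIZA++.

```python
def write_parallel_files(pairs, src_path, tgt_path):
    """Write (source, target) string pairs to two line-aligned text files.

    Whitespace is normalized and text is lowercased (an assumption of
    this sketch). Pairs with an empty side are skipped so that the two
    files keep the same number of lines. Returns the number of pairs kept.
    """
    kept = 0
    with open(src_path, "w", encoding="utf-8") as src_file, \
         open(tgt_path, "w", encoding="utf-8") as tgt_file:
        for src, tgt in pairs:
            src_line = " ".join(src.lower().split())
            tgt_line = " ".join(tgt.lower().split())
            if not src_line or not tgt_line:
                continue  # an empty side would break the line alignment
            src_file.write(src_line + "\n")
            tgt_file.write(tgt_line + "\n")
            kept += 1
    return kept
```

For example, `write_parallel_files(pairs, "corpus.en", "corpus.fr")` (hypothetical file names) would produce the two files to pass on to step 2.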

    A few final notes:

    • Most words have more than one possible translation; you shouldn’t assume that a specific translation found in one text is necessarily wrong just because the same word is translated differently in another text;

    • One word may be translated into a (not necessarily contiguous) sequence of several words, and vice versa. Some words are not translated at all;

    • GIZA++ is a statistical tool that approximates the correct word alignment; many of the alignments it generates are questionable or incorrect.
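Given these caveats, one way to use the alignments for a consistency check is to flag only translations that are rare relative to the alternatives, rather than anything that differs from a single dictionary entry. The sketch below assumes you have already turned the alignment output into (sentence_id, source_word, target_word) links; that input format, the function name, and both thresholds are hypothetical and would need tuning on real data.

```python
from collections import Counter, defaultdict


def flag_rare_translations(aligned_links, min_share=0.2, min_count=5):
    """Flag links whose target word is a rare translation of the source word.

    aligned_links: iterable of (sentence_id, source_word, target_word).
    min_share: a translation used in less than this fraction of the
               source word's links is flagged.
    min_count: source words with fewer links than this are skipped,
               because there is too little evidence either way.
    """
    links = list(aligned_links)
    counts = defaultdict(Counter)
    for _, src, tgt in links:
        counts[src][tgt] += 1

    flagged = []
    for sent_id, src, tgt in links:
        total = sum(counts[src].values())
        if total < min_count:
            continue
        share = counts[src][tgt] / total
        if share < min_share:
            flagged.append((sent_id, src, tgt, share))
    return flagged


# Hypothetical links: "store" aligned to "magasin" 9 times, "boutique" once
links = [(i, "store", "magasin") for i in range(9)]
links.append((9, "store", "boutique"))
suspects = flag_rare_translations(links)
print(suspects)
```

A flagged link is only a candidate for review: it may be a legitimate alternative translation or an alignment error rather than a translation mistake.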


© 2021 The Archive Base. All Rights Reserved