I’ve got two word lists, an example: list 1 list 2 foot fuut barj

Question

Asked: June 5, 2026 (2026-06-05T00:06:07+00:00)

I’ve got two word lists, an example:

 list 1  list 2

 foot    fuut
 barj    kijo
 foio    fuau
 fuim    fuami
 kwim    kwami
 lnun    lnun
 kizm    kazm

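I’d like to find the substitutions that turn a word in list 1 into its counterpart in list 2, for example:

o → u # rows 1 and 3
i → a # rows 3 and 7
im → ami # rows 4 and 5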

The results should be ordered by number of occurrences, so I can filter
out the ones that don’t appear often.

The lists currently consist of 35k words; the calculation should
take about 6 hours on an average server.

1 Answer

Editorial Team · Answer 1 · 2026-06-05T00:06:08+00:00

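My final solution was to use mosesdecoder, the Moses statistical machine
translation toolkit. I split the words into single characters, used the
resulting files as a parallel corpus, and then worked with the phrase
pairs extracted during training. The two lists I compared are Sursilvan
and Vallader.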

export IRSTLM=$HOME/rumantsch/mosesdecoder/tools/irstlm
export PATH=$PATH:$IRSTLM/bin

# Start from a clean state
rm -rf corpus giza.* model
array=("sur" "val")
for i in "${array[@]}"; do
    cp "raw.$i" "splitted.$i"
    # Protect real spaces as "@", then put a space after every character
    sed -i 's/ /@/g' "splitted.$i"
    sed -i 's/./& /g' "splitted.$i"
    # Add sentence boundary markers and build a language model with IRSTLM
    add-start-end.sh < "splitted.$i" > "compiled.$i"
    build-lm.sh -i "compiled.$i" -t ./tmp -p -o "compiled.lm.$i"
    compile-lm --text yes "compiled.lm.$i.gz" "compiled.arpa.$i"
done
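
To see what the two sed calls in the loop do to a single entry, here is a small sketch that applies the same substitutions to standard input instead of editing a file in place. The function name split_chars and the sample entry "la via" are made up for illustration; they are not part of the original setup.

```shell
# Hypothetical helper: the same two substitutions as in the loop,
# applied to stdin rather than with "sed -i" on a file.
split_chars() {
    # 1) turn real spaces into "@" so they survive as their own token
    # 2) append a space after every character
    sed -e 's/ /@/g' -e 's/./& /g'
}

# Illustrative input: one word from the question and one made-up two-word entry
printf '%s\n' 'foot' 'la via' | split_chars
```

Each character becomes its own "word", so Moses treats a word pair as a sentence pair and aligns letters instead of words. Note that every output line ends with a trailing space.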

../scripts/training/train-model.perl --first-step 1 --last-step 5 -root-dir . -corpus splitted -f sur -e val -lm 0:3:$PWD/compiled.arpa.sur -extract-options "--SentenceId" -external-bin-dir ../tools/bin/

$HOME/rumantsch/mosesdecoder/scripts/../bin/extract $HOME/rumantsch/mosesdecoder/rumantsch/splitted.val $HOME/rumantsch/mosesdecoder/rumantsch/splitted.sur $HOME/rumantsch/mosesdecoder/rumantsch/model/aligned.grow-diag-final $HOME/rumantsch/mosesdecoder/rumantsch/model/extract 7 --SentenceId --GZOutput

zcat model/extract.sid.gz | awk -F '[ ][|][|][|][ ]' '$1!=$2{print $1, "|", $2}' | sort | uniq -c | sort -nr | head -n 10 > results
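
The counting step in the last pipeline can be tried without running Moses at all. The sketch below feeds a few hand-written lines through the same awk, sort and uniq chain. The sample lines are invented and only assume that the first two "|||"-separated fields hold the source and target phrase; the function name count_changes is likewise hypothetical.

```shell
# Invented sample lines in the assumed "source ||| target ||| alignment" layout
sample='o ||| u ||| 0-0
o ||| u ||| 0-0
i ||| a ||| 0-0
f ||| f ||| 0-0
i m ||| a m i ||| 0-0 1-1 1-2'

# Hypothetical helper: keep only pairs whose two sides differ,
# then count identical pairs and list the most frequent first.
count_changes() {
    awk -F '[ ][|][|][|][ ]' '$1 != $2 { print $1, "|", $2 }' \
        | sort | uniq -c | sort -nr
}

result=$(printf '%s\n' "$sample" | count_changes)
printf '%s\n' "$result"
```

The identical pair f | f is dropped by the awk condition, and o | u comes first with a count of 2. On the real data, a threshold on the count column could replace head -n 10 to drop the rare pairs mentioned in the question.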
