I am new in machine learning and computing probabilities. This is an example from

Question

Asked: June 3, 2026

I am new to machine learning and to computing probabilities. This is an example from LingPipe for adding syllabification to words using training data.

Given a source model p(h) for hyphenated words, and a channel model p(w|h) defined so that p(w|h) = 1 if w is equal to h with the hyphens removed and 0 otherwise, we then seek to find the most likely source message h to have produced message w by:

    ARGMAXh p(h|w) = ARGMAXh p(w|h) p(h) / p(w)
                   = ARGMAXh p(w|h) p(h)         
                   = ARGMAXh s.t. strip(h)=w p(h)

where we use strip(h) = w to mean that w is equal to h with the hyphenations stripped out (in Java terms, h.replaceAll(" ","").equals(w)). Thus with a deterministic channel, we wind up looking for the most likely hyphenation h according to p(h), restricting our search to h that produce w when the hyphens are stripped out.

I do not understand how to use it to build a syllabification model.

If there is a training set containing:

a bid jan
a bide
a bie
a bil i ty
a bim e lech

How do I build a model that will syllabify words? That is, what has to be computed in order to find the possible syllable breaks of a new word?

What should be computed first, and what next? Could you please be specific and give an example?

Thanks a lot.

1 Answer

Editorial Team · Answer 1 · 2026-06-03T21:45:52+00:00

The method described in the article is the noisy-channel approach: a statistical rule (Bayes' rule) that lets you recover the most likely correct value from a noisy observation. In other words, the non-syllabified word, such as picnic, is treated as the noisy value, and the goal is to find the most probable correct value, which is pic-nic.

Here is an excellent video lesson on this very topic (skip to 1:25, though the whole set of lectures is worth watching).
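To make "compute what first, then what" concrete, here is a minimal Python sketch of the idea, not LingPipe's actual implementation. It assumes a character bigram model with add-one smoothing as the source model p(h), trained on the five hyphenated words from the question, and it enumerates every h with strip(h) = w by brute force. The names train, log_p, candidates and syllabify are made up for this illustration.

```python
import math
from collections import Counter
from itertools import product

# Training data from the question; a space marks a syllable break.
TRAINING = ["a bid jan", "a bide", "a bie", "a bil i ty", "a bim e lech"]
START, END = "^", "$"  # hypothetical word-boundary markers

def train(hyphenated_words):
    """Step 1: count character bigrams, treating the break ' ' as a character."""
    bigrams, contexts, vocab = Counter(), Counter(), set()
    for h in hyphenated_words:
        chars = [START] + list(h) + [END]
        vocab.update(chars)
        for prev, cur in zip(chars, chars[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
    return bigrams, contexts, len(vocab)

def log_p(h, model):
    """Step 2: score a hyphenation h with the add-one smoothed bigram model."""
    bigrams, contexts, vocab_size = model
    chars = [START] + list(h) + [END]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (contexts[a] + vocab_size))
        for a, b in zip(chars, chars[1:])
    )

def candidates(w):
    """Step 3: yield every h with strip(h) == w (w must be non-empty).

    Each of the len(w) - 1 gaps between letters either gets a break or not,
    so there are 2 ** (len(w) - 1) candidates. Fine for short words only.
    """
    for breaks in product([False, True], repeat=len(w) - 1):
        out = w[0]
        for ch, has_break in zip(w[1:], breaks):
            out += (" " if has_break else "") + ch
        yield out

def syllabify(w, model):
    """Step 4: ARGMAX over h such that strip(h) = w of p(h)."""
    return max(candidates(w), key=lambda h: log_p(h, model))

model = train(TRAINING)
print(syllabify("abide", model))
```

With only five training words the result is not meaningful; the point is the order of the steps. A realistic model would use a much larger word list, a longer n-gram context, and a dynamic-programming search instead of enumerating all candidates.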

This method is especially useful for word segmentation, but some use it for syllabification as well. Chinese uses delimiters only between larger logical constructs; most words simply follow one another with no delimiter. However, each character is a syllable, without exception.

Other languages have more complicated rules. For instance, Thai has no spaces between words, and each syllable may be built from several symbols, e.g. สวัสดี -> ส-วัส-ดี. Rule-based syllabification may be hard, but it is possible.

As for English, I would not bother with Markov chains and N-grams and would instead just use several simple rules that give a pretty good match ratio (though not a perfect one):

Two consonants between two vowels (VCCV): split between them (VC-CV), as in cof-fee, pic-nic, except when they form a “cluster consonant” that represents a single sound: meth-od, Ro-chester, hang-out
Three or more consonants between the vowels (VCCCV): split so that the blends stay together, as in mon-ster or child-ren (this seems the most difficult case, as you cannot avoid a dictionary)
One consonant between two vowels (VCV): split after the first vowel (V-CV), as in ba-con, a-rid
The rule above also has exceptions based on blends: cour-age, play-time
Two vowels together (VV): split between them, unless they form a “cluster vowel”: po-em, but glacier, earl-ier

I would start with the “main” rules, and then cover them with “guard” rules that prevent cluster vowels and cluster consonants from being split. There would also be an obvious guard rule to prevent a single consonant from becoming a syllable. Once that is done, I would add another guard rule based on a dictionary.
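The main rules and guards above could be sketched in Python roughly as follows. This is only an illustration under several assumptions of mine: the two cluster tables are tiny hypothetical samples, 'y' is treated as a vowel except at the start of a word, a cluster consonant stays with the preceding syllable (which gives meth-od and hang-out but not Ro-chester), and three or more consonants are split after the first one. The dictionary-based guard is left out.

```python
# Hypothetical, deliberately small cluster tables; real ones would be longer.
CONSONANT_CLUSTERS = {"ch", "ck", "ng", "ph", "sh", "th"}
VOWEL_CLUSTERS = {"ai", "au", "ay", "ea", "ee", "ei", "ey",
                  "ie", "oa", "oi", "oo", "ou", "oy"}

def is_vowel(word, i):
    # Assumption: 'y' counts as a vowel except at the start of the word.
    return word[i] in "aeiou" or (word[i] == "y" and i > 0)

def syllabify(word):
    """Insert '-' at the break points chosen by the main and guard rules."""
    w = word.lower()
    vowels = [i for i in range(len(w)) if is_vowel(w, i)]
    breaks = []
    # Breaks are only ever placed between two consecutive vowels, so no
    # syllable can consist of consonants alone (the "obvious" guard rule).
    for i, k in zip(vowels, vowels[1:]):
        cons = w[i + 1:k]  # consonants between vowel i and vowel k
        if not cons:
            # VV: split unless the pair is a cluster vowel (guard rule).
            if w[i:k + 1] not in VOWEL_CLUSTERS:
                breaks.append(k)
        elif len(cons) == 1:
            # VCV: split after the first vowel (V-CV).
            breaks.append(i + 1)
        elif cons[:2] in CONSONANT_CLUSTERS:
            # Guard rule: never split a cluster consonant; break after it.
            breaks.append(i + 3)
        else:
            # VCCV and VCCCV: split after the first consonant.
            breaks.append(i + 2)
    parts, start = [], 0
    for b in breaks:
        parts.append(word[start:b])
        start = b
    parts.append(word[start:])
    return "-".join(parts)

for sample in ["coffee", "picnic", "bacon", "arid", "poem", "monster", "method"]:
    print(syllabify(sample))
# prints cof-fee, pic-nic, ba-con, a-rid, po-em, mon-ster, meth-od (one per line)
```

The sketch does not handle a silent final e or exceptions such as cour-age and child-ren; those are the cases where the dictionary guard would have to step in.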
