I want to implement the SRX Segmentation Rules using JavaScript to extract sentences from text

Editorial Team

Asked: May 16, 2026 (2026-05-16T06:50:45+00:00)

I want to implement the SRX Segmentation Rules using JavaScript to extract sentences from text.

To do this correctly I will have to follow the SRX rules,

e.g. http://www.lisa.org/fileadmin/standards/srx20.html#refTR29

There are two types of rules:

rules whose pattern marks a place where the sentence should break, like “. “
rules whose pattern marks a place where the sentence should not break, like the abbreviations U.K. or Mr.

Each rule in turn has two parts:

the pattern before the break (beforebreak)
the pattern after the break (afterbreak)

For example, if the rule is

<rule break="no">

    <beforebreak>\s*[0-9]+\.</beforebreak>
    <afterbreak>\s</afterbreak>

</rule>

This says that if the pattern “\s*[0-9]+\.\s” is found, the segment should not break at that point (between the period and the whitespace).

How do I implement this in JavaScript? Maybe the split function alone is not enough?

1 Answer

Editorial Team · Answer 1 · 2026-05-16T06:50:45+00:00

You may want to try something like this:

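function segment(text, rules) {
    if (!text) return [];          // nothing to split: return an empty array
    if (!rules) return [text];     // no rules: the whole text is one segment

    // Matches one <rule> element and captures break="no", beforebreak and afterbreak.
    var rulePattern = /<rule(?:(\s+break="no")|\s+[^>]+|\s*)>(?:<beforebreak>([^<]+)<\/beforebreak>)?(?:<afterbreak>([^<]+)<\/afterbreak>)?<\/rule>/g;
    // replace() is used here only to visit each rule in document order.
    cleanXml(rules).replace(rulePattern,
        function(whole, nobreak, before, after) {
            // beforebreak, then "no marker here yet", then afterbreak as a lookahead.
            var r = new RegExp((before||'')+'(?![\uE000\uE001])'+(after?'(?='+after+')':''), 'mg');
            // Append a no-break (\uE000) or break (\uE001) marker after each match.
            text = text.replace(r, nobreak ? '$&\uE000' : '$&\uE001');
            return '';
        }
    );

    // Drop the no-break markers, then split on the break markers.
    var sentences = text.replace(/\uE000/g, '').split(/\uE001/g);

    return sentences;
}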

// Strips XML comments and whitespace between tags so that rulePattern can match.
function cleanXml(s) {
    return s && s.replace(/<!--[\s\S]*?-->/g,'').replace(/>\s+</g,'><');
}

To run this simply call segment() with the text to split, and the rules XML as a string. For example:

// Backslashes are doubled because the rules are written as JavaScript string literals;
// a single '\s' in a string literal would silently become a plain 's'.
segment('The U.K. Prime Minister, Mr. Blair, was seen out with his family today.',
        '<rule break="no">' +
            '<beforebreak>\\sMr\\.</beforebreak>' +
            '<afterbreak>\\s</afterbreak>' +
        '</rule>' +
        '<rule break="no">' +
            '<beforebreak>\\sU\\.K\\.</beforebreak>' +
            '<afterbreak>\\s</afterbreak>' +
        '</rule>' +
        '<rule break="yes">' +
            '<beforebreak>[\\.\\?!]+</beforebreak>' +
            '<afterbreak>\\s</afterbreak>' +
        '</rule>'
);

The call to segment() will return an array of sentences, so you can simply do something like alert(segment(...).join('\n')) to see the result.

Known Limitations:

It expects the rules to have already been through the cascading process, that is, it expects only the final list of rules that applies to the specific language.
It expects the regular expressions used by the rules to conform to JavaScript regexp syntax.
It does not handle internal markup.

All of these limitations seem quite easy to overcome.
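
For the first limitation, the selection of rules for a language could be sketched roughly as follows. This is only an illustration under assumptions: the object layout (languageMaps, languageRules, cascade) and the function name rulesForLanguage are made up here, the SRX file is assumed to be already parsed into that layout, and the language pattern is assumed to be matched against the whole language code. The cascade behaviour shown (all matching maps when cascading, only the first one otherwise) is my reading of the SRX 2.0 header attribute, so check it against the specification.

```javascript
// Hypothetical in-memory form of an SRX file's <maprules> and <languagerules>.
// Returns the ordered list of rules that applies to langCode.
function rulesForLanguage(langCode, srx) {
    var selected = [];
    for (var i = 0; i < srx.languageMaps.length; i++) {
        var map = srx.languageMaps[i];
        // Assumption: the pattern must match the whole language code.
        var matches = new RegExp('^(?:' + map.languagePattern + ')$').test(langCode);
        if (!matches) continue;
        selected = selected.concat(srx.languageRules[map.languageRuleName] || []);
        // Without cascading, only the first matching map is used.
        if (!srx.cascade) break;
    }
    return selected;
}

// Illustrative data only, not taken from a real SRX file.
var srx = {
    cascade: true,
    languageMaps: [
        { languagePattern: 'en.*', languageRuleName: 'English' },
        { languagePattern: '.*', languageRuleName: 'Default' }
    ],
    languageRules: {
        English: [{ breaks: false, before: '\\sMr\\.', after: '\\s' }],
        Default: [{ breaks: true, before: '[.?!]+', after: '\\s' }]
    }
};

var enRules = rulesForLanguage('en-GB', srx);
var frRules = rulesForLanguage('fr', srx);
console.log(enRules.length, frRules.length); // prints: 2 1
```

The resulting list would still have to be turned into the rule XML that segment() expects, or segment() adapted to take rule objects directly.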

How does this work?

The segment function uses rulePattern to extract each rule, determine whether it is a breaking or non-breaking rule, and build a regexp from the rule's beforebreak and afterbreak clauses. It then scans the text and marks each matching position by inserting a Unicode character (taken from the Unicode Private Use Area) that records whether the position is a break (\uE001) or a non-break (\uE000). If a marker is already present at a position, later rules do not match there, which preserves rule priority: the first rule to match a position wins.

Then it simply removes the non-break marks, and splits the text according to the break marks.
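
The marking step can be traced by hand on a short string. The following is just a sketch of the idea: the two rules are written directly as JavaScript regexp source strings instead of being parsed from SRX, and the names NO_BREAK, BREAK and mark are made up for this illustration.

```javascript
// Private-use characters used as markers, as in segment().
var NO_BREAK = '\uE000';
var BREAK = '\uE001';

// Hypothetical helper: append `marker` after every match of `before` that is
// followed by `after` and does not already have a marker at that position.
function mark(text, before, after, marker) {
    var r = new RegExp(before + '(?![\uE000\uE001])(?=' + after + ')', 'g');
    return text.replace(r, function (m) { return m + marker; });
}

var text = 'Ask Mr. Smith. He knows.';

// Higher-priority rule first: no break after "Mr." followed by whitespace.
var step1 = mark(text, '\\bMr\\.', '\\s', NO_BREAK);

// Then the general rule: break after sentence punctuation followed by whitespace.
// The period in "Mr." is skipped because a marker already sits right after it.
var step2 = mark(step1, '[.?!]+', '\\s', BREAK);

// Drop the no-break markers and split on the break markers.
var sentences = step2.split(NO_BREAK).join('').split(BREAK);

console.log(JSON.stringify(sentences)); // prints: ["Ask Mr. Smith."," He knows."]
```

Note that the whitespace matched by the afterbreak lookahead stays at the start of the following sentence, so you may want to trim each segment afterwards.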

@Sourabh: I hope this is still relevant for you.
