I want to implement the SRX segmentation rules in JavaScript to extract sentences from text.
To do this correctly, I have to follow the SRX rules,
e.g. http://www.lisa.org/fileadmin/standards/srx20.html#refTR29
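There are two types of rules, each expressed with regular expressions:
- break rules: if the pattern is found, the sentence should break, e.g. “. ”
- no-break rules: if the pattern is found, the sentence should not break, e.g. after an abbreviation such as U.K. or Mr.
Each rule in turn has two parts:
- before break: the pattern that must match the text just before the candidate break position
- after break: the pattern that must match the text just after it
For example, take the rule: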
<rule break="no">
<beforebreak>\s*[0-9]+\.</beforebreak>
<afterbreak>\s</afterbreak>
</rule>
This says that if the pattern “\s*[0-9]+\.\s” is found, the segment should not break.
How do I implement this in JavaScript? Maybe the split function alone is not enough?
You may want to try something like this:
To run this, simply call segment() with the text to split and the rules XML as a string. For example:
The call to segment() will return an array of sentences, so you can simply do something like alert(segment(...).join('\n')) to see the result.
Known Limitations:
All of these limitations seem quite easy to overcome.
How does this work?
The segment function uses the rulePattern to extract each rule, identify whether it is a breaking or non-breaking rule, and create a regexp based on the beforebreak and afterbreak clauses of the rule. It then scans the text and marks each matching place by adding a Unicode character (taken from a Unicode private use area) that marks whether it is a break (\uE001) or a non-break (\uE000). If another marker is already positioned in the same place, the rule is not matched, to preserve rule priorities.
Then it simply removes the non-break marks and splits the text according to the break marks.
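As a rough illustration of the steps above, here is a minimal sketch. It is not the original code, and it makes several assumptions: the helper name parseRules is hypothetical, the XML is read with a simplified regexp rather than a real parser, and the SRX patterns are assumed to be valid JavaScript regular expressions (SRX uses ICU syntax, which can differ). Instead of inserting \uE000/\uE001 marker characters into the text, this version records the decision for each offset in a Map, so later rules still match against the unmodified text. The "first rule to claim a position wins" priority is the same.

```javascript
// Hypothetical helper: pulls rules out of the SRX XML with a simplified regexp.
// Assumes well-formed <rule> elements and no XML entities inside the patterns.
function parseRules(rulesXml) {
  const rulePattern = /<rule(?:\s+break="(yes|no)")?\s*>([\s\S]*?)<\/rule>/g;
  const rules = [];
  for (const m of rulesXml.matchAll(rulePattern)) {
    const before = /<beforebreak>([\s\S]*?)<\/beforebreak>/.exec(m[2]);
    const after = /<afterbreak>([\s\S]*?)<\/afterbreak>/.exec(m[2]);
    rules.push({
      isBreak: m[1] !== 'no', // a missing break attribute is treated as "yes"
      before: before ? before[1] : '',
      after: after ? after[1] : '',
    });
  }
  return rules;
}

function segment(text, rulesXml) {
  // offset -> true (break here) or false (never break here)
  const decisions = new Map();
  for (const rule of parseRules(rulesXml)) {
    // Zero-width match: beforebreak must end at the offset (lookbehind)
    // and afterbreak must start at it (lookahead).
    const re = new RegExp(`(?<=${rule.before})(?=${rule.after})`, 'g');
    for (const m of text.matchAll(re)) {
      // Earlier rules have priority: keep the first decision for an offset.
      if (!decisions.has(m.index)) decisions.set(m.index, rule.isBreak);
    }
  }
  // Keep only the break offsets, in text order.
  const breaks = [...decisions]
    .filter(([, isBreak]) => isBreak)
    .map(([pos]) => pos)
    .sort((a, b) => a - b);
  const sentences = [];
  let start = 0;
  for (const pos of breaks) {
    if (pos > start) {
      sentences.push(text.slice(start, pos));
      start = pos;
    }
  }
  if (start < text.length) sentences.push(text.slice(start));
  return sentences;
}

// Usage: no-break rules come first so they take priority over the break rule.
const exampleRules = String.raw`
<rule break="no"><beforebreak>\s*[0-9]+\.</beforebreak><afterbreak>\s</afterbreak></rule>
<rule break="no"><beforebreak>\bMr\.</beforebreak><afterbreak>\s</afterbreak></rule>
<rule break="yes"><beforebreak>[.?!]</beforebreak><afterbreak>\s</afterbreak></rule>`;
const exampleText = 'Mr. Smith arrived. He took item 3. It was red.';
console.log(segment(exampleText, exampleRules).map((s) => s.trim()).join('\n'));
// Mr. Smith arrived.
// He took item 3. It was red.
```

Because the break falls between the beforebreak and afterbreak matches, each segment after the first starts with the whitespace that followed the previous sentence, hence the trim() in the usage lines. Lookbehind assertions and matchAll need a reasonably recent JavaScript engine.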
@Sourabh: I hope this is still relevant for you.