I have a string containing a fragment of a book (roughly one chapter).
The whole string is on a single line.
I would like to insert a newline at the end of each sentence.
I solved it with this not-so-sophisticated code:
text = text.replaceAll("\\.","\\.\n"); //same for ? same for !
Of course, this does not yield very nice results.
I don't need this to be perfect, but the nicer I can get it, the better.
At a minimum, I would like to check the following before inserting a newline character:
the word before the . is longer than 2 characters
there are no other dots before the . in the same "word"
the character before the . is not a digit
the character after the . (possibly with whitespace in between) is not a (
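The four checks above can be sketched with a regex plus a small filter. This is only a rough sketch, not a complete sentence splitter: the class name SentenceBreaker and the method breakSentences are made up for illustration, and it assumes that a "word" is any run of non-whitespace characters and that sentences are separated by whitespace.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SentenceBreaker {
    // Candidate boundary: a run of non-whitespace (the "word"), then one or
    // more of . ! ?, then whitespace that is NOT followed by "(".
    // The possessive \s++ stops the lookahead from being satisfied by
    // backtracking into the whitespace.
    private static final Pattern CANDIDATE =
            Pattern.compile("(\\S+?)([.!?]+)\\s++(?!\\()");

    static String breakSentences(String text) {
        Matcher m = CANDIDATE.matcher(text);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            String word = m.group(1);
            boolean accept = word.length() > 2          // word longer than 2 characters
                    && word.indexOf('.') < 0            // no earlier dot in the same word
                    && !Character.isDigit(word.charAt(word.length() - 1)); // not a digit
            if (accept) {
                // Copy everything up to and including the punctuation,
                // then replace the following whitespace with a newline.
                out.append(text, last, m.end(2)).append('\n');
                last = m.end();
            }
        }
        out.append(text, last, text.length());
        return out.toString();
    }
}
```

Calling SentenceBreaker.breakSentences(text) returns the text with a newline after each accepted sentence end. Keep in mind that the length rule also suppresses breaks after legitimately short final words such as "it." or "go!", so these heuristics trade some missed boundaries for fewer false ones.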
Any other suggestions would be really appreciated, along with actual code which will make it happen.
Similar question:
Here
Update:
This is not high on my list of priorities, because my book doesn't contain many direct quotations or much direct speech, but a rule that handles sentences inside quotations would also be welcome, so that sentences from the same quote don't end up on separate lines.
Stanford’s CoreNLP toolkit has a class that does sentence segmentation. See more here.
If you say
new DocumentPreprocessor(new StringReader(s)).iterator()
where s is a string containing the text, it will give you back an iterator of sentences.
Note that this will tokenize the sentence as well. If you want the sentence to look the way it started, you can either just use this output as a guide for splitting, or run the
PTBTokenizer -untok
command (see same link as above) to make each tokenized sentence look normal again.
This will almost certainly work better than your list of rules, since your rules don't account for many of the important cases.