I am looking for a good .NET regular expression that I can use for parsing out individual sentences from a body of text.
It should be able to parse the following block of text into exactly six sentences:
Hello world! How are you? I am fine.
This is a difficult sentence because I use I.D.
Newlines should also be accepted. Numbers should not cause
sentence breaks, like 1.23.
This is proving a little more challenging than I originally thought.
Any help would be greatly appreciated. I am going to use this to train the system on known bodies of text.
Try this:
@"(\S.+?[.!?])(?=\s+|$)"
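To see what the pattern does with the sample text from the question, here is a minimal sketch. The names `SentenceSplitter` and `Split` are made up for illustration. `RegexOptions.Singleline` is my own assumption and not part of the pattern as posted: it lets `.` match a newline so that a sentence can continue across a line break.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, named for illustration only.
public static class SentenceSplitter
{
    // The pattern from the answer: one non-whitespace character, then the
    // shortest run ending in '.', '!' or '?' that is followed by whitespace
    // or the end of the input.
    // RegexOptions.Singleline is an assumption added here so that '.' also
    // matches '\n' and a sentence may span a line break.
    private static readonly Regex SentencePattern =
        new Regex(@"(\S.+?[.!?])(?=\s+|$)", RegexOptions.Singleline);

    public static List<string> Split(string text)
    {
        var sentences = new List<string>();
        foreach (Match match in SentencePattern.Matches(text))
        {
            // Group 1 holds the sentence without the trailing whitespace.
            sentences.Add(match.Groups[1].Value);
        }
        return sentences;
    }
}

public static class Program
{
    public static void Main()
    {
        string text =
            "Hello world! How are you? I am fine.\n" +
            "This is a difficult sentence because I use I.D.\n" +
            "Newlines should also be accepted. Numbers should not cause\n" +
            "sentence breaks, like 1.23.";

        // Brackets make the sentence boundaries visible in the output.
        foreach (string sentence in SentenceSplitter.Split(text))
        {
            Console.WriteLine("[" + sentence + "]");
        }
    }
}
```

With this input the sketch yields six sentences, and the last one keeps its embedded line break. Without `RegexOptions.Singleline` the last match would start at "sentence breaks" and the words "Numbers should not cause" would be dropped. Note also that "I.D." only survives here because it ends its sentence; an abbreviation followed by a space in the middle of a sentence (for example "my I.D. card") would still be split.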
For more complicated text you will, of course, need a real parser such as SharpNLP or NLTK. Mine is just a quick-and-dirty one.
Here is the SharpNLP info, and features: