I’m struggling to extract the very first sentence from a string.
It wouldn’t be such a problem if there were no abbreviations ending with a dot.
So my example is:
- string = ‘I like cheese, cars, etc. but my the most favorite website is stackoverflow. My new horse is called Randy.’
And the result should be:
- result = ‘I like cheese, cars, etc. but my the most favorite website is stackoverflow.’
Normally I would do this with:
re.findall(r'^(\s*.*?\s*)(?:\.|$)', event)
but I would like to skip some pre-defined words, like the above-mentioned "etc.".
I came up with a couple of expressions, but none of them worked.
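For reference, one way a fixed abbreviation list could be wired into the regex is with negative lookbehinds. This is only a minimal sketch: the names `ABBREVIATIONS`, `FIRST_SENTENCE` and `first_sentence_regex` are hypothetical, and the list covers only the abbreviations you put into it.

```python
import re

# Hypothetical list of abbreviations whose trailing dot must not end a sentence.
ABBREVIATIONS = ("etc", "e.g", "i.e")

# One negative lookbehind per abbreviation; each is fixed-width, as Python's re requires.
_not_abbrev = "".join(rf"(?<!\b{re.escape(a)})" for a in ABBREVIATIONS)

# Lazily match up to the first dot that is not preceded by a listed abbreviation
# and is followed by whitespace or the end of the string.
FIRST_SENTENCE = re.compile(rf"^\s*(.+?{_not_abbrev}\.)(?=\s|$)", re.DOTALL)


def first_sentence_regex(text):
    match = FIRST_SENTENCE.search(text)
    # Fall back to the whole (stripped) string when no sentence-ending dot is found.
    return match.group(1) if match else text.strip()
```

With the example string above, the dot after "etc" is skipped and the match ends at the dot after "stackoverflow". Any abbreviation missing from the list will still cut the sentence short.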
You could try NLTK’s Punkt sentence tokenizer, which does this kind of thing using a real algorithm to figure out what the abbreviations are instead of your ad-hoc collection of abbreviations.
NLTK includes a pre-trained one for English; load it with:
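A minimal sketch of what this could look like, assuming NLTK is installed and its Punkt data has already been fetched (for example with `nltk.download("punkt")`, or `"punkt_tab"` on recent releases). It goes through `sent_tokenize`, which loads the pre-trained English model behind the scenes; the wrapper name `first_sentence_punkt` is hypothetical.

```python
def first_sentence_punkt(text):
    # Imported inside the function so that a module containing this sketch
    # can still be imported when NLTK is not installed.
    from nltk.tokenize import sent_tokenize

    # sent_tokenize uses the pre-trained English Punkt model; it raises
    # LookupError if the Punkt data has not been downloaded yet.
    sentences = sent_tokenize(text, language="english")
    return sentences[0] if sentences else ""
```

Calling `first_sentence_punkt(string)` on the example from the question returns the first element of the tokenizer's sentence list, so the result depends on what the trained model considers an abbreviation.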
From the source code: