I’m struggling with the problem to cut the very first sentence from the string.

Asked: June 1, 2026

I’m struggling to cut the very first sentence from a string.
It wouldn’t be such a problem if there were no abbreviations ending with a dot.

So my example is:

string = 'I like cheese, cars, etc. but my the most favorite website is stackoverflow. My new horse is called Randy.'

And the result should be:

result = 'I like cheese, cars, etc. but my the most favorite website is stackoverflow.'

Normally I would do it with:

re.findall(r'^(\s*.*?\s*)(?:\.|$)', string)

but I would like to skip some predefined words, such as the abbreviation etc. mentioned above.

I came up with a couple of expressions, but none of them worked.

1 Answer

Editorial Team · Answer 1 · 2026-06-01T08:51:22+00:00

You could try NLTK’s Punkt sentence tokenizer. It uses an unsupervised algorithm to learn which tokens are abbreviations, so you don’t have to maintain an ad-hoc list of abbreviations yourself.

NLTK includes a pre-trained one for English; load it with:

nltk.data.load('tokenizers/punkt/english.pickle')

From the source code:

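>>> import nltk.data
>>> text = '''
... Punkt knows that the periods in Mr. Smith and Johann S. Bach
... do not mark sentence boundaries.  And sometimes sentences
... can start with non-capitalized words.  i is a good variable
... name.
... '''
>>> sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
>>> print('\n-----\n'.join(sent_detector.tokenize(text.strip())))
Punkt knows that the periods in Mr. Smith and Johann S. Bach
do not mark sentence boundaries.
-----
And sometimes sentences
can start with non-capitalized words.
-----
i is a good variable
name.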
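
If installing NLTK and its tokenizer data is not an option, the regex approach from the question can be extended to skip a fixed list of abbreviations. The following is only a rough sketch of that alternative, not how Punkt works: the `ABBREVIATIONS` tuple and the `first_sentence` helper are hypothetical names made up for this example. It assumes a sentence ends at a period followed by whitespace or the end of the string, and it will miss any abbreviation that is not in the list.

```python
import re

# Hypothetical abbreviation list (written without the trailing dot);
# extend it to match your own data.
ABBREVIATIONS = ("etc", "e.g", "i.e", "Mr", "Mrs", "Dr")


def first_sentence(text, abbreviations=ABBREVIATIONS):
    """Return the text up to the first period that does not end a listed abbreviation."""
    # One fixed-width negative lookbehind per abbreviation, e.g. (?<!\betc).
    guards = "".join(r"(?<!\b{})".format(re.escape(a)) for a in abbreviations)
    # Lazily consume characters until a period that passes every guard
    # and is followed by whitespace or the end of the string.
    pattern = re.compile(r"^\s*(.*?" + guards + r"\.)(?=\s|$)", re.DOTALL)
    match = pattern.search(text)
    # Fall back to the whole (stripped) text if no sentence end is found.
    return match.group(1) if match else text.strip()


string = ('I like cheese, cars, etc. but my the most favorite website is '
          'stackoverflow. My new horse is called Randy.')
print(first_sentence(string))
```

With the example string from the question, this prints the text up to and including "stackoverflow." because the period after "etc" is rejected by its lookbehind. A learned tokenizer such as Punkt is likely the more robust choice for free-form text, since a hand-written list cannot anticipate every abbreviation.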
