I know this might sound easy. My first thought was to cut the text at the first period (.), but that breaks as soon as abbreviations and short forms appear.
For example:
Sir Winston Leonard Spencer-Churchill, KG, OM, CH, TD, PC, DL, FRS,
Hon. RA (30 November 1874 – 24 January 1965) was a British politician
and statesman known for his leadership of the United Kingdom during
the Second World War. He is widely regarded as one of the great
wartime leaders and served as Prime Minister twice. A noted statesman
and orator, Churchill was also an officer in the British Army, a
historian, a writer, and an artist.
Here, the first period is the one in "Hon.", but I want the complete first sentence, ending at "Second World War."
Is this possible?
If you use nltk, you can add abbreviations to its Punkt sentence tokenizer so that a period after them is not treated as a sentence boundary.

This approach is based on Kiss & Strunk (2006), which reports that the F-score (harmonic mean of precision and recall) is between 91% and 99% for Punkt, depending on the test corpus.
Kiss, Tibor, and Jan Strunk. 2006. “Unsupervised Multilingual Sentence
Boundary Detection.” Computational Linguistics 32(4): 485–525.
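
The snippet that originally accompanied this answer is not available, so here is a minimal sketch of how adding abbreviations to Punkt might look. It assumes nltk is installed and builds an untrained `PunktSentenceTokenizer` from a `PunktParameters` object, so no model download is needed. The abbreviation list and the `first_sentence` helper are illustrative choices, not part of nltk.

```python
from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

# Illustrative abbreviation list (an assumption; extend it for your data).
# Punkt expects abbreviations in lowercase and without the trailing period.
ABBREVIATIONS = {"hon", "dr", "mr", "mrs", "prof", "inc", "vs"}


def first_sentence(text: str) -> str:
    """Return the first sentence of `text` (hypothetical helper)."""
    params = PunktParameters()
    params.abbrev_types = set(ABBREVIATIONS)
    # Passing the parameters object directly skips training on a corpus.
    tokenizer = PunktSentenceTokenizer(params)
    sentences = tokenizer.tokenize(text)
    return sentences[0] if sentences else ""


text = (
    "Sir Winston Leonard Spencer-Churchill, KG, OM, CH, TD, PC, DL, FRS, "
    "Hon. RA (30 November 1874 – 24 January 1965) was a British politician "
    "and statesman known for his leadership of the United Kingdom during "
    "the Second World War. He is widely regarded as one of the great "
    "wartime leaders and served as Prime Minister twice."
)

print(first_sentence(text))
```

With "hon" in the abbreviation set, the period after "Hon." should no longer end the sentence, so the result runs through "the Second World War." Any abbreviation missing from the set can still cause a false split, so the list needs tuning for the text you expect.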