I need to remove all words before the dash at the beginning of each sentence. Some sentences have no words before a dash, and dashes that occur later in a sentence need to stay. Here is an example:
How do I change these strings:
PARIS — President Nicolas Sarkozy, running from behind for reelection…
GAZA CITY —Cross-border fighting between Gaza and Israel…
CARURU, Colombia — Quite suddenly, the endless green of Amazonian forest…
A year after an earthquake and tsunami devastated Japan’s northeastern coast…
Into these strings:
President Nicolas Sarkozy, running from behind for reelection…
Cross-border fighting between Gaza and Israel…
Quite suddenly, the endless green of Amazonian forest…
A year after an earthquake and tsunami devastated Japan’s northeastern coast…
How can I accomplish this with JavaScript (or PHP, if JavaScript doesn’t allow it)?
This is a pretty straightforward regex problem, but geez, it’s not as straightforward as all the other answers assume. A few points:
- Regex is the right choice. The split and substr answers won’t deal with the leading space, and can’t distinguish between a dateline with a dash at the beginning of a sentence and a dash in the middle of your text content. Any option you use ought to be able to deal with content like "President Nicolas Sarkozy — running from behind for reelection — came to Paris today..." as well as the options you suggest.
- It’s tricky to automatically recognize that my test sentence above doesn’t have a dateline. Almost all the answers so far use the single description "any number of arbitrary characters, followed by a dash". That’s insufficient for a test sentence like the one above.
- You’ll get better results by adding a few more rules, like: fewer than X characters, located at the beginning of the string, followed by a dash, optionally followed by an arbitrary number of spaces, followed by a capital letter. Even this won’t work correctly with "President Sarkozy — Carla Bruni's husband...", but you’re going to have to assume that this edge case is sufficiently rare to ignore.
All of which gives you a function like this:
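A minimal sketch of a function that follows those rules. The name removeDateline and the exact pattern are assumptions pieced together from the rules above and the breakdown that follows, so treat this as an illustration rather than the definitive version. It also assumes the text proper starts with an ASCII capital letter.

```javascript
// Hypothetical helper: the name and the exact pattern are assumptions
// based on the rules described in the answer.
function removeDateline(str) {
  // ^            anchor at the beginning of the string
  // [^—]{3,75}   3 to 75 characters, none of them an em dash (the dateline)
  // —            the em dash that ends the dateline
  // \s*          optional whitespace after the dash
  // (?=[A-Z])    lookahead: the remaining text must start with a capital letter
  return str.replace(/^[^—]{3,75}—\s*(?=[A-Z])/, "");
}
```

Because the lookahead requires a capital letter, a sentence whose first dash is followed by a lowercase word is left alone. A sentence like "President Sarkozy — Carla Bruni's husband..." would still be trimmed, which is the edge case accepted above.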
Breaking it down:
^ – must occur at the beginning of the string.
[^—]{3,75} – between 3 and 75 characters other than a dash
\s* – optional spaces
Usage:
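A short usage sketch, run against the strings from the question plus the test sentence with dashes in the middle. The helper is repeated so the snippet runs on its own. Its name, removeDateline, and the pattern are assumptions, not necessarily the original code.

```javascript
// Hypothetical helper, repeated here so this snippet is self-contained.
function removeDateline(str) {
  return str.replace(/^[^—]{3,75}—\s*(?=[A-Z])/, "");
}

const samples = [
  "PARIS — President Nicolas Sarkozy, running from behind for reelection…",
  "GAZA CITY —Cross-border fighting between Gaza and Israel…",
  "CARURU, Colombia — Quite suddenly, the endless green of Amazonian forest…",
  "A year after an earthquake and tsunami devastated Japan’s northeastern coast…",
  "President Nicolas Sarkozy — running from behind for reelection — came to Paris today...",
];

// Strip the dateline from each string; strings without one pass through unchanged.
const cleaned = samples.map(removeDateline);
cleaned.forEach((s) => console.log(s));
```

The first three strings lose their datelines, while the last two come back untouched: one has no dash at all, and in the other the first dash is followed by a lowercase word.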