I would like to use R to extract the speakers from scripts formatted as in the following example:
“Scene 6: Second Lord: Nay, good my lord, put him to’t; let him have his way. First Lord: If your lordship find him not a hilding, hold me no more in your respect. Second Lord: On my life, my lord, a bubble. BERTRAM: Do you think I am so far deceived in him? Second Lord: Believe it, my lord, in mine own direct knowledge, without any malice, but to speak of him as my kinsman, he’s a most notable coward, an infinite and endless liar, an hourly promise-breaker, the owner of no one good quality worthy your lordship’s entertainment.”
In this example, I would like to extract: (“Second Lord”, “First Lord”, “Second Lord”, “BERTRAM”, “Second Lord”). The rule is obvious: the speaker is the group of words situated between the end of a sentence and a colon.
How can I write this in R?
Maybe something like this:
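A minimal sketch, assuming the stringr package is installed and the script is stored in a character vector named body (a shortened copy of the example above). The cleanup step that strips the surrounding punctuation is an addition for illustration, not part of the original regex:

```r
library(stringr)

# Shortened copy of the example script (assumed variable name: body)
body <- "Scene 6: Second Lord: Nay, good my lord, put him to't; let him have his way. First Lord: If your lordship find him not a hilding, hold me no more in your respect. Second Lord: On my life, my lord, a bubble. BERTRAM: Do you think I am so far deceived in him? Second Lord: Believe it, my lord, he's a most notable coward."

# Each raw match still carries the leading punctuation and the trailing colon,
# e.g. ". First Lord:"
raw <- str_extract_all(body, "[:.?] [A-z ]*:")[[1]]

# Strip the leading "<punctuation><space>" and the trailing ":"
speakers <- str_remove_all(raw, "^[:.?] |:$")
speakers
```

str_extract_all returns a list with one element per input string, hence the [[1]].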
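Explanation of the regex (which is not perfect):
str_extract_all(body, "[:.?] [A-z ]*:")
A match starts with either ":", "." or "?" ([:.?]) followed by a whitespace. Any run of letters and spaces is then matched until the next ":". Note that [A-z] also covers the few ASCII punctuation characters that sit between "Z" and "a", so [A-Za-z ] is the stricter choice, and that each match still includes the leading punctuation and the trailing colon.
Get position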
You can use str_locate_all with the same regex:
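A minimal sketch of the position lookup, again assuming stringr and a character vector named body holding a shortened copy of the example script:

```r
library(stringr)

# Shortened copy of the example script (assumed variable name: body)
body <- "Scene 6: Second Lord: Nay, good my lord, put him to't; let him have his way. First Lord: If your lordship find him not a hilding, hold me no more in your respect. Second Lord: On my life, my lord, a bubble. BERTRAM: Do you think I am so far deceived in him? Second Lord: Believe it, my lord, he's a most notable coward."

# One row per match, with integer columns "start" and "end"
pos <- str_locate_all(body, "[:.?] [A-z ]*:")[[1]]
pos

# Sanity check: cutting the string at those positions gives back the raw matches
matched <- str_sub(body, pos[, "start"], pos[, "end"])
matched
```

The positions cover the whole match, so start points at the leading punctuation and end at the trailing colon; the name itself runs from start + 2 to end - 1.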