I have a vector with names, e.g.:
names <- "Jansen, A., Karel, A., Jong, A. de, Pietersen, K."
And I want to split this per name. In this case, I need to split at every "." that is followed by a comma, and at the comma following de (that name would be A. de Jong, which is typical in Dutch).
Right now I do:
strsplit(names,split="\\.\\,|\\<de\\>,")
But this also removes the de from the name:
[[1]]
[1] "Jansen, A" " Karel, A" " Jong, A. " " Pietersen, K."
How can I obtain the following result?
[[1]]
[1] "Jansen, A" " Karel, A" " Jong, A. de" " Pietersen, K."
polishchuk’s regex needs two modifications to make it work in R.
Firstly, the backslash needs escaping. Secondly, the call to strsplit needs the argument perl = TRUE to enable lookbehind. With both changes, it gives the answer Sacha asked for.
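polishchuk's exact pattern is not reproduced here, so the following is only a sketch of what such a call might look like. The pattern is an assumption for illustration: it splits on a literal "." followed by a comma, and uses a lookbehind so that the comma after the whole word de is matched without consuming de itself.

```r
names <- "Jansen, A., Karel, A., Jong, A. de, Pietersen, K."

# Assumed pattern (illustrative, not necessarily polishchuk's original):
#   \\.,            a literal dot followed by a comma
#   (?<=\\bde\\b),  a comma preceded by the whole word "de" (lookbehind)
# Backslashes are doubled because this is an R string literal, and
# perl = TRUE switches to the PCRE engine, which supports lookbehind.
parts <- strsplit(names, split = "\\.,|(?<=\\bde\\b),", perl = TRUE)
parts
```

Because the lookbehind only checks for de and does not consume it, the third element should come out as " Jong, A. de", as in the desired result in the question.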
Notice, though, that this still includes a dot in de Jong’s name, and it doesn’t extend to other name particles such as van, der, etc. I suggest the following alternative.