I’d like to edit the addresses of strings such as this example:
test = c("[Mavlyanova, Nadira G.] Uzbek Acad Sci, GA Mavlyanov Inst Seismol, Tashkent 700135, Uzbekistan; [Markovic, Slobodan B.] Univ Novi Sad, Fac Sci, Chair Phys Geog, Novi Sad 21000, Serbia; [Rowell, G.] Univ Adelaide, Sch Chem & Phys, Adelaide, SA 5005, Australia; [Katarzynski, K.] Nicholas Copernicus Univ, Torun Ctr Astron, PL-87100 Torun, Poland; [Ansari, Z.; Boettcher, M.; Manschwetus, B.; Rottke, H.; Sandner, W.] Max Born Inst, D-12489 Berlin, Germany; [Milosevic, D. B.] Univ Sarajevo, Fac Sci, Sarajevo 71000, Bosnia & Herceg")
I’d like to get only the country names. This is what I tried so far:
> testa <- gsub("\\[.*?\\] ", "", test) #remove square brackets
> testa <- strsplit(testa, ";", fixed = TRUE) #split addresses
> testa <- sapply(testa, function(x) gsub("^.*, ([A-Za-z ]*)$", "\\1", x)) #keep only what's after last comma
> testa <- gsub("^ | $", "", testa) #remove spaces
> testa
[,1]
[1,] "Uzbekistan"
[2,] "Serbia"
[3,] "Australia"
[4,] "Poland"
[5,] "Germany"
[6,] "Univ Sarajevo, Fac Sci, Sarajevo 71000, Bosnia & Herceg"
So this doesn’t work for the last address, unfortunately. I’d like to get the following output instead:
> testa
[,1]
[1,] "Uzbekistan"
[2,] "Serbia"
[3,] "Australia"
[4,] "Poland"
[5,] "Germany"
[6,] "Bosnia & Herceg"
My questions are:
- What’s the error in my sapply call that prevents it from working correctly with the last address as well?
- How can I improve it in order to achieve the correct output?
The problem is that the “everything after the last comma” part of your pattern allows only [A-Za-z ] as valid characters after that comma. This set does not include &, so the pattern fails to match the last address and no replacement is performed on it. Perhaps you should use [^,] to denote “anything but a comma” instead.
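The [^,] suggestion can be sketched as follows. This is only a minimal illustration, not necessarily the best approach: it assumes a shortened version of the input from the question, the variable names addresses and countries are made up for the example, and trimws() (available in base R since 3.2.0) stands in for the manual space removal.

```r
# Shortened sample of the input from the question (three of the six addresses)
test <- "[Rowell, G.] Univ Adelaide, Sch Chem & Phys, Adelaide, SA 5005, Australia; [Ansari, Z.; Boettcher, M.] Max Born Inst, D-12489 Berlin, Germany; [Milosevic, D. B.] Univ Sarajevo, Fac Sci, Sarajevo 71000, Bosnia & Herceg"

# Remove the bracketed author lists (same pattern as in the question)
addresses <- gsub("\\[.*?\\] ", "", test)

# Split into one address per element
addresses <- strsplit(addresses, ";", fixed = TRUE)[[1]]

# Keep only what follows the last comma: [^,]* matches any run of
# non-comma characters, so "&" no longer breaks the match
countries <- sub("^.*, *([^,]*)$", "\\1", addresses)

# Strip leading and trailing whitespace
countries <- trimws(countries)

print(countries)
# [1] "Australia"       "Germany"         "Bosnia & Herceg"
```

Because [^,] excludes only the comma, the same pattern should also cope with country names containing other characters such as hyphens or apostrophes, as long as the country is always the last comma-separated field.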