I am new to R and am puzzled by this problem when manipulating some environmental monitoring data.
I have two datasets recording the actual monitoring time-series and the monitoring site information, respectively. I stored them in two data frames monitoring and sites:
monitoring:
date site obs
1 2001-01-01 10:00:00 riverside NA
2 2001-01-01 11:00:00 riverside 52
3 2001-01-01 12:00:00 riverside 52
4 2001-01-01 13:00:00 riverside 56
5 2001-01-01 10:00:00 dorm 52
6 2001-01-01 11:00:00 dorm 64
7 2001-01-01 12:00:00 dorm 76
8 2001-01-01 13:00:00 dorm 80
9 2001-01-01 10:00:00 kfc 78
10 2001-01-01 11:00:00 kfc 74
11 2001-01-01 12:00:00 kfc 66
12 2001-01-01 13:00:00 kfc 68
sites:
site type
1 DORM suburban
2 KFC urban
3 RIVERSIDE rural
I want to add a site.type column in monitoring with information extracted from sites as shown below:
date site obs site.type
1 2001-01-01 10:00:00 riverside NA rural
2 2001-01-01 11:00:00 riverside 52 rural
3 2001-01-01 12:00:00 riverside 52 rural
4 2001-01-01 13:00:00 riverside 56 rural
5 2001-01-01 10:00:00 dorm 52 suburban
6 2001-01-01 11:00:00 dorm 64 suburban
7 2001-01-01 12:00:00 dorm 76 suburban
8 2001-01-01 13:00:00 dorm 80 suburban
9 2001-01-01 10:00:00 kfc 78 urban
10 2001-01-01 11:00:00 kfc 74 urban
11 2001-01-01 12:00:00 kfc 66 urban
12 2001-01-01 13:00:00 kfc 68 urban
I tried grep() in the following command:
for (i in 1:nrow(monitoring)) {
  monitoring$site.type[i] <- as.character(sites$type[grep(monitoring$site[i], sites$site, ignore.case = TRUE)])
}
It worked OK on this small example set of monitoring. However, when I applied it to my real dataset with 654,525 records, it never stopped running on my i5-2400 computer with 16 GB RAM…
I searched for existing questions on Stack Overflow and did find some answers offering the same solution for similar scenarios, so I was even more confused about why it did not work in my case. Therefore,
- Could someone kindly point out where the problem is?
- May I ask how to avoid for looping in this case, as it may not be as “fashionable” and efficient? 🙂
Many thanks in advance.
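The proper way to do it is to use merge, as Ben suggested, but here is a simple trick: key sites by the lowercased site names (for example, by assigning them as row names).
Now you can access sites using keys such as riverside; for example, try sites["riverside", ]. The tolower() function is used only to turn RIVERSIDE into riverside. Therefore, you can look up the type for every row of monitoring in a single indexing step, with no loop.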
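The answer's original code did not survive in this copy, so the following is only a sketch of what the trick might look like, rebuilt from the description above. The toy data frames mirror the example in the question. Using rownames() to key sites and wrapping the site column in as.character() are assumptions on my part, not necessarily what the answerer wrote. The merge() variant at the end is likewise just one plausible way to set up the join.

```r
# Small stand-ins for the two data frames in the question.
monitoring <- data.frame(
  date = rep(as.POSIXct(sprintf("2001-01-01 %02d:00:00", 10:13), tz = "UTC"),
             times = 3),
  site = rep(c("riverside", "dorm", "kfc"), each = 4),
  obs  = c(NA, 52, 52, 56, 52, 64, 76, 80, 78, 74, 66, 68),
  stringsAsFactors = FALSE
)
sites <- data.frame(
  site = c("DORM", "KFC", "RIVERSIDE"),
  type = c("suburban", "urban", "rural"),
  stringsAsFactors = FALSE
)

# merge() route: join on a lowercased key column (`key` is a hypothetical name).
# merge() sorts the result by the key by default, so row order may change.
lookup <- data.frame(key = tolower(sites$site), site.type = sites$type,
                     stringsAsFactors = FALSE)
merged <- merge(monitoring, lookup, by.x = "site", by.y = "key", all.x = TRUE)

# Row-name trick (assumed reconstruction): key `sites` by lowercased name.
rownames(sites) <- tolower(sites$site)

# One vectorized lookup replaces the row-by-row grep() loop.
# as.character() guards against `site` being a factor, which would otherwise
# index by integer level codes rather than by name.
monitoring$site.type <- sites[as.character(monitoring$site), "type"]
```

Both variants do the lookup in one vectorized call instead of running grep() and a single-element assignment once per row, which is likely what made the loop so slow on 654,525 records. The row-name lookup keeps monitoring in its original row order, whereas merge() returns a new, re-sorted data frame.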