DATA AND REQUIREMENTS
The first table (myMatrix1) is from an old geological survey that used different region boundaries (begin and finish) columns to the newer survey.
What I wish to do is to match the begin and finish boundaries and then create two tables one for the new data on sedimentation and one for the new data on bore width characterised as a boolean.
myMatrix1 <- read.table("/path/to/file")
myMatrix2 <- read.table("/path/to/file")
> head(myMatrix1) # this is the old data
sampleIDs begin finish
1 19990224 4 5
2 20000224 5 6
3 20010203 6 8
4 20019024 29 30
5 20020201 51 52
> head(myMatrix2) # this is the new data
begin finish sedimentation boreWidth
1 0 10 1.002455 0.014354
2 11 367 2.094351 0.056431
3 368 920 0.450275 0.154105
4 921 1414 2.250820 1.004353
5 1415 5278 0.114109 NA`
Desired output:
> head(myMatrix6)
sampleIDs begin finish sedimentation #myMatrix4
1 19990224 4 5 1.002455
2 20000224 5 6 1.002455
3 20010203 6 8 2.094351
4 20019024 29 30 2.094351
5 20020201 51 52 2.094351
> head(myMatrix7)
sampleIDs begin finish boreWidthThresh #myMatrix5
1 19990224 4 5 FALSE
2 20000224 5 6 FALSE
3 20010203 6 8 FALSE
4 20019024 29 30 FALSE
5 20020201 51 52 FALSE`
CODE
The following code has taken me several hours to run on my dataset (about 5 million data points). Is there any way to change the code to make it run any faster?
# create empty matrix for sedimentation
myMatrix6 <- data.frame(NA,NA,NA,NA)[0,]
names(myMatrix6) <- letters[1:4]
# create empty matrix for bore
myMatrix7 <- data.frame(NA,NA,NA,NA)[0,]
names(myMatrix7) <- letters[1:4]
for (i in 1:nrow(myMatrix2))
{
# create matrix that has the value of myMatrix1$begin being
# situated between the values of myMatrix2begin[i] and myMatrix2finish[i]
myMatrix3 <- myMatrix1[which((myMatrix1$begin > myMatrix2$begin[i]) & (myMatrix1$begin < myMatrix2$finish[i])),]
myMatrix4 <- rep(myMatrix2$sedimentation, nrow(myMatrix3))
if (is.na(myMatrix2$boreWidth[i])) {
myMatrix5 <- rep(NA, nrow(myMatrix3))
}
else if (myMatrix2$boreWidth[i] == 0) {
myMatrix5 <- rep(TRUE, nrow(myMatrix3))
}
else if (myMatrix2$boreWidth[i] > 0) {
myMatrix5 <- rep(FALSE, nrow(myMatrix3))
}
myMatrix6 <- rbind(myMatrix6, cbind(myMatrix3, myMatrix4))
myMatrix7 <- rbind(myMatrix7, cbind(myMatrix3, myMatrix5))
}
EDIT:
> dput(head(myMatrix2)
structure(list(V1 = structure(c(6L, 1L, 2L, 4L, 5L, 3L), .Label = c("0",
"11", "1415", "368", "921", "begin"), class = "factor"), V2 = structure(c(6L,
1L, 3L, 5L, 2L, 4L), .Label = c("10", "1414", "367", "5278",
"920", "finish"), class = "factor"), V3 = structure(c(6L, 3L,
4L, 2L, 5L, 1L), .Label = c("0.114109", "0.450275", "1.002455",
"2.094351", "2.250820", "sedimentation"), class = "factor"),
V4 = structure(c(5L, 1L, 2L, 3L, 4L, 6L), .Label = c("0.014354",
"0.056431", "0.154105", "1.004353", "boreWidth", "NA"), class = "factor")), .Names = c("V1",
"V2", "V3", "V4"), row.names = c(NA, 6L), class = "data.frame")
> dput(head(myMatrix1)
structure(list(V1 = structure(c(6L, 1L, 2L, 3L, 4L, 5L), .Label = c("19990224",
"20000224", "20010203", "20019024", "20020201", "sampleIDs"), class = "factor"),
V2 = structure(c(6L, 2L, 3L, 5L, 1L, 4L), .Label = c("29",
"4", "5", "51", "6", "begin"), class = "factor"), V3 = structure(c(6L,
2L, 4L, 5L, 1L, 3L), .Label = c("30", "5", "52", "6", "8",
"finish"), class = "factor")), .Names = c("V1", "V2", "V3"
), row.names = c(NA, 6L), class = "data.frame")
First look at these general suggestions on speeding up code: https://stackoverflow.com/a/8474941/636656
The first thing that jumps out at me is that I’d create only one results matrix. That way you’re not duplicating the
sampleIDs begin finishcolumns, and you can avoid any overhead that comes with running the matching algorithm twice.Doing that, you can avoid selecting more than once (although it’s trivial in terms of speed as long as you store your selection vector rather than re-calculate).
Here’s a solution using
apply:I used
applyand stored everything as numeric. Other approaches would be to return a data.frame and have the sampleIDs and begin, finish be ints. That might avoid some problems with floating point error.This solution assumes there are no boundary cases (e.g. the begin, finish times of myMatrix1 are entirely contained within the begin, finish times of the other). If your data is more complicated, just change the glommer() function. How you want to handle that is a substantive question.