DATA AND REQUIREMENTS The first table ( myMatrix1 ) is from an old geological

Asked: May 27, 2026

DATA AND REQUIREMENTS

The first table (myMatrix1) is from an old geological survey that used different region boundaries (the begin and finish columns) from the newer survey (myMatrix2).
What I wish to do is match the begin and finish boundaries and then create two tables: one for the new data on sedimentation, and one for the new data on bore width, characterised as a boolean.

myMatrix1 <- read.table("/path/to/file")
myMatrix2 <- read.table("/path/to/file")

> head(myMatrix1)  # this is the old data

    sampleIDs begin finish   
1    19990224 4     5 
2    20000224 5     6 
3    20010203 6     8 
4    20019024 29    30 
5    20020201 51    52 

> head(myMatrix2)   # this is the new data

     begin finish  sedimentation    boreWidth
1    0     10       1.002455        0.014354
2    11    367      2.094351        0.056431
3    368   920      0.450275        0.154105
4    921   1414     2.250820        1.004353
5    1415  5278     0.114109        NA

Desired output:

> head(myMatrix6)

    sampleIDs begin finish  sedimentation #myMatrix4
1    19990224 4     5       1.002455
2    20000224 5     6       1.002455
3    20010203 6     8       2.094351
4    20019024 29    30      2.094351
5    20020201 51    52      2.094351

> head(myMatrix7)

    sampleIDs begin finish  boreWidthThresh #myMatrix5
1    19990224 4     5       FALSE
2    20000224 5     6       FALSE
3    20010203 6     8       FALSE
4    20019024 29    30      FALSE
5    20020201 51    52      FALSE

CODE

The following code has taken me several hours to run on my dataset (about 5 million data points). Is there any way to change the code to make it run any faster?

# create empty matrix for sedimentation
myMatrix6 <- data.frame(NA,NA,NA,NA)[0,]
names(myMatrix6) <- letters[1:4]

# create empty matrix for bore
myMatrix7 <- data.frame(NA,NA,NA,NA)[0,]
names(myMatrix7) <- letters[1:4]

for (i in 1:nrow(myMatrix2))
{       
    # select the rows of myMatrix1 whose begin value lies strictly
    # between myMatrix2$begin[i] and myMatrix2$finish[i]
    myMatrix3 <- myMatrix1[which((myMatrix1$begin > myMatrix2$begin[i]) & (myMatrix1$begin < myMatrix2$finish[i])),]

    # repeat the sedimentation value of row i once per matched row
    myMatrix4 <- rep(myMatrix2$sedimentation[i], nrow(myMatrix3))

    if (is.na(myMatrix2$boreWidth[i])) {
        myMatrix5 <- rep(NA, nrow(myMatrix3))
    }
    else if (myMatrix2$boreWidth[i] == 0) {
        myMatrix5 <- rep(TRUE, nrow(myMatrix3))
    }
    else if (myMatrix2$boreWidth[i] > 0) {
        myMatrix5 <- rep(FALSE, nrow(myMatrix3))
    }

    myMatrix6 <- rbind(myMatrix6, cbind(myMatrix3, myMatrix4))
    myMatrix7 <- rbind(myMatrix7, cbind(myMatrix3, myMatrix5))
}

EDIT:

> dput(head(myMatrix2))

structure(list(V1 = structure(c(6L, 1L, 2L, 4L, 5L, 3L), .Label = c("0", 
"11", "1415", "368", "921", "begin"), class = "factor"), V2 = structure(c(6L, 
1L, 3L, 5L, 2L, 4L), .Label = c("10", "1414", "367", "5278", 
"920", "finish"), class = "factor"), V3 = structure(c(6L, 3L, 
4L, 2L, 5L, 1L), .Label = c("0.114109", "0.450275", "1.002455", 
"2.094351", "2.250820", "sedimentation"), class = "factor"), 
    V4 = structure(c(5L, 1L, 2L, 3L, 4L, 6L), .Label = c("0.014354", 
    "0.056431", "0.154105", "1.004353", "boreWidth", "NA"), class = "factor")), .Names = c("V1", 
"V2", "V3", "V4"), row.names = c(NA, 6L), class = "data.frame")

> dput(head(myMatrix1))

structure(list(V1 = structure(c(6L, 1L, 2L, 3L, 4L, 5L), .Label = c("19990224", 
"20000224", "20010203", "20019024", "20020201", "sampleIDs"), class = "factor"), 
    V2 = structure(c(6L, 2L, 3L, 5L, 1L, 4L), .Label = c("29", 
    "4", "5", "51", "6", "begin"), class = "factor"), V3 = structure(c(6L, 
    2L, 4L, 5L, 1L, 3L), .Label = c("30", "5", "52", "6", "8", 
    "finish"), class = "factor")), .Names = c("V1", "V2", "V3"
), row.names = c(NA, 6L), class = "data.frame")

1 Answer

Editorial Team · Answer 1 · 2026-05-27T02:51:18+00:00

First look at these general suggestions on speeding up code: https://stackoverflow.com/a/8474941/636656

The first thing that jumps out at me is that I’d create only one results matrix. That way you’re not duplicating the sampleIDs, begin and finish columns, and you avoid the overhead of running the matching algorithm twice.

Doing that, you also avoid selecting rows more than once (although that cost is trivial as long as you store your selection vector rather than recalculating it).
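As a rough sketch of that idea (not a drop-in replacement for your code): compute the selection vector once per interval, attach both derived columns to the same piece, and bind everything together once at the end instead of growing two data frames with rbind() inside the loop. The names sel, pieces and result are made up for illustration, the sample values are taken from your tables, and I am assuming an inclusive containment test on both begin and finish, which may not be the rule you want.

```r
# Small stand-ins for the real tables (values copied from the question)
myMatrix1 <- data.frame(sampleIDs = c(19990224, 20000224, 20010203),
                        begin     = c(4, 5, 29),
                        finish    = c(5, 6, 30))
myMatrix2 <- data.frame(begin         = c(0, 11),
                        finish        = c(10, 367),
                        sedimentation = c(1.002455, 2.094351),
                        boreWidth     = c(0.014354, NA))

# Pre-allocate a list; growing a data.frame with rbind() in a loop
# copies the whole object on every iteration
pieces <- vector("list", nrow(myMatrix2))

for (i in seq_len(nrow(myMatrix2))) {
  # selection vector, computed once and reused (assumption: inclusive bounds)
  sel <- myMatrix1$begin  >= myMatrix2$begin[i] &
         myMatrix1$finish <= myMatrix2$finish[i]

  piece <- myMatrix1[sel, ]
  piece$sedimentation   <- rep(myMatrix2$sedimentation[i], nrow(piece))
  # NA == 0 is NA, so a missing boreWidth stays NA without an if/else chain
  piece$boreWidthThresh <- rep(myMatrix2$boreWidth[i] == 0, nrow(piece))
  pieces[[i]] <- piece
}

# one rbind at the end, one table holding both new columns
result <- do.call(rbind, pieces)
```

Note that a negative boreWidth would come out as FALSE here, whereas the original if/else chain has no branch for it; adjust the comparison if that matters for your data.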

Here’s a solution using apply:

myMatrix1 <- data.frame(sampleIDs=c(19990224,20000224),begin=c(4,5),finish=c(5,6))
myMatrix2 <- data.frame(begin=c(0,11),finish=c(10,367),sed=c(1.002,2.01),boreWidth=c(.014,.056))

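# x is one row of myMatrix1, passed by apply() as a named numeric vector
glommer <- function(x,myMatrix2) {
  # look up the myMatrix2 row whose interval contains [begin, finish] of x
  x[4:5] <- as.numeric(myMatrix2[ myMatrix2$begin <= x["begin"] & myMatrix2$finish >= x["finish"], c("sed","boreWidth") ])
  names(x)[4:5] <- c("sed","boreWidth")
  return( x )
}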

> t(apply( myMatrix1, 1, glommer, myMatrix2=myMatrix2))
     sampleIDs begin finish   sed boreWidth
[1,]  19990224     4      5 1.002     0.014
[2,]  20000224     5      6 1.002     0.014

I used apply and stored everything as numeric. Another approach would be to return a data.frame and keep sampleIDs, begin and finish as integers, which might avoid some problems with floating point error.
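A minimal sketch of that data.frame variant, assuming the same toy tables as above but with the ID and boundary columns stored as integers. The helper name idx is my own, and returning NA when a row does not match exactly one interval is just one possible choice.

```r
# Same toy data as above, with integer IDs and boundaries
myMatrix1 <- data.frame(sampleIDs = c(19990224L, 20000224L),
                        begin     = c(4L, 5L),
                        finish    = c(5L, 6L))
myMatrix2 <- data.frame(begin     = c(0L, 11L),
                        finish    = c(10L, 367L),
                        sed       = c(1.002, 2.01),
                        boreWidth = c(0.014, 0.056))

# For each row of myMatrix1, find the index of the containing interval
idx <- vapply(seq_len(nrow(myMatrix1)), function(k) {
  hit <- which(myMatrix2$begin  <= myMatrix1$begin[k] &
               myMatrix2$finish >= myMatrix1$finish[k])
  # assumption: anything other than exactly one match is reported as NA
  if (length(hit) == 1L) hit else NA_integer_
}, integer(1))

# Build a data.frame so each column keeps its own type
result <- data.frame(myMatrix1,
                     myMatrix2[idx, c("sed", "boreWidth")],
                     row.names = NULL)
```

Because the result is a data.frame rather than a numeric matrix, sampleIDs, begin and finish stay integer while sed and boreWidth stay double.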

This solution assumes there are no boundary cases, i.e. that each begin, finish interval in myMatrix1 is entirely contained within a single begin, finish interval of myMatrix2. If your data is more complicated (for example, a sample that straddles two regions), just change the glommer() function. How you want to handle those cases is a substantive question.
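If the new regions are sorted and do not overlap (an assumption I cannot check from the sample), one way to handle the boundary cases without looping is base R's findInterval(): look up the region from begin, then mark any sample whose finish runs past that region as NA. This is only a sketch; idx and contained are illustrative names, and flagging straddling samples as NA is one policy among several.

```r
# Toy data; the second sample (9 to 12) deliberately straddles two regions
myMatrix1 <- data.frame(sampleIDs = c(19990224, 20000224, 20010203),
                        begin     = c(4, 9, 29),
                        finish    = c(5, 12, 30))
myMatrix2 <- data.frame(begin         = c(0, 11, 368),
                        finish        = c(10, 367, 920),
                        sedimentation = c(1.002455, 2.094351, 0.450275),
                        boreWidth     = c(0.014354, 0.056431, NA))

# findInterval() requires sorted break points
myMatrix2 <- myMatrix2[order(myMatrix2$begin), ]

# index of the last region whose begin is <= the sample's begin
idx <- findInterval(myMatrix1$begin, myMatrix2$begin)
idx[idx == 0L] <- NA  # sample starts before the first region

# keep a match only if the sample also ends inside that region
contained <- !is.na(idx) & myMatrix1$finish <= myMatrix2$finish[idx]
idx[!contained] <- NA

result <- data.frame(myMatrix1,
                     sedimentation   = myMatrix2$sedimentation[idx],
                     boreWidthThresh = myMatrix2$boreWidth[idx] == 0)
```

Everything here is vectorised, so it should scale to millions of rows far better than a row-by-row loop, though I have not timed it on data of that size.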
