I am given two very large data sets and I’ve been trying to build

Asked: May 24, 2026


I am given two very large data sets, and I’ve been trying to build a function that finds the coordinates in one set that satisfy a condition based on the other set. My problem is that the function I wrote is very slow, and although I’ve been reading answers to similar questions, I haven’t managed to make it work.
So if I am given:

>head(CTSS)    
    V1     V2     V3
1 chr1 564563 564598 
2 chr1 564620 564649
3 chr1 565369 565404
4 chr1 565463 565541
5 chr1 565653 565697
6 chr1 565861 565922

and

> head(href)
   chr      region    start      end strand nu   gene_id transcript_id
1 chr1 start_codon 67000042 67000044      +  . NM_032291     NM_032291
2 chr1         CDS 67000042 67000051      +  0 NM_032291     NM_032291
3 chr1        exon 66999825 67000051      +  . NM_032291     NM_032291
4 chr1         CDS 67091530 67091593      +  2 NM_032291     NM_032291
5 chr1        exon 67091530 67091593      +  . NM_032291     NM_032291
6 chr1         CDS 67098753 67098777      +  1 NM_032291     NM_032291

For each value in the start column of the href data set, I want to find the first two values in the 3rd column of the CTSS data set (sorted in decreasing order) that are less than or equal to it, and keep them in a new data frame.
The loop I wrote:

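y <- CTSS[order(-CTSS$V3), ]  # sort by V3, largest first
find_CTSS <- function(x, y) {
    n <- length(x$start)
    foo <- data.frame(matrix(0, n, 6))
    for (i in seq_len(n))
    {
        # rows of y whose V3 is <= this start; y is sorted, so a[1] is the largest
        a <- which(y$V3 <= x$start[i])
        # href has an `end` column, not `stop`
        foo[i, ] <- c(x$start[i], x$end[i], y$V2[a[1]], y$V3[a[1]], y$V2[a[2]], y$V3[a[2]])
    }
    foo  # return the result rather than printing the whole data frame
}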


1 Answer

Editorial Team · Answer 1 · 2026-05-24T23:47:47+00:00

You provide little data, so it’s a bit hard to benchmark your solution. See whether the solution below meets your needs.

# make some fake data (seed fixed so the example is reproducible)
set.seed(1)
href <- data.frame(start = runif(10), stop = runif(10), other_col = sample(letters, 10))
CTSS <- data.frame(col1 = runif(100), col2 = runif(100))

# for each row in href (but extract only the start and stop columns)
result <- apply(X = href[, c("start", "stop")], MARGIN = 1, FUN = function(x, ctss) {
            criterion <- x["start"] # the threshold for this row
            # keep the values smaller than or equal to the criterion, largest first
            extracted <- sort(ctss[ctss$col2 <= criterion, "col2"], decreasing = TRUE)
            # take the two largest, as (second largest, largest); NA where fewer than two exist
            get.values <- rev(extracted[1:2])
            # put the values in a one-row data frame
            out <- as.data.frame(matrix(get.values, ncol = 2))
            return(out)
        }, ctss = CTSS)

# pancake the list into a data.frame
result <- do.call("rbind", result)
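If the apply() version is still too slow on the full data, one option that might help is to sort the CTSS column once and look up every start with base R's findInterval(), which does a binary search instead of rescanning the whole column for each row. The following is only a minimal sketch under a few assumptions: the helper name two_largest_leq and its argument names are made up for illustration, it ignores the chromosome column (as the code above does), and it returns only the values themselves, not the matching V2 coordinates.

```r
# Hypothetical helper (not from the original post): for each threshold in
# `starts`, return the two largest values of `ends` that are <= the threshold.
two_largest_leq <- function(starts, ends) {
  s <- sort(ends)                  # ascending order, sorted only once
  idx <- findInterval(starts, s)   # number of values in s that are <= each start
  # an index below 1 means "no such value", so return NA there
  pick <- function(i) ifelse(i >= 1, s[pmax(i, 1)], NA_real_)
  data.frame(start = starts,
             second_largest = pick(idx - 1),
             largest = pick(idx))
}

# tiny usage example with made-up numbers
two_largest_leq(starts = c(5, 1, 10), ends = c(2, 4, 4.5, 7, 9))
```

With the data from the question, this would presumably be called as two_largest_leq(href$start, CTSS$V3). To carry the V2 coordinates along as well, one could sort with order() and index the original data frame with the resulting positions instead of sorting the values directly.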
