The Archive Base

Editorial Team
Asked: June 17, 2026

I have a data table with nrow being around a million or two and ncol of about 200.

Each row has a coordinate associated with it.

A tiny portion of the data:

[1,] -2.80331471  -0.8874522 -2.34401863   -3.811584   -2.1292443
[2,]  0.03177716   0.2588624  0.82877467    1.955099    0.6321881
[3,] -1.32954665  -0.5433407 -2.19211837   -2.342554   -2.2142461
[4,] -0.60771429  -0.9758734  0.01558774    1.651459   -0.8137684

Coordinates for the first 4 rows:

9928202 9928251 9928288 9928319

What I would like is a function that, given the data and a window size, returns a data table of the same size with a sliding-window mean applied to each column. In other words, for each row i it would find the rows with coordinates between coords[i]-windsize and coords[i]+windsize and replace the original value with the mean of the values inside that interval (separately for each column).
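
To make the requirement concrete, here is a brute-force sketch of the operation described above. It is only a reference for checking results on small inputs, not a fast solution. The names slidingMeanNaive and h (the distance allowed on either side of coords[i]) are hypothetical and chosen for illustration.

```r
# Brute-force reference: for each row i, average all rows whose coordinate
# lies within [coords[i] - h, coords[i] + h]. Quadratic in the number of
# rows, so suitable for small test inputs only.
slidingMeanNaive <- function(intensities, coords, h) {
  out <- intensities
  for (i in seq_along(coords)) {
    inWindow <- abs(coords - coords[i]) <= h
    # drop = FALSE keeps a matrix even when only one row is selected
    out[i, ] <- colMeans(intensities[inWindow, , drop = FALSE])
  }
  out
}

# Toy data: 4 rows, 2 columns
coords <- c(10, 12, 20, 21)
x <- matrix(c(1, 3, 5, 7,
              2, 4, 6, 8), ncol = 2)
slidingMeanNaive(x, coords, h = 2)
# rows 1-2 average to (2, 3), rows 3-4 average to (6, 7)
```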

Speed is the main issue here.

Here is my first take at such a function.

doSlidingWindow <- function(intensities, coords, windsize) {
  windHalfSize <- ceiling(windsize/2)
  ### whole range inds
  RANGE <- integer(max(coords)+windsize)
  RANGE[coords] <- seq_along(coords)

  ### get indices of rows falling in each window
  COORDS <- as.list(coords)
  WINDOWINDS <- sapply(COORDS, function(crds){ unique(RANGE[(crds-windHalfSize):
      (crds+windHalfSize)]) })

  ### do windowing

  wind_ints <- intensities
  wind_ints[] <- 0
  for(i in 1:length(coords)) {
    wind_ints[i,] <- apply(as.matrix(intensities[WINDOWINDS[[i]],]), 2, mean)
  }
  return(wind_ints)
}

The code before the final for loop is quite fast, and it gives me a list of the row indexes I need for each entry. After that, however, everything falls apart: I have to grind through the for loop a million times, take subsets of my data table, and also make sure that each subset has more than one row so that apply can work on all the columns at once.

My second approach is to put the actual values into the RANGE vector, fill the gaps with zeroes, and run rollmean from the zoo package on each column. But this is wasteful, since rollmean will go through all the gaps and in the end I will only use the values at the original coordinates.

Any help making it faster without going to C would be much appreciated.

1 Answer

  1. Editorial Team
    Added an answer on June 17, 2026 at 12:28 pm

    Data generation:

    N <- 1e5 # rows
    M <- 200 # columns
    W <- 10  # window size
    
    set.seed(1)
    intensities <- matrix(rnorm(N*M), nrow=N, ncol=M)
    coords <- 8000000 + sort(sample(1:(5*N), N))
    

    The original function, with the minor modifications I used for the benchmarks:

    doSlidingWindow <- function(intensities, coords, windsize) {
      windHalfSize <- ceiling(windsize/2)
      ### whole range inds
      RANGE <- integer(max(coords)+windsize)
      RANGE[coords] <- c(1:length(coords)[1])
    
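      ### get indices of rows falling in each window
      ### NOTE: each element of WINDOWINDS may contain 0 (from gaps in RANGE).
      ### Not a big problem though: R ignores 0 when subsetting.
      ### lapply is used so the result is always a list, even if all windows have equal length.
      WINDOWINDS <- lapply(coords, function(crds) unique(RANGE[(crds-windHalfSize):(crds+windHalfSize)]))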
    
      ### do windowing
      wind_ints <- intensities
      wind_ints[] <- 0
      for(i in 1:length(coords)) {
        # CORRECTION: a window with only one row is returned as a plain vector,
        # so it is rebuilt as a matrix before the column means are taken
        wind_ints[i,] <- apply(matrix(intensities[WINDOWINDS[[i]],], ncol=ncol(intensities)), 2, mean)
      }
      return(wind_ints)
    }
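
The single-row problem fixed in the loop above comes from R dropping dimensions when one row is selected. As a minimal sketch (not part of the benchmarked code), subsetting with drop = FALSE keeps the matrix shape, and colMeans can then stand in for apply(..., 2, mean). The small matrix m is made up for illustration.

```r
m <- matrix(1:6, nrow = 3)   # 3 rows, 2 columns: (1,4), (2,5), (3,6)

# Selecting a single row drops the result to a plain vector,
# which is why apply(..., 2, mean) fails on it.
is.null(dim(m[2, ]))

# drop = FALSE keeps the 1 x 2 matrix shape
dim(m[2, , drop = FALSE])

# colMeans works for any number of selected rows; a 0 index is ignored,
# so the zeros stored in WINDOWINDS do no harm here either.
oneRow  <- colMeans(m[c(0, 2), , drop = FALSE])
twoRows <- colMeans(m[c(0, 1, 3), , drop = FALSE])
```

Inside the loop this would read wind_ints[i,] <- colMeans(intensities[WINDOWINDS[[i]], , drop=FALSE]). I have not benchmarked this variant, so the timings below do not apply to it.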
    

    POSSIBLE SOLUTIONS:


    1) data.table

    data.table is known for fast subsetting, but this page (and others related to sliding windows) suggests that this is not the case here. Indeed, the data.table code is elegant but unfortunately very slow:

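    require(data.table)
    require(plyr)   # for aaply
    dt <- data.table(coords, intensities)
    setkey(dt, coords)
    # WINDOWINDS is assumed to be the list of row indices per window,
    # computed as inside doSlidingWindow above
    aaply(1:N, 1, function(i) dt[WINDOWINDS[[i]], sapply(.SD,mean), .SDcols=2:(M+1)])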
    

    2) foreach+doSNOW

    The basic routine is easy to run in parallel over chunks of columns, so we can benefit from that:

    require(doSNOW)
    doSlidingWindow2 <- function(intensities, coords, windsize) {
      NC <- 2 # number of nodes in cluster
      cl <- makeCluster(rep("localhost", NC), type="SOCK")
      registerDoSNOW(cl)
    
      N <- ncol(intensities) # total number of columns
      chunk <- ceiling(N/NC) # number of columns sent to each node
    
      result <- foreach(i=1:NC, .combine=cbind, .export=c("doSlidingWindow")) %dopar% {
        start <- (i-1)*chunk+1
        end   <- ifelse(i!=NC, i*chunk, N)
        doSlidingWindow(intensities[,start:end], coords, windsize)    
      }
    
      stopCluster(cl)
      return (result)
    }
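
One caveat with the start/end arithmetic above: it is fine for 200 columns on 2 nodes, but when the column count is not a multiple of the node count the bounds can run past the last column (for example, 5 columns on 4 nodes gives chunk = 2 and an end of 6 for the third node). A minimal sketch of a more defensive split uses splitIndices from the parallel package that ships with R. The variable names nCols, nNodes and chunks are hypothetical.

```r
library(parallel)  # used here only for splitIndices()

nCols  <- 200   # number of columns, as M in the data generation above
nNodes <- 3     # hypothetical cluster size that does not divide nCols evenly

# One integer vector of column indices per node
chunks <- splitIndices(nCols, nNodes)
lengths(chunks)

# Every column appears exactly once, in the original order
identical(unlist(chunks), seq_len(nCols))
```

Inside the foreach body, each node would then call doSlidingWindow(intensities[, chunks[[i]], drop=FALSE], coords, windsize), with i running over seq_along(chunks).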
    

    The benchmark shows a notable speed-up on my dual-core processor:

    system.time(res <- doSlidingWindow(intensities, coords, W))
    #    user  system elapsed 
    # 306.259   0.204 307.770
    system.time(res2 <- doSlidingWindow2(intensities, coords, W))
    #  user  system elapsed 
    # 1.377   1.364 177.223
    all.equal(res, res2, check.attributes=FALSE)
    # [1] TRUE
    

    3) Rcpp

    Yes, I know you asked for a solution "without going to C". But please take a look anyway. This code is inline and rather straightforward:

    require(Rcpp)
    require(inline)
    doSlidingWindow3 <- cxxfunction(signature(intens="matrix", crds="numeric", wsize="numeric"), plugin="Rcpp", body='
      #include <vector>
      Rcpp::NumericMatrix intensities(intens);
      const int N = intensities.nrow();
      const int M = intensities.ncol();
      Rcpp::NumericMatrix wind_ints(N, M);
    
      std::vector<int> coords = as< std::vector<int> >(crds);
      int windsize = ceil(as<double>(wsize)/2);  
    
      for(int i=0; i<N; i++){
        // Linear search for the window range (begin..end, indices into coords).
        // Assumes coords are sorted, distinct integers, so the window can span
        // at most windsize rows on either side of row i.
        int begin = (i-windsize)<0?0:(i-windsize);
        while(coords[begin]<(coords[i]-windsize)) ++begin;
        int end = (i+windsize)>(N-1)?(N-1):(i+windsize);
        while(coords[end]>(coords[i]+windsize)) --end;
    
        for(int j=0; j<M; j++){
          double result = 0.0;
          for(int k=begin; k<=end; k++){
            result += intensities(k,j);
          }
          wind_ints(i,j) = result/(end-begin+1);
        }
      }
    
      return wind_ints;
    ')
    

    Benchmark:

    system.time(res <- doSlidingWindow(intensities, coords, W))
    #    user  system elapsed 
    # 306.259   0.204 307.770
    system.time(res3 <- doSlidingWindow3(intensities, coords, W))
    #  user  system elapsed 
    # 0.328   0.020   0.351
    all.equal(res, res3, check.attributes=FALSE)
    # [1] TRUE
    

    I hope these results are motivating. As long as the data fits in memory, the Rcpp version is pretty fast. For example, with N <- 1e6 and M <- 100 I got:

       user  system elapsed 
      2.873   0.076   2.951
    

    Naturally, once R starts using swap, everything slows down. With really large data that doesn't fit in memory, you should consider sqldf, ff or bigmemory.
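
For completeness, here is a sketch of a vectorised pure-R alternative that stays within the "without going to C" constraint. It is a different technique from the three options above and was not part of the benchmarks: window bounds come from findInterval, and window sums from per-column prefix sums. The names slidingMeanCumsum and h are hypothetical. As far as I can tell, h = ceiling(windsize/2) corresponds to the window used by doSlidingWindow, and coords is assumed to be sorted.

```r
# Pure-R sketch: window bounds via findInterval(), window sums via prefix sums.
# Assumes coords is sorted in non-decreasing order; h is the distance allowed
# on either side of coords[i].
slidingMeanCumsum <- function(intensities, coords, h) {
  # first[i]: index of the first coordinate >= coords[i] - h
  first <- findInterval(coords - h, coords, left.open = TRUE) + 1L
  # last[i]: index of the last coordinate <= coords[i] + h
  last <- findInterval(coords + h, coords)
  # Per-column prefix sums with a leading row of zeros, so that
  # cs[k + 1, ] holds the sum of rows 1..k
  cs <- rbind(0, apply(intensities, 2, cumsum))
  # Sum of rows first..last, divided by the number of rows in each window
  (cs[last + 1L, , drop = FALSE] - cs[first, , drop = FALSE]) / (last - first + 1L)
}

# Toy data: 4 rows, 2 columns
coords <- c(10, 12, 20, 21)
x <- matrix(c(1, 3, 5, 7,
              2, 4, 6, 8), ncol = 2)
slidingMeanCumsum(x, coords, h = 2)
# rows 1-2 average to (2, 3), rows 3-4 average to (6, 7)
```

Two trade-offs to keep in mind: the prefix-sum matrix is another full-size copy of the data, and differences of long running sums can lose a little floating-point precision, so compare against the loop version with all.equal rather than identical.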
