There are two numeric columns in a data file. I need to calculate the average of the second column by intervals (such as 100) of the first column.
I can program this task in R, but my R code is really slow for a relatively large data file (millions of rows, with the value of first column changing between 1 to 33132539).
Here I show my R code. How could I tune it to be faster? Other solutions that are perl, python, awk or shell based are appreciated.
Thanks in advance.
(1) my data file (tab-delimited, millions of rows)
5380 30.07383\n
5390 30.87\n
5393 0.07383\n
5404 6\n
5428 30.07383\n
5437 1\n
5440 9\n
5443 30.07383\n
5459 6\n
5463 30.07383\n
5480 7\n
5521 30.07383\n
5538 0\n
5584 20\n
5673 30.07383\n
5720 30.07383\n
5841 3\n
5880 30.07383\n
5913 4\n
5958 30.07383\n
(2) what I want to get, here interval = 100
intervals_of_first_columns, average_of_2nd column_by_the_interval
100, 0\n
200, 0\n
300, 20.34074\n
400, 14.90325\n
.....
(3) R code
chr1 <- 33132539 # set the limit for the interval
window <- 100 # set the size of interval
spe <- read.table("my_data_file", header=F) # read my data in
names(spe) <- c("pos", "rho") # name my data
interval.chr1 <- data.frame(pos=seq(0, chr1, window)) # setup intervals
meanrho.chr1 <- NULL # object for the mean I want to get
# real calculation, really slow on my own data.
for(i in 1:nrow(interval.chr1)){
count.sub<-subset(spe, chrom==1 & pos>=interval.chr1$pos[i] & pos<=interval.chr1$pos[i+1])
meanrho.chr1[i]<-mean(count.sub$rho)
}
You don’t really need to set up an output data.frame but you can if you want. Here is how I would have coded it, and I guarantee it will be fast.
You could have done even less setup (skip the incrmt variable with this code:
And if you want the result to be available for something: