The Archive Base

Editorial Team
Asked: June 8, 2026

I have a data frame, full, from which I want to take the last column and the column at index v. I then want to sort both columns by column v as fast as possible. full is read in from a CSV file, but the following can be used for testing (it includes some NAs for realism):

n <- 200000
full <- data.frame(A = runif(n, 1, 10000), B = floor(runif(n, 0, 1.9)))
full[sample(n, 10000), 'A'] <- NA
v <- 1

Here v is 1, but in practice it will vary, and full has many columns.


I have tried sorting data frames, data tables, and matrices, each with order and sort.list (some ideas taken from this thread). The code for all of these:

# DATA FRAME

ord_df <- function() {
  a <- full[c(v, length(full))]
  a[with(a, order(a[1])), ]
}

sl_df <- function() {
  a <- full[c(v, length(full))]
  a[sort.list(a[[1]]), ] 
}


# DATA TABLE

require(data.table)

ord_dt <- function() {
  a <- as.data.table(full[c(v, length(full))])
  colnames(a)[1] <- 'values'
  a[order(values)]
}

sl_dt <- function() {
 a <- as.data.table(full[c(v, length(full))])
 colnames(a)[1] <- 'values'
 a[sort.list(values)]
}


# MATRIX

ord_mat <- function() {
  a <- as.matrix(full[c(v, length(full))])
  a[order(a[, 1]), ] 
}

sl_mat <- function() {
  a <- as.matrix(full[c(v, length(full))])
  a[sort.list(a[, 1]), ] 
}

Timing results (elapsed time from system.time, in seconds, summarized over 10 runs):

         ord_df  sl_df    ord_dt   sl_dt   ord_mat sl_mat
Min.     0.230   0.1500   0.1300   0.120   0.140   0.1400
Median   0.250   0.1600   0.1400   0.140   0.140   0.1400
Mean     0.244   0.1610   0.1430   0.136   0.142   0.1450
Max.     0.250   0.1700   0.1600   0.140   0.160   0.1600

Or using microbenchmark (results are in milliseconds):

       expr      min       lq   median       uq      max
1  ord_df() 243.0647 248.2768 254.0544 265.2589 352.3984
2  ord_dt() 133.8159 140.0111 143.8202 148.4957 181.2647
3 ord_mat() 140.5198 146.8131 149.9876 154.6649 191.6897
4   sl_df() 152.6985 161.5591 166.5147 171.2891 194.7155
5   sl_dt() 132.1414 139.7655 144.1281 149.6844 188.8592
6  sl_mat() 139.2420 146.8578 151.6760 156.6174 186.5416

Sorting the data table appears to be fastest, with the matrix versions close behind. There isn’t much difference between order and sort.list, except with data frames, where sort.list is much faster.
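
One detail worth separating out in the data frame comparison: ord_df passes a[1] (a one-column data frame) to order, while sl_df passes a[[1]] (a plain vector) to sort.list. The sketch below, on a smaller version of the test data, applies both functions to the plain vector and checks that they return the same permutation. Whether the data frame argument accounts for the timing gap is an assumption, not something measured here.

```r
set.seed(1)
n <- 1000
full <- data.frame(A = runif(n, 1, 10000), B = floor(runif(n, 0, 1.9)))
full[sample(n, 50), "A"] <- NA
v <- 1

# Extract the sort column as a plain numeric vector
x <- full[[v]]

# Both calls return a permutation of row indices, with NAs placed last by default
perm_order <- order(x)
perm_sl <- sort.list(x)
same_perm <- identical(perm_order, perm_sl)

# Apply the permutation to column v and the last column only
sorted <- full[perm_order, c(v, length(full))]
```

If same_perm is TRUE on the real data as well, the two functions are interchangeable for this task and the choice can be made on timing alone.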

In the data table versions I also tried setting column v as the key (since, according to the documentation, the table is then sorted by it), but I couldn’t get it to work because the contents of column v are not integers.
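
For reference, a minimal sketch of keying on a non-integer column. It assumes a recent data.table release, in which setkey and setkeyv accept double columns; older releases may behave as described above. The smaller n and the variable key_col are illustrative choices, not part of the original code.

```r
library(data.table)

set.seed(1)
n <- 1000
full <- data.frame(A = runif(n, 1, 10000), B = floor(runif(n, 0, 1.9)))
full[sample(n, 50), "A"] <- NA
v <- 1

# Keep column v and the last column, as in the functions above
a <- as.data.table(full[c(v, length(full))])

# setkeyv() takes the column name as a string, so no renaming is needed when v changes
key_col <- names(a)[1]
setkeyv(a, key_col)   # sorts a by reference; NAs are placed first in a keyed table
```

Note that a keyed table puts NAs first, whereas order puts them last by default, so the two results differ in where the NA rows end up.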

Ideally I would like to speed this up as much as possible, since I have to do it many times for different values of v. Does anyone know how I might speed this process up further? Would an Rcpp implementation be worth trying? Thanks.


Here’s the code I used for timing if it’s useful to anyone:

sortMethods <- list(ord_df, sl_df, ord_dt, sl_dt, ord_mat, sl_mat)

require(plyr)
timings <- raply(10, sapply(sortMethods, function(x) system.time(x())[[3]]))
colnames(timings) <- c('ord_df', 'sl_df', 'ord_dt', 'sl_dt', 'ord_mat', 'sl_mat')
apply(timings, 2, summary) 

require(microbenchmark)
mb <- microbenchmark(ord_df(), sl_df(), ord_dt(), sl_dt(), ord_mat(), sl_mat())
plot(mb)
1 Answer

    Editorial Team
    Added an answer on June 8, 2026 at 3:16 am

    I don’t know whether this sort of thing is better added as an edit, but it seems more like an answer, so I am posting it here. Updated test functions:

    library(data.table)  # must be loaded before as.data.table() and setnames() below

    n <- 1e7
    full <- data.frame(A = runif(n, 1, 10000), B = floor(runif(n, 0, 1.9)))
    full[sample(n, 100000), 'A'] <- NA

    # Convert once, outside the timed functions, so only the sort is measured
    fdf <- full
    fma <- as.matrix(full)
    fdt <- as.data.table(full)
    setnames(fdt, colnames(fdt)[1], 'values')

    # DATA FRAME
    ord_df <- function() { fdf[order(fdf[1]), ] }
    sl_df <- function() { fdf[sort.list(fdf[[1]]), ] }

    # DATA TABLE
    ord_dt <- function() { fdt[order(values)] }

    # setkey() sorts fdt by reference, so fdt stays sorted after the first call
    key_dt <- function() {
      setkey(fdt, values)
      fdt
    }

    # MATRIX
    ord_mat <- function() { fma[order(fma[, 1]), ] }
    sl_mat <- function() { fma[sort.list(fma[, 1]), ] }
    

    Results (using a different computer, R 2.13.1 and data.table 1.8.2):

             ord_df  sl_df   ord_dt  key_dt  ord_mat sl_mat
    Min.     37.56   20.86   2.946   2.249   20.22   20.21
    1st Qu.  37.73   21.15   2.962   2.255   20.54   20.59
    Median   38.43   21.74   3.002   2.280   21.05   20.82
    Mean     38.76   21.75   3.074   2.395   21.09   20.95
    3rd Qu.  39.85   22.18   3.151   2.445   21.48   21.42
    Max.     40.36   23.08   3.330   2.797   22.41   21.84
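
To make the gaps in the table easier to read, the short sketch below divides each reported median by the key_dt median. The numbers are copied from the Median row above; the variable names are illustrative only.

```r
# Median timings copied from the results table above
medians <- c(ord_df = 38.43, sl_df = 21.74, ord_dt = 3.002,
             key_dt = 2.280, ord_mat = 21.05, sl_mat = 20.82)

# How many times slower each method is than key_dt, rounded to one decimal place
slowdown <- round(medians / medians[["key_dt"]], 1)
print(slowdown)
```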
    

    [Figure: Sorting]

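    So data.table is the clear winner: by median time, both data.table versions are roughly seven to nine times faster than sl_df and the matrix versions. Setting a key is faster than ordering and, I’d argue, has nicer syntax as well. Thanks for the help, everyone.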
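
Since the original question needs this sort repeated for different values of v, here is a minimal sketch of one way to wrap it. The helper name sort_pair and the tiny example table are hypothetical. It uses setorderv rather than setkey so that NAs can be placed last (as order does), and it sorts a copy so the source table is left unchanged between calls.

```r
library(data.table)

# Hypothetical helper: returns column v and the last column of dt, sorted by column v
sort_pair <- function(dt, v) {
  cols <- unique(names(dt)[c(v, ncol(dt))])
  # copy() ensures the in-place sort below cannot touch dt
  out <- copy(dt[, cols, with = FALSE])
  setorderv(out, cols[1], na.last = TRUE)
  out
}

# Small illustrative table standing in for the full data
fdt <- data.table(A = c(3, NA, 1, 2), B = c(0, 1, 1, 0), C = c(9, 8, 7, 6))
res_a <- sort_pair(fdt, 1)   # columns A and C, sorted by A
res_b <- sort_pair(fdt, 2)   # columns B and C, sorted by B
```

The copy costs some time on large tables. If the source table may be reordered freely, calling setorderv or setkeyv directly on it avoids that cost.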
