Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 6235199
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: May 24, 20262026-05-24T10:37:45+00:00 2026-05-24T10:37:45+00:00

Executive summary: *Improve the memory efficiency of the date-based extraction functions used in aggregate

  • 0

Executive summary: *Improve the memory efficiency of the date-based extraction functions used in aggregate fn calls below; to not blow out the 1Gb memory limit. *.

I have a large dataset tr stored in a data-frame (3 cols, 12 million rows; ~200Mb)
The columns are customer_id (integer), visit_date and visit_spend(numeric) The dataset requires registration, so this is as reproducible as it can be:
Dataset looks like (full file is here, requires registration):

customer_id,visit_date,visit_spend 
2,2010-04-01,5.97 
2,2010-04-06,12.71 
2,2010-04-07,34.52
#... 12146637 rows in total

The date range is restricted between 2010-04-01 … 2011-06-30 (14700..15155 in integer)

Here I’m asking what is the optimal representation to choose for the visit_date field. I do some aggregate calls (example code at bottom) that blow up the memory. I also use date utility fns something like what is attached at bottom here (they will need recoding for compactness, but these are the typical operations I want to do a lot of). So I need a representation for the date that avoids this.

As I see it there are three possible representations I could use for visit_date field, here are the pros and cons, wrt what I am trying to do.
My aim is to get the format which does not blow up memory, and gives the least grief during these date handling operations, aggregate etc.:

  • integer or factor
    Cons:
    1) doesn’t allow comparison or sort operations, hence aggregation is painful.
    2) I would need to hardcode all the date-related functions (e.g. 14700..14729 for Apr 2010) – doable but painful.
    3) Needs manual handling for graphs.
  • numeric
    Cons: blows up memory due to requiring asDate() everywhere.
  • Date
    Pros: most readable for print(), graphs and histograms; does not need manual handling treatment before graphing.
    Cons: blows up (out-of-memory fail) if I apply any string operations (format), or aggregate.
    I think that’s chron::Date, it’s whatever you get when you set class(tr$visit_date)<-'Date') or use read.csv(colClasses=c(...,"Date",...)

These are the date utility fns I want to run a lot of (but currently they blow up during aggregate):

# Utility fns related to date    
library(chron)    
# Get dayoftheweek as integer 0(Sun)..6(Sat)
dayofweek <- function(ddd) {
    with( month.day.year(ddd), day.of.week(month,day,year) )
    #do.call( "day.of.week", month.day.year(x) )
    # as.numeric(x-3)%%7
}    
weekofyear <- function(dt) {
    as.numeric(format(as.Date(dt), "%W"))
}
monthofyear <- function(dt) {
    as.numeric(format(as.Date(dt), "%m"))
}    
# Append a dayoftheweek-as-string field
append_dotwa_column <- function(x,from_date_fld) {
    x$dotwa <- format(x$from_date_fld,"%a")
}

and here’s just one aggregate() call that fails out-of-memory:

agg_dly_spend  <- aggregate(tr$visit_spend, 
        list('visit_date'=tr$visit_date), FUN="sum")
agg_wkly_spend <- aggregate(tr$visit_spend, 
        list('weekofyear'=weekofyear(tr$visit_date)), FUN="sum")

(How much memory should those aggregate() calls take?
Correct me if I’m wrong but the mixed-types make it hard to use bigmemory. So I may have to go to SQL, but that’s a big loss – I lose R’s really nice subsetting-by-date: tr[tr$visit_date > "2010-09-01",])

(Platform is R 2.13.1, Windows Vista 32b so there is a 2Gb overall process limit, which means any data-frame should not exceed ~600-900Mb)

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-05-24T10:37:46+00:00Added an answer on May 24, 2026 at 10:37 am

    EDIT : The code I copied were not the final functions, so there were bugs in it. Bugs now fixed.

    I do not completely agree with the votes to close, but your question does need some reading. As I understood it, the problem is the representation of the date. Numeric is just a crazy idea, use integer in that case. Just as an overview of the different formats and their relative space (using the lsos function from this question🙂

    Dates <- rep(seq.Date(as.Date("2010-04-01"),
                 as.Date("2011-04-01"),by='day'),each=100)
    
    chardates <- as.character(Dates)
    POSIXctDates <- as.POSIXct(Dates)
    POSIXltDates <- as.POSIXlt(Dates)
    integerDates <- as.integer(Dates)
    chronDates <- as.chron(Dates)
    numericDates <- as.numeric(Dates)
    
     > lsos()
                      Type    Size PrettySize  Rows Columns
    POSIXltDates   POSIXlt 1464976     1.4 Mb 36600      NA
    chronDates       dates  293400   286.5 Kb 36600      NA
    POSIXctDates   POSIXct  292976   286.1 Kb 36600      NA
    Dates             Date  292944   286.1 Kb 36600      NA
    numericDates   numeric  292824     286 Kb 36600      NA
    charDates    character  161064   157.3 Kb 36600      NA
    integerDates   integer  146424     143 Kb 36600      NA
    

    The internal Date can compete pretty well with the numeric representation. character causes trouble with all the rest of the functionality, so forget about that one. I just use Date, that’ll do and keeps the functionality OK. Pay attention to the size of POSIXlt: All functions for extraction of months, weeks, day of the year etc. go over that format. That’s true for format(), and for the functions weekdays(), months(), … in either the base or the chron package.

    Some remarks :

    • Vista has a theoretical 3Gb limit on a machine with 4Gb. did you try memory.limit(3000) : See ?memory.limit.
    • Please update asap to Windows 7 if you care anything at all about decent memory management. Vista just sucks. Big time. If possible, get a 64bit machine if you work with that data.
    • Clean up your workspace once in a while. I can run your code without problem on R 32bit with a 2Gb memory limit on the dataframe given below (14+ million rows).

    On to your code. I work on with the Date format, which is about the same size as the numeric format. Let’s try it with following data (which you could have provided…) with 14.6 million rows. I run a Windows7 (64bit) with 4Gb memory in total.

    n <- 40000
    tr <- data.frame(
      customer_id = rep(1:366,n),
      visit_date = rep(seq.Date(as.Date("2010-04-01"),
                         as.Date("2011-04-01"),by='day'),each=n),
      visit_spend =runif(366*n,3,12)
     )
    

    First your weekofyear function. As said, The format function uses the underlying POSIXlt format, which is, as shown, memory-intensive. Still, you can cut out about half of the memory by just accessing it directly (see ?POSIXlt). It returns integers, which take about half the memory of the numerics you return :

    dayofweek2 <- function(dt) {
      as.POSIXlt(dt)$wday
    }
    
    monthofyear2 <- function(dt){
      as.POSIXlt(dt)$mon +1L
    }
    
    weekofyear2 <- function(dt) {
      n <- length(dt)
      lt <- as.POSIXlt(dt)
      yearstart <- c(0,which(diff(lt$year)!=0))
      fday <- rep(lt[yearstart+1L]$wday,
                  diff(c(yearstart,n)+1L)
              )
      as.integer((lt$yday+fday-1L)%/%7)
    }
    
    > lsos()
                       Type    Size PrettySize  Rows Columns
    tr           data.frame  733128   715.9 Kb 36600       3
    Dates              Date  293048   286.2 Kb 36600      NA
    x1              numeric  292840     286 Kb 36600      NA # from your function
    x2              integer  146440     143 Kb 36600      NA # from my function
    

    If you need even less, you’ll have to do the math yourself. But I advise you not to try that out, and definitely not based on character representation. string operations like strsplit() and substr() will blow up your memory for sure. As does the month.day.year() function of the chron package. Stay away from chron with big data. In fact, regardless of the huge space the POSIXlt objects need, using POSIXlt is still the best option memory-wise for extraction.

    On to the aggregate. This is meant for dataframes, and hence aggregate call makes again a lot of copies of the data. Doing the call more manually, can save again on the copies. A proposal for a function :

    my.agg <- function(x,d,AGGFUN=NULL,FUN="sum"){
      FUN <- match.fun(FUN)
      if(!is.null(AGGFUN)) {
        AGGFUN <- match.fun(AGGFUN)
        d <- AGGFUN(d)
      }
      ud <- sort(unique(d))
    
      xout <- sapply(ud,function(i){
        id <- which(d==i)  # find which values respond to a certain value of d
        FUN(x[id])         # apply the function to only those values
      },USE.NAMES=FALSE)
    
      data.frame(d=ud,x=xout)    
    }
    

    Now if we apply this and we watch the memory usage :

    gc(TRUE,reset=TRUE)
    x1 <-  aggregate(tr$visit_spend, 
               list('weekofyear'=weekofyear(tr$visit_date)), FUN="sum")
    rm(x1)
    gc(TRUE,reset=TRUE)
    Sys.sleep(5)
    x2 <- my.agg(tr$visit_spend,tr$visit_date,weekofyear2,sum)    
    gc(TRUE,reset=TRUE)
    

    , I get the following result :

    enter image description here

    The red square is your aggregate call, the yellow square is my proposal. The first bump in the memory usage of you aggregate call is the weekofyear function you use. My proposal saves both on the memory usage of weekofyear and of the aggregate call, and runs quite a bit faster too. I never got over 2.6Gb total memory using my proposal.

    Hope this helps.

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

Executive Summary: When assertion errors are thrown in the threads, the unit test doesn't
Executive summary: how do I install a new root certificate into Java using Java
Executive summary: I need a way to determine whether a Windows process I've spawned
(See below solution I created using the answer I accepted) I'm trying to improve
Executive summary: How can I define an arbitrarily-sized 2D array in C? How can
Summary: Running mvn war:war fails with errors including: The following artifacts could not be
Executive Summary context.Response.Header.Add("Content-Type", "application/json"); gets ignored while context.Response.ContentType = "application/json"; breaks jQuery ajax calls.
Even after introducing an intentional memory leak valgrind shows: ==13483== HEAP SUMMARY: ==13483== in
Executive Summary: preg_replace() ran faster than string comparisons. Why? Shouldn't regular expressions be slower?
Executing the below code gives me the following exception on the last line: InvalidOperationException:

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.