Executive summary: *Improve the memory efficiency of the date-based extraction functions used in the aggregate() calls below, so that they do not blow out the 1Gb memory limit.*
I have a large dataset tr stored in a data frame (3 columns, 12 million rows; ~200Mb).
The columns are customer_id (integer), visit_date and visit_spend (numeric).
The dataset looks like this (the full file is here, but requires registration, so this is as reproducible as it can be):
customer_id,visit_date,visit_spend
2,2010-04-01,5.97
2,2010-04-06,12.71
2,2010-04-07,34.52
#... 12146637 rows in total
The date range is restricted to 2010-04-01 … 2011-06-30 (14700..15155 as integers).
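The integer equivalents quoted above are the day counts since 1970-01-01 that R stores inside a Date. A short base-R check, as a sketch (the variable names rng, rng_int and back are made up for illustration):

```r
# A Date is stored as the number of days since 1970-01-01
rng <- as.Date(c("2010-04-01", "2011-06-30"))
rng_int <- as.integer(rng)
print(rng_int)  # 14700 15155

# Converting the integers back to Date needs the origin
back <- as.Date(rng_int, origin = "1970-01-01")
print(back)
```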
Here I’m asking what the optimal representation for the visit_date field is. I do some aggregate() calls (example code at bottom) that blow up the memory. I also use date utility functions like the ones attached at the bottom (they will need recoding for compactness, but these are the typical operations I want to do a lot of). So I need a representation for the date that avoids this.
As I see it, there are three possible representations I could use for the visit_date field. Here are the pros and cons with respect to what I am trying to do.
My aim is to find the format which does not blow up memory and gives the least grief during these date-handling operations, aggregate etc.:
- integer or factor
  Cons:
  1) Doesn’t allow comparison or sort operations, hence aggregation is painful.
  2) I would need to hardcode all the date-related functions (e.g. 14700..14729 for Apr 2010); doable but painful.
  3) Needs manual handling for graphs.
- numeric
  Cons: blows up memory due to requiring as.Date() everywhere.
- Date
  Pros: most readable for print(), graphs and histograms; does not need manual handling before graphing.
  Cons: blows up (out-of-memory fail) if I apply any string operations (format), or aggregate.
I think that’s chron::Date; it’s whatever you get when you set class(tr$visit_date) <- 'Date' or use read.csv(colClasses=c(...,"Date",...)).
These are the date utility fns I want to run a lot of (but currently they blow up during aggregate):
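# Utility fns related to date
library(chron)
# Get day of the week as integer 0(Sun)..6(Sat)
dayofweek <- function(ddd) {
  with(month.day.year(ddd), day.of.week(month, day, year))
  # Alternatives considered:
  # do.call("day.of.week", month.day.year(ddd))
  # as.numeric(ddd - 3) %% 7
}
# Week of the year as a number, parsed from the "%W" string (weeks start on Monday)
weekofyear <- function(dt) {
  as.numeric(format(as.Date(dt), "%W"))
}
# Month of the year as a number 1..12, parsed from the "%m" string
monthofyear <- function(dt) {
  as.numeric(format(as.Date(dt), "%m"))
}
# Append a day-of-the-week-as-string column.
# from_date_fld is the name of the date column, passed as a string.
append_dotwa_column <- function(x, from_date_fld) {
  x$dotwa <- format(x[[from_date_fld]], "%a")
  x  # return the modified data frame
}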
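The commented-out arithmetic alternative in dayofweek could look roughly like the sketch below, assuming the input is a Date or its integer day count. The name dayofweek_int is hypothetical. The offset of 4 comes from 1970-01-01 (day 0) being a Thursday; adding 4 is equivalent modulo 7 to the subtraction of 3 in the comment.

```r
# Day of week as integer 0 (Sun) .. 6 (Sat), using only integer arithmetic.
# Assumes d is a Date or an integer count of days since 1970-01-01.
dayofweek_int <- function(d) {
  (as.integer(d) + 4L) %% 7L
}

d <- as.Date(c("2010-04-01", "2010-04-04", "2010-04-10"))  # Thu, Sun, Sat
dow <- dayofweek_int(d)
print(dow)  # 4 0 6
```

This allocates a single integer vector and never builds month/day/year lists or strings.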
And here are two of the aggregate() calls that fail with out-of-memory errors:
agg_dly_spend <- aggregate(tr$visit_spend,
                           list('visit_date'=tr$visit_date), FUN="sum")
agg_wkly_spend <- aggregate(tr$visit_spend,
                            list('weekofyear'=weekofyear(tr$visit_date)), FUN="sum")
(How much memory should those aggregate() calls take?
Correct me if I’m wrong, but the mixed types make it hard to use bigmemory. So I may have to go to SQL, but that’s a big loss: I lose R’s really nice subsetting-by-date, e.g. tr[tr$visit_date > "2010-09-01",])
(Platform is R 2.13.1 on 32-bit Windows Vista, so there is a 2Gb overall process limit, which means any data frame should not exceed ~600-900Mb.)
EDIT: The code I copied was not the final version of the functions, so there were bugs in it. Bugs now fixed.
I do not completely agree with the votes to close, but your question does need some reading. As I understood it, the problem is the representation of the date. Numeric is just a crazy idea; use integer in that case. Just as an overview of the different formats and their relative space (using the lsos function from this question):
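A comparable overview can be sketched with base R's object.size() in place of lsos. The vector length n and all variable names below are made up for illustration, and the absolute numbers will differ by platform and R version; only the relative sizes matter.

```r
# Illustrative size comparison of one date column stored in different ways.
# n and the names (d_date, reps, sizes) are assumptions for this sketch.
set.seed(1)
n <- 1e5
d_date <- as.Date("2010-04-01") + sample(0:455, n, replace = TRUE)

reps <- list(
  integer   = as.integer(d_date),            # 4 bytes per element
  numeric   = as.numeric(d_date),            # 8 bytes per element
  Date      = d_date,                        # double storage plus a class attribute
  character = as.character(d_date),
  factor    = factor(as.character(d_date)),
  POSIXlt   = as.POSIXlt(d_date)             # a list of several full-length fields
)

# Size of each representation in Mb
sizes <- sapply(reps, function(x) as.numeric(object.size(x)))
print(round(sizes / 1024^2, 2))
```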
The internal Date can compete pretty well with the numeric representation. character causes trouble with all the rest of the functionality, so forget about that one. I just use Date; that’ll do and keeps the functionality OK. Pay attention to the size of POSIXlt: all functions for extraction of months, weeks, day of the year etc. go over that format. That’s true for format(), and for the functions weekdays(), months(), … in either the base or the chron package.

Some remarks:
- memory.limit(3000): see ?memory.limit.

On to your code. I continue with the Date format, which is about the same size as the numeric format. Let’s try it with the following data (which you could have provided…) with 14.6 million rows. I run Windows 7 (64-bit) with 4Gb memory in total.

First your weekofyear function. As said, the format function uses the underlying POSIXlt format, which is, as shown, memory-intensive. Still, you can cut out about half of the memory by just accessing it directly (see ?POSIXlt). It returns integers, which take about half the memory of the numerics you return.

If you need even less, you’ll have to do the math yourself. But I advise you not to try that out, and definitely not based on a character representation. String operations like strsplit() and substr() will blow up your memory for sure. As does the month.day.year() function of the chron package. Stay away from chron with big data. In fact, regardless of the huge space the POSIXlt objects need, using POSIXlt is still the best option memory-wise for extraction.

On to the aggregate. This is meant for data frames, and hence the aggregate call again makes a lot of copies of the data. Doing the call more manually can again save on the copies, which is what my proposed function does.

Now if we apply this and watch the memory usage, I get the following result:

[Figure: memory usage during both runs]

The red square is your aggregate call, the yellow square is my proposal. The first bump in the memory usage of your aggregate call is the weekofyear function you use. My proposal saves both on the memory usage of weekofyear and of the aggregate call, and runs quite a bit faster too. I never got over 2.6Gb total memory using my proposal.

Hope this helps.
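For reference, the two suggestions above (reading the POSIXlt fields directly, and summing per group without aggregate()) can be sketched roughly as follows. This is a minimal sketch under assumptions: tr_demo is made-up data with the same columns as tr, the helper names weekofyear_lt, monthofyear_lt and sum_by are illustrative, and base R's rowsum() is used here as one possible way to do the grouped sum.

```r
# Made-up demo data with the same columns as tr (much smaller, illustrative only)
set.seed(42)
n <- 1e5
tr_demo <- data.frame(
  customer_id = sample.int(1000L, n, replace = TRUE),
  visit_date  = as.Date("2010-04-01") + sample(0:455, n, replace = TRUE),
  visit_spend = round(runif(n, 1, 100), 2)
)

# Week of year with Monday as first day of the week (the "%W" convention),
# computed from the integer fields of POSIXlt instead of parsing a string
weekofyear_lt <- function(dt) {
  lt <- as.POSIXlt(dt)
  (lt$yday + 7L - (lt$wday + 6L) %% 7L) %/% 7L
}

# Month of year 1..12; the mon field of POSIXlt is 0-based
monthofyear_lt <- function(dt) as.POSIXlt(dt)$mon + 1L

# Grouped sum on plain vectors: no intermediate data frames are built.
# Assumes 'by' is an integer vector, so the row names convert back cleanly.
sum_by <- function(x, by) {
  s <- rowsum(x, by)  # one-column matrix, rows sorted by group
  data.frame(group = as.integer(rownames(s)), total = s[, 1], row.names = NULL)
}

wk <- weekofyear_lt(tr_demo$visit_date)
agg_wkly_spend <- sum_by(tr_demo$visit_spend, wk)
print(head(agg_wkly_spend))
```

For the daily totals the same helper can be called with as.integer(tr_demo$visit_date) as the grouping vector, converting the group column back with as.Date(group, origin = "1970-01-01") afterwards.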