I started using the data.table package in R to boost the performance of my code. I am using the following code:
library(data.table)
sp500 <- read.csv('../rawdata/GMTSP.csv')
days <- c("Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday")
# Using data.table to get the things much much faster
sp500 <- data.table(sp500, key="Date")
sp500 <- sp500[,Date:=as.Date(Date, "%m/%d/%Y")]
sp500 <- sp500[,Weekday:=factor(weekdays(sp500[,Date]), levels=days, ordered=T)]
sp500 <- sp500[,Year:=(as.POSIXlt(Date)$year+1900)]
sp500 <- sp500[,Month:=(as.POSIXlt(Date)$mon+1)]
I noticed that the conversion done by the as.Date function is very slow compared to the other functions that create weekdays, etc. Why is that? Is there a better or faster way to convert to the Date format? (If you ask whether I really need the Date format: probably yes, because I then use ggplot2 to make plots, which works like a charm with this type of data.)
To be more precise:
> system.time(sp500 <- sp500[,Date:=as.Date(Date, "%m/%d/%Y")])
user system elapsed
92.603 0.289 93.014
> system.time(sp500 <- sp500[,Weekday:=factor(weekdays(sp500[,Date]), levels=days, ordered=T)])
user system elapsed
1.938 0.062 2.001
> system.time(sp500 <- sp500[,Year:=(as.POSIXlt(Date)$year+1900)])
user system elapsed
0.304 0.001 0.305
This is on a MacBook Air (i5) with slightly fewer than 3,000,000 observations.
I think it’s just that as.Date converts character to Date via POSIXlt, using strptime. And strptime is very slow, I believe.
To trace it through yourself, type as.Date, then methods(as.Date), then look at the character method.
Why is as.POSIXlt(Date)$year+1900 relatively fast? Again, trace it through: the Date method of as.POSIXlt calls the internal function Date2POSIXlt.
Intrigued, let’s dig into Date2POSIXlt. For this bit we need to grep src/main to know which .c file to look at.
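The R-level part of that trace, before turning to the C sources, can be sketched as follows. It only uses base R; the variable names as_date_character and as_posixlt_date are made up for illustration, and the printed function bodies will differ between R versions.

```r
# The generic itself: it only dispatches via UseMethod()
print(as.Date)

# All S3 methods registered for as.Date
print(methods(as.Date))

# The character method: look for the strptime() call in its body
as_date_character <- getS3method("as.Date", "character")
print(as_date_character)

# The Date method of as.POSIXlt: look for the internal Date2POSIXlt call
as_posixlt_date <- getS3method("as.POSIXlt", "Date")
print(as_posixlt_date)
```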
Now we know we need to look for D2POSIXlt.
Oh, we could have guessed datetime.c. Anyway, looking at the latest live copy of datetime.c:
Search in there for D2POSIXlt and you’ll see how simple it is to go from Date (numeric) to POSIXlt. You’ll also see how POSIXlt is one real vector (8 bytes) plus eight integer vectors (4 bytes each). That’s 40 bytes, per date!
So the crux of the issue (I think) is why strptime is so slow, and maybe that can be improved in R. Or just avoid POSIXlt, either directly or indirectly.
Here’s a reproducible example using the number of items stated in the question (3,000,000):
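A benchmark along these lines might look like the following sketch. The vector x, the date range and the seed are assumptions chosen for illustration, and N is set well below 3,000,000 so it finishes quickly; raise it to match the question. Timings depend on machine and locale, so no expected output is shown.

```r
# Assumption: synthetic dates in the same "%m/%d/%Y" layout as the question's data
N <- 1e5                      # increase to 3e6 to match the question
set.seed(1)
x <- format(as.Date("2000-01-01") + sample(0:3650, N, replace = TRUE), "%m/%d/%Y")

# strptime without and with an explicit time zone
t_strptime    <- system.time(a <- strptime(x, "%m/%d/%Y"))
t_strptime_tz <- system.time(b <- strptime(x, "%m/%d/%Y", tz = "GMT"))

# as.Date on character input, which goes through strptime internally
t_as_date     <- system.time(d <- as.Date(x, "%m/%d/%Y"))

# Elapsed seconds for each variant
print(c(strptime    = t_strptime[["elapsed"]],
        strptime_tz = t_strptime_tz[["elapsed"]],
        as_date     = t_as_date[["elapsed"]]))
```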
Passing tz appears to speed up strptime, which as.Date.character does. So maybe it depends on your locale. But strptime appears to be the culprit, not data.table. Perhaps rerun this example and see if it takes 90 seconds for you on your machine?
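If strptime is the bottleneck, one way to call it less often is to parse each distinct date string only once and map the results back. This is a minimal sketch, not something from the question: fast_as_date is a hypothetical helper name, and it only pays off under the assumption that the Date column contains many repeated values (for example, many observations per day).

```r
# Hypothetical helper: parse unique strings once, then map back by position
fast_as_date <- function(x, format = "%m/%d/%Y") {
  x <- as.character(x)          # also handles factor input (e.g. from read.csv)
  u <- unique(x)                # distinct date strings
  parsed <- as.Date(u, format)  # strptime runs on length(u) values only
  parsed[match(x, u)]           # expand back to the original length and order
}

# Small illustrative input with repeated values and a missing entry
x <- c("01/02/2012", "01/02/2012", "12/31/2011", NA)
res <- fast_as_date(x)
print(res)
```

With a data.table as in the question, this could be applied as sp500[, Date := fast_as_date(Date)], again assuming the column holds strings in that format.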