I am reading a large events file in R (close to 2 million lines), parsing each line into a set of event attributes, and storing them in a matrix. I pre-allocate a matrix with 2 million rows, then repeatedly read a small chunk of the file and process it. However, processing the file takes too long. What can I do to improve performance? Here is my code snippet:
numEvents <<- 2000000
eventLog <<- matrix(0, nrow = numEvents, ncol = 4)  # pre-allocated, one row per event

loadEvents <- function(inputfile) {
  con <- file(inputfile, "r", blocking = FALSE)
  batch <- 1000  # number of lines read per chunk
  lines <- readLines(con, n = batch)
  # eventCount is not defined in this snippet; presumably a global counter updated by storeEvent()
  while (length(lines) > 0 && eventCount <= numEvents) {
    for (i in seq_along(lines))
      storeEvent(lines[i])  # processes and stores each event in eventLog
    lines <- readLines(con, n = batch)
  }
  close(con)
}
Could the batch size be the problem?
Any ideas would be much appreciated.
I found the issue in my processing. I was using a list as a map to hold the mappings for events. A list is not inherently a hash map, so lookups by key can be quite slow. I changed it to use hash() and performance improved tenfold. Thanks.
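For anyone hitting the same thing, here is a minimal sketch of the difference. It uses a base R environment as the hashed map rather than hash() itself, so that it runs without extra packages (as far as I know, the hash package is built on environments as well). The key names, the number of keys, and the helper names lookup_list and lookup_env are made up for illustration; they are not from my actual code.

```r
# Hypothetical event-type keys and integer codes, for illustration only.
keys  <- paste0("event_type_", seq_len(5000))
codes <- seq_along(keys)

# Map 1: a named list. Looking up by name has to search the names,
# so each lookup gets slower as the list grows.
list_map <- setNames(as.list(codes), keys)

# Map 2: an environment created with hash = TRUE, which is base R's
# hashed key-value store.
env_map <- new.env(hash = TRUE)
for (i in seq_along(keys)) {
  assign(keys[i], codes[i], envir = env_map)
}

# Hypothetical helpers wrapping the two lookup styles.
lookup_list <- function(k) list_map[[k]]
lookup_env  <- function(k) get(k, envir = env_map, inherits = FALSE)

# Both maps return the same value for the same key.
stopifnot(identical(lookup_list("event_type_4200"),
                    lookup_env("event_type_4200")))

# Rough timing of many repeated lookups; actual numbers depend on the
# machine, the R version, and the number of keys.
probe <- sample(keys, 20000, replace = TRUE)
print(system.time(for (k in probe) lookup_list(k)))
print(system.time(for (k in probe) lookup_env(k)))
```

The gap between the two timings should widen as the number of keys grows, which is why a per-event lookup repeated 2 million times can dominate the run time regardless of the batch size used for reading.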