I’m trying to use doSMP / foreach to parallelize some code in R.
I have a huge 2D matrix of genetic data: 10,000 observations (rows) and 3 million variables (columns). Because of memory issues, I had to split it into chunks of 1,000 variables each, one file per chunk.
I want to read in each file, do some stats, and write out those results to a file. This is easy with a for loop, but I want to use foreach to speed it up. Here’s what I’m doing:
# load doSMP, foreach, iterators, codetools
library(doSMP)
# files i'm processing
print(filelist <- system("ls matrix1k.*.txt", intern = TRUE))
#initialize processes
w <- startWorkers(2)
registerDoSMP(w)
# for each file, read into memory, do some stuff, write out results.
foreach(i = seq_along(filelist)) %dopar% {
  print(i)
  file <- filelist[i]
  print(file)
  thisfile <- read.table(file, header = TRUE)
  # here i'll do stuff using that file
  # here i'll write out results of the stuff I do above
}
#stop processes
stopWorkers(w)
But this results in an error: Error in { : task 2 failed - "cannot open the connection". When I change the %dopar% to %do%, there’s no issue at all.
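One way to narrow down an error like this is to check whether every process can actually see the file it is asked to open. This is a sketch, not a diagnosis. It assumes, without confirmation from the error message, that the workers might not share the master's working directory, so it resolves relative names to absolute paths first. The temporary directory, the sample file, and the helper `read_chunk` are hypothetical stand-ins for illustration.

```r
# Hypothetical stand-in data: one small file in a temporary directory
tmpdir <- tempfile("chunks")
dir.create(tmpdir)
write.table(data.frame(a = 1:3, b = 4:6),
            file.path(tmpdir, "matrix1k.1.txt"),
            row.names = FALSE, quote = FALSE)

# Resolve to absolute paths so the result does not depend on getwd()
# (assumption: a worker's working directory may differ from the master's)
filelist <- normalizePath(
  list.files(tmpdir, pattern = "^matrix1k\\..*\\.txt$", full.names = TRUE)
)

# Hypothetical helper: fail with an informative message before read.table does
read_chunk <- function(path) {
  if (!file.exists(path)) {
    stop("cannot see file: ", path, " (working directory: ", getwd(), ")")
  }
  read.table(path, header = TRUE)
}

chunks <- lapply(filelist, read_chunk)
```

If `read_chunk` is called inside the `%dopar%` body instead of `lapply`, the message reports the path and working directory as seen by the process that failed.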
I don’t think reading the files in parallel will speed things up. The limiting factor is the disk controller: opening two connections and reading at the same time doesn’t help, because all the data still has to go through that one controller. Disk I/O is (sadly) a serial job unless you have a RAID array with several disk controllers. Parallel I/O only works well on clusters where each machine has its own disk.
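Whether I/O really is the bottleneck can be measured before parallelizing anything. The sketch below uses only base R and times the read step and the statistics step separately for each file. The generated files, their sizes, and `colMeans` as the "stats" step are assumptions made for illustration, not the original data or analysis.

```r
# Hypothetical stand-in data: three small files in a temporary directory
tmpdir <- tempfile("chunks")
dir.create(tmpdir)
for (k in 1:3) {
  m <- as.data.frame(matrix(rnorm(200 * 50), nrow = 200))
  write.table(m, file.path(tmpdir, sprintf("matrix1k.%d.txt", k)),
              row.names = FALSE, quote = FALSE)
}
filelist <- list.files(tmpdir, pattern = "^matrix1k\\..*\\.txt$",
                       full.names = TRUE)

# Time reading and computing separately, one file at a time (serially)
timings <- t(sapply(filelist, function(f) {
  read_time <- system.time(d <- read.table(f, header = TRUE))[["elapsed"]]
  stat_time <- system.time(res <- colMeans(d))[["elapsed"]]  # placeholder stats
  c(read = read_time, stats = stat_time)
}))

# Total seconds spent in each step across all files
colSums(timings)
```

If the `read` total dominates, extra workers are unlikely to help on a single disk. If `stats` dominates, the computation is the part worth parallelizing.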