Some issues trying to read a file with cbc.read.table function in R + using filter while reading files
a) I'm trying to read a relatively big .txt file with the function cbc.read.table from the colbycol package in R. From what I've read, this package makes the job easier when a file is large (more than a GB to be read into R) and not all of its columns/variables are needed for the analysis. I also read that cbc.read.table should support the same parameters as read.table. However, if I pass the parameter nrows (in order to get a preview of my file in R), I get the following error:
# My code. I'm just reading columns 5, 6, 7, 8 out of 27
i.can <- cbc.read.table("xxx.txt", header = TRUE, sep = "\t", just.read = 5:8, nrows = 20)
#error message
Error in read.table(file, nrows = 50, sep = sep, header = header, ...) :
formal argument "nrows" matched by multiple actual arguments
So, my question is: how can I solve this problem?
b) After that, I tried to read all instances with the following code:
i.can.b <- cbc.read.table("xxx.txt", header = TRUE, sep = "\t", just.read = 4:8)  # runs without problems
my.df <- as.data.frame(i.can.b)  # this line raises the error
Error in readSingleKey(con, map, key) : unable to obtain value for key 'Company' #Company is a string column in my data set
So, my question is again: How can I solve this?
c) Is there a way to filter rows (by conditions on their values) while reading a file?
In reply to a):
cbc.read.table() reads in the data in 50-row chunks, which is the read.table(file, nrows = 50, ...) call shown in your error message. Since the function already assigns the nrows argument the value 50, when it also passes along the nrows argument that you specify, two nrows arguments reach read.table(), resulting in the error. To me, this seems to be a bug. To get around this, you can either modify the cbc.read.table() function to handle the specified nrows argument or have it accept something like a max.rows argument (and perhaps pass the change along to the maintainer as a potential patch). Alternatively, you can specify the sample.pct argument, which specifies the proportion of rows to read. So, if the file contains 100 rows and you only want 50: sample.pct = 0.5.
In reply to b):
Not sure what that error means. It is hard to diagnose without a reproducible example. Do you get the same error if you read in a smaller file?
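One way to produce a small sample for that test (and the preview asked about in a)) is to skip colbycol and use base read.table(), which drops any column whose colClasses entry is "NULL". This is only a sketch: the temporary file and its V1..V27 column names are made-up stand-ins for xxx.txt, and the 27-column count is taken from the question.

```r
# Hypothetical stand-in for the real file: 27 tab-separated columns, 100 rows.
path <- tempfile(fileext = ".txt")
demo <- as.data.frame(matrix(seq_len(100 * 27), nrow = 100))
write.table(demo, path, sep = "\t", row.names = FALSE, quote = FALSE)

# "NULL" skips a column while reading; NA lets read.table() guess its type.
keep <- 5:8
col.classes <- rep("NULL", 27)
col.classes[keep] <- NA

# Read only the first 20 data rows of the four columns of interest.
preview <- read.table(path, header = TRUE, sep = "\t",
                      colClasses = col.classes, nrows = 20)
dim(preview)
```

If the error from b) also shows up when a 20-row extract like this is written to disk and passed to cbc.read.table(), that small file would make a usable reproducible example.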
In reply to c):
I generally prefer storing very large character data in a relational database, such as MySQL. It might be easier in your case to use the RSQLite package, which embeds an SQLite engine within R. SQL SELECT queries can then be used to retrieve conditional subsets of the data. Other packages for larger-than-memory data are listed under "Large memory and out-of-memory data" in the CRAN task view here: http://cran.r-project.org/web/views/HighPerformanceComputing.html
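A rough sketch of the RSQLite route, assuming the DBI and RSQLite packages are installed. The demo file, the table name companies, the Revenue column and the chunk size of 30 are all invented for illustration; only Company comes from your question. The file is loaded in chunks so it never has to sit in memory all at once, and the filtering happens in SQL.

```r
library(DBI)  # assumes the DBI and RSQLite packages are installed

# Hypothetical stand-in for the real file; Revenue is a made-up column.
path <- tempfile(fileext = ".txt")
demo <- data.frame(Company = rep(c("Acme", "Globex"), 50),
                   Revenue = 1:100)
write.table(demo, path, sep = "\t", row.names = FALSE, quote = FALSE)

# In-memory database for the demo; use a file path for real data.
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Read the header once, then append the rest of the file chunk by chunk.
input <- file(path, open = "r")
col.names <- strsplit(readLines(input, n = 1), "\t")[[1]]
repeat {
  chunk <- tryCatch(
    read.table(input, sep = "\t", nrows = 30, col.names = col.names,
               stringsAsFactors = FALSE),
    error = function(e) NULL)  # read.table() errors once no lines are left
  if (is.null(chunk)) break
  dbWriteTable(con, "companies", chunk, append = TRUE)
}
close(input)

# Filter on row conditions in SQL; only the matching rows come back into R.
subset.df <- dbGetQuery(con,
  "SELECT Company, Revenue FROM companies
   WHERE Company = 'Acme' AND Revenue > 90")
dbDisconnect(con)
```

With a database file on disk instead of ":memory:", the load step only has to be done once, and later sessions can query the table directly.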