I am working with a very large data set which I am downloading from an Oracle database. The data frame has about 21 million rows and 15 columns.
My OS is Windows XP (32-bit) and I have 2 GB of RAM. In the short term I cannot upgrade my RAM or my OS (this is at work, and it will take months before I get a decent PC).
library(RODBC)
sqlQuery(Channel1, "Select * from table1", stringsAsFactors = FALSE)
I already get stuck here with the usual “cannot allocate vector of size x Mb” error.
I found some suggestions about using the ff package. I would appreciate it if anybody familiar with the ff package could tell me whether it would help in my case.
Do you know of another way to get around the memory problem?
Would a 64-bit solution help?
Thanks for your suggestions.
If you are working with package ff and have your data in a SQL database, you can easily load it into ff using package ETLUtils; see its documentation for an example using ROracle.
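As a rough sketch of what that could look like with the RODBC setup from the question: ETLUtils also has an ODBC variant, read.odbc.ffdf, which fetches the query result in batches and appends each batch to an ffdf on disk. The helper name load_table_as_ffdf, the DSN, the credentials and the batch sizes below are placeholders chosen for illustration, not values from the question.

```r
library(ff)
library(ETLUtils)

# Hypothetical helper: dsn, uid and pwd are placeholders for your own
# ODBC data source; the batch sizes are guesses for a 2 GB machine.
load_table_as_ffdf <- function(dsn, uid, pwd,
                               query = "select * from table1") {
  read.odbc.ffdf(
    query = query,
    odbcConnect.args = list(dsn = dsn, uid = uid, pwd = pwd),
    first.rows = 100000,  # rows fetched in the first batch
    next.rows = 500000,   # rows fetched in each later batch
    VERBOSE = TRUE        # print progress while loading
  )
}

# Usage (needs a working ODBC data source, so it is not run here):
# table1 <- load_table_as_ffdf("my_dsn", "my_user", "my_password")
# dim(table1)
```

Only one batch is held in RAM at a time, so smaller values of first.rows and next.rows lower the peak memory use at the cost of more round trips.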
In my experience, ff is perfectly suited to the kind of data set you are working with (21 million rows and 15 columns). In fact, your data set is fairly small for ff, unless your columns contain a lot of character data, which will be converted to factors (meaning all your factor levels need to fit in RAM).
The packages ETLUtils, ff and ffbase let you get your data into R as ff objects and compute some basic statistics on it. Depending on what you want to do with the data and on your hardware, you might have to consider sampling when you build models. I prefer to have my data in R, build a model on a sample, and score the full data set using the tools in ff (such as chunking) or in package ffbase.
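The sample-then-score workflow can be sketched roughly as follows. The data here is a small simulated stand-in (columns x and y are made up) so that the snippet runs without a database; in practice dat would be the ffdf loaded from Oracle, and the linear model is only an example.

```r
library(ff)

# Simulated stand-in for the real table (assumption: two numeric columns).
set.seed(1)
n <- 100000
x <- rnorm(n)
dat <- as.ffdf(data.frame(x = x, y = 2 * x + rnorm(n)))

# Pull a random sample that fits in RAM; indexing rows of an ffdf
# returns an ordinary data.frame.
idx <- sort(sample(nrow(dat), 10000))
train <- dat[idx, ]
fit <- lm(y ~ x, data = train)

# Score the full data chunk by chunk, writing predictions to an
# ff vector on disk so only one chunk is in RAM at a time.
pred <- ff(vmode = "double", length = nrow(dat))
for (i in chunk(dat)) {
  pred[i] <- predict(fit, newdata = dat[i, ])
}
```

chunk() picks the chunk size from the ff batch size option, so the loop adapts to the memory settings of the session rather than using a hard-coded row count.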
The drawback is that you have to get used to the fact that your data are ffdf objects, and that might take some time, especially if you are new to R.