I use R for most of my statistical analysis. However, cleaning and processing data, especially files of 1 GB or more, is quite cumbersome in R, so I use common UNIX tools for that. My question is: is it possible to run them interactively in the middle of an R session? An example: let’s say file1 is the output dataset from an R process, with 100 rows. From this, for my next R process, I need a specific subset of columns 1 and 2, file2, which can be easily extracted with cut and awk. So the workflow is something like:
Some R process => file1
cut --fields=1,2 <file1 | awk something something >file2
Next R process using file2
Apologies in advance if this is a foolish question.
Try this (adding other `read.table` arguments if needed), or do the same in pure R:
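A minimal sketch of both variants, assuming a Unix-like system with `cut` and `awk` on the path and a tab-separated, header-less input. The temporary file standing in for file1, its contents, and the `awk` filter are hypothetical placeholders; `pipe()` is one way to hand a shell pipeline's output to `read.table` without writing file2 to disk:

```r
# Hypothetical stand-in for file1: 3 rows, 4 tab-separated fields, no header.
file1 <- tempfile(fileext = ".txt")
writeLines(c("1\t4\t7\t10", "2\t5\t8\t11", "3\t6\t9\t12"), file1)

# Variant 1: let the shell do the filtering and read its output through pipe().
# The awk program below is only a placeholder for "awk something something".
cmd <- paste("cut -f1,2 <", shQuote(file1), "| awk '$1 > 1'")
DF1 <- read.table(pipe(cmd))

# Variant 2: pure R, read every field and then keep the first two columns.
DF2 <- read.table(file1)[1:2]
```

The first variant never loads the unwanted columns into R; the second is simpler but reads the whole file before subsetting.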
or to not even read the unwanted fields assuming there are 4 fields:
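One way to sketch this is with the `colClasses` argument of `read.table`, where `NA` means "guess the type" and the string `"NULL"` means "skip this column". The temporary file and its contents are hypothetical stand-ins for the four-field file1:

```r
# Hypothetical stand-in for file1: 3 rows, 4 whitespace-separated fields.
file1 <- tempfile(fileext = ".txt")
writeLines(c("1\t4\t7\t10", "2\t5\t8\t11", "3\t6\t9\t12"), file1)

# Keep fields 1 and 2 (type guessed by R), skip fields 3 and 4 entirely.
DF <- read.table(file1, colClasses = c(NA, NA, "NULL", "NULL"))
```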
The last line could be modified for the case where we know we want the first two fields but don’t know how many other fields there are:
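A possible sketch, assuming every line has the same number of fields so that counting the fields on the first line is enough. `count.fields` is a base R function; the five-field temporary file is a hypothetical stand-in, and any `sep` passed to `read.table` should also be passed to `count.fields`:

```r
# Hypothetical input with an "unknown" number of fields (here 5).
file1 <- tempfile(fileext = ".txt")
writeLines(c("1 4 7 10 13", "2 5 8 11 14", "3 6 9 12 15"), file1)

# Number of fields on the first line.
n <- count.fields(file1)[1]

# Keep the first two fields and skip the remaining n - 2.
DF <- read.table(file1, colClasses = c(NA, NA, rep("NULL", n - 2)))
```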
The sqldf package can be used. In this example we assume a csv file, `data.csv`, and that the desired fields are called `a` and `b`. If it's not a csv file, then use appropriate arguments to `read.csv.sql` to specify another separator, etc.:
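A minimal sketch, assuming the sqldf package is installed. The temporary file stands in for `data.csv`, and its third column `c` is a hypothetical extra field added only to show that it is left out; inside the SQL statement, `read.csv.sql` refers to the input as the table `file`:

```r
library(sqldf)

# Hypothetical stand-in for data.csv with columns a, b and an unwanted c.
data_csv <- tempfile(fileext = ".csv")
write.csv(data.frame(a = 1:3, b = 4:6, c = 7:9), data_csv,
          row.names = FALSE, quote = FALSE)

# Only columns a and b are returned to R; the selection is done in SQLite.
DF <- read.csv.sql(data_csv, sql = "select a, b from file")
```

Because the filtering happens in a temporary SQLite database rather than in R, a `where` clause can be added to the SQL to select rows as well.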