Dear StackOverflow community,
I have a dataset from my university projects I am trying to parse and run some calculations on. It looks similar to:
Month,1,2,3,3,4,4,5,6,7
x.1,0,0,0,0,0,0,0,0,0
x.2,0,0,0,0,0,0,0,0,0
x.3,0,0,0,6,5,5,,,15
x.4,0,0,0,7,7,,,,15
x.5,1,1,1,11,7,5,,,0
x.6,1,1,1,14,6,,,,0
x.7,1,1,1,17,5,,,,15
x.8,1,1,1,21,4,,,,15
x.9,0,0,0,1,1,1,1,1,0
x.10,0,0,0,1,1,1,1,1,0
x.11,1,0,0,1,1,1,1,1,0
x.12,0,0,0,0,0,0,0,0,1
x.13,0,0,0,0,0,0,0,0,0
x.14,0,1,0,0,0,0,0,0,0
x.20,orchid,,,orchid,rose,orchid,orchid,orchid,
x.23,0,0,0,1,1,1,1,1,1
x.24,,,,,buttercup,buttercup,buttercup,buttercup,lilac
x.25,0,0,0,1,1,0,1,1,1
x.26,,,,17,,,,,15
x.27,,,,999,,,,,15
I then try to import it like so:
data <- read.csv("~/data_munging/data.csv", header=F)
my_matrix <- as.matrix(data)
The issue here is that the dataset’s first column is actually the names of the variables, and as.matrix() does not read it as row (variable) names.
(There are also holes in some of the data, but that I will leave for another question).
I am new to R and am wondering What am I Doing Wrong™?
Update:
As per Justin’s comments, here is how I import the dataset and the str() it produces:
> sample_data <- read.csv("~/data_munging/sample_data.csv", header=F)
> str(sample_data)
'data.frame': 28 obs. of 10 variables:
$ V1 : Factor w/ 28 levels "Month","x.1","x.10",..: 1 2 13 22 23 24 25 26 27 28 ...
$ V2 : Factor w/ 4 levels "","0","1","orchid": 3 2 2 2 2 3 3 3 3 2 ...
$ V3 : int 2 0 0 0 0 1 1 1 1 0 ...
$ V4 : int 3 0 0 0 0 1 1 1 1 0 ...
$ V5 : Factor w/ 12 levels "","0","1","11",..: 8 2 2 9 10 4 5 6 7 3 ...
$ V6 : Factor w/ 9 levels "","0","1","4",..: 4 2 2 5 7 7 6 5 4 3 ...
$ V7 : Factor w/ 7 levels "","0","1","4",..: 4 2 2 5 1 5 1 1 1 3 ...
$ V8 : Factor w/ 6 levels "","0","1","5",..: 4 2 2 1 1 1 1 1 1 3 ...
$ V9 : Factor w/ 6 levels "","0","1","6",..: 4 2 2 1 1 1 1 1 1 3 ...
$ V10: Factor w/ 6 levels "","0","1","15",..: 5 2 2 4 4 2 2 4 4 2 ...
The reason I believe it should be a matrix is that this way it reads Month as a factor, and its levels are the row names instead of months (month of year).
Update 2: Now with the original dataset in CSV.
There is a transpose method, t(), for matrices and data frames, which returns a matrix:
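A minimal sketch of that approach, not necessarily the code originally posted here. It assumes a small inline excerpt of the question's data (passed via `text =`) in place of the real file, and uses `row.names = 1` so that the first column becomes row names before `t()` is applied:

```r
# Small excerpt of the question's data, inlined so the sketch is self-contained.
csv_text <- "Month,1,2,3,3,4,4,5,6,7
x.1,0,0,0,0,0,0,0,0,0
x.3,0,0,0,6,5,5,,,15
x.20,orchid,,,orchid,rose,orchid,orchid,orchid,"

# header = FALSE because the first line holds (duplicated) month numbers;
# row.names = 1 turns the first column into row names.
dat <- read.csv(text = csv_text, header = FALSE, row.names = 1,
                stringsAsFactors = FALSE)

# t() on a data frame returns a matrix; the variables become columns.
tdat <- t(dat)
dim(tdat)        # one row per month column, one column per variable
colnames(tdat)   # the former first column: "Month", "x.1", ...
```

Because the excerpt mixes numbers and text, the transposed matrix is a character matrix, so columns meant for calculations would still need converting, for example with `as.numeric()`.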
Resulting in:
I do notice that there is a 999 value that is probably a missing-value indicator, as well as two different representations of missing values in the factor columns. That is a side effect of how read.table read the columns. It “thought” that the V3 and V4 columns were numeric and treated consecutive commas as a true missing value (NA), whereas all the other columns (before transposition) were seen as factor or character variables, so consecutive commas were turned into "", which is not the same as NA_character_ or the NA for factors.
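One way to get consistent missing values at import time, as a hedged sketch: `read.csv()` accepts an `na.strings` argument, so empty fields and a sentinel such as 999 can both be mapped to `NA`. Treating 999 as missing is an assumption about the data, and the inline excerpt below stands in for the real file:

```r
# Excerpt of the question's data with empty fields and the 999 sentinel.
csv_text <- "Month,1,2,3,3,4,4,5,6,7
x.20,orchid,,,orchid,rose,orchid,orchid,orchid,
x.26,,,,17,,,,,15
x.27,,,,999,,,,,15"

# na.strings is matched against whole fields in every column,
# so "" becomes NA in character columns as well as numeric ones.
dat <- read.csv(text = csv_text, header = FALSE, row.names = 1,
                stringsAsFactors = FALSE,
                na.strings = c("", "NA", "999"))

is.na(dat["x.20", "V3"])   # empty field read as NA
is.na(dat["x.27", "V5"])   # 999 sentinel read as NA (assumed to mean missing)
```

Note that this would also turn a legitimate value of 999 into `NA`, so it only fits if 999 never occurs as real data.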