I asked a question about this a few months back, and I thought the answer had solved my problem, but I ran into the problem again and the solution didn’t work for me.
I’m importing a CSV:
orders <- read.csv("<file_location>", sep=",", header=T, check.names = FALSE)
Here’s the structure of the dataframe:
str(orders)
'data.frame': 3331575 obs. of 2 variables:
$ OrderID : num -2034590217 -2034590216 -2031892773 -2031892767 -2021008573 ...
$ OrderDate: Factor w/ 402 levels "2010-10-01","2010-10-04",..: 263 263 269 268 301 300 300 300 300 300 ...
If I run the length command on the first column, OrderID, I get this:
length(orders$OrderID)
[1] 0
If I run the length on OrderDate, it returns correctly:
length(orders$OrderDate)
[1] 3331575
This is a copy/paste of the head of the CSV.
OrderID,OrderDate
-2034590217,2011-10-14
-2034590216,2011-10-14
-2031892773,2011-10-24
-2031892767,2011-10-21
-2021008573,2011-12-08
-2021008572,2011-12-07
-2021008571,2011-12-07
-2021008570,2011-12-07
-2021008569,2011-12-07
Now, if I re-run the read.csv, but take out the check.names option, the first column of the dataframe now has an X. at the start of the name.
orders2 <- read.csv("<file_location>", sep=",", header=T)
str(orders2)
'data.frame': 3331575 obs. of 2 variables:
$ X.OrderID: num -2034590217 -2034590216 -2031892773 -2031892767 -2021008573 ...
$ OrderDate: Factor w/ 402 levels "2010-10-01","2010-10-04",..: 263 263 269 268 301 300 300 300 300 300 ...
length(orders2$X.OrderID)
[1] 3331575
This works correctly.
My question is: why does R add an X. to the beginning of the first column name? As you can see from the CSV file, there are no special characters. It should be a simple load. Adding check.names = FALSE imports the name as it appears in the CSV, but the data then does not load correctly for me to perform analysis on.
What can I do to fix this?
Side note: I realize this is a minor issue. I’m just frustrated that I think I am loading the file correctly, yet I’m not getting the result I expected. I could rename the column using colnames(orders)[1] <- "OrderID", but I still want to know why it doesn’t load correctly.
read.csv() is a wrapper around the more general read.table() function. That latter function has an argument check.names, which is documented as checking that the variable names in the data frame are syntactically valid and, if necessary, adjusting them (via make.names()) so that they are, and so that there are no duplicates.
If your header contains labels that are not syntactically valid, then make.names() will replace them with a valid name based upon the invalid name, replacing invalid characters and possibly prepending X. This is documented in ?make.names: a syntactically valid name consists of letters, numbers, and the dot or underline characters, and starts with a letter or with a dot not followed by a number. The definition of a letter depends on the current locale. The character "X" is prepended if necessary, and all invalid characters are translated to ".".
The behaviour you are seeing is entirely consistent with the documented way read.table() loads in your data. That would suggest that you have syntactically invalid labels in the header row of your CSV file. Note the point above from ?make.names that what is a letter depends on the locale of your system: the CSV file might include a valid character that your text editor will display, but if R is not running in the same locale, that character may not be valid there.
I would look at the CSV file and identify any non-ASCII characters in the header line; there are possibly non-visible characters (or escape sequences; \t?) in the header row also. A lot may be going on between reading in the file with the non-valid names and displaying it in the console, which might be masking the non-valid characters, so don’t take the fact that nothing looks wrong when name checking is turned off (check.names = FALSE) as indicating that the file is OK.
Posting the output of sessionInfo() would also be useful.
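To make the suggestion above concrete, here is a minimal sketch of how you might inspect the header bytes. It assumes, purely for illustration, that the invisible character is a UTF-8 byte order mark (bytes EF BB BF) at the start of the file; your file may contain something else entirely. The temporary file and the names tmp, first_bytes, has_bom, has_non_ascii and orders_fixed are hypothetical and stand in for your real data.

```r
# How make.names() repairs invalid names (plain ASCII cases only).
print(make.names(c("OrderID", "1st", "order id")))

# Build a small stand-in CSV that starts with a UTF-8 byte order mark.
# This is an assumed cause, used here only to have something to inspect.
tmp <- tempfile(fileext = ".csv")
con <- file(tmp, open = "wb")
writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)
writeBin(charToRaw("OrderID,OrderDate\n-2034590217,2011-10-14\n"), con)
close(con)

# Read the first few bytes as raw values, so nothing is hidden by the
# console or by locale-dependent character conversion.
first_bytes <- readBin(tmp, what = "raw", n = 20)
print(first_bytes)

# Any byte above 0x7F is non-ASCII; EF BB BF at the start is a UTF-8 BOM.
has_non_ascii <- any(as.integer(first_bytes) > 127L)
has_bom <- identical(first_bytes[1:3], as.raw(c(0xEF, 0xBB, 0xBF)))

# If a BOM is the culprit, the "UTF-8-BOM" encoding (R >= 3.0.0) asks R
# to drop it while reading, so the header is read as plain "OrderID".
orders_fixed <- read.csv(tmp, fileEncoding = "UTF-8-BOM")
print(names(orders_fixed))
```

On your real file, replace tmp with the path to the CSV and look at the output of readBin(): if the first bytes are not the ASCII codes for "OrderID", that is the character make.names() is repairing.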