I have some data:
transaction <- c(1,2,3);
date <- c("2010-01-31","2010-02-28","2010-03-31");
type <- c("debit", "debit", "credit");
amount <- c(-500, -1000.97, 12500.81);
oldbalance <- c(5000, 4500, 17000.81)
evolution <- data.frame(transaction, date, type, amount, oldbalance, row.names=transaction, stringsAsFactors=FALSE);
evolution$date <- as.Date(evolution$date, "%Y-%m-%d");
evolution <- transform(evolution, newbalance = oldbalance + amount);
evolution
If I enter the command:
type <- factor(type)
where type is a nominal (categorical) variable, then what difference does it make to my data?
Thanks
Factors vs character vectors when doing stats:
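In terms of doing statistics, there’s little practical difference in how R treats factors and character vectors: modelling functions such as lm() convert a character vector to a factor internally. In fact, it’s often easier to leave categorical variables as character vectors.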
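Applied to the `type` vector from the question, a minimal sketch of what `factor()` changes might look like this. It uses only base R and a cut-down version of the `evolution` data frame. Note that the data frame holds its own copy of `type`, so converting the standalone vector leaves `evolution$type` untouched.

```r
type <- c("debit", "debit", "credit")
evolution <- data.frame(type, stringsAsFactors = FALSE)  # cut-down version of the question's data

type <- factor(type)      # converts only the standalone vector

class(type)               # "factor"
levels(type)              # the distinct categories, sorted: "credit" "debit"
as.integer(type)          # underlying integer codes: 2 2 1
class(evolution$type)     # still "character": the data frame kept its own copy

# To change the column itself, assign into the data frame
evolution$type <- factor(evolution$type)
```

A factor is stored as integer codes plus a set of levels, which is the same representation a modelling function builds from a character vector.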
If you do a regression or ANOVA with lm() with a character vector as a categorical variable, you’ll get the normal model output, but possibly (depending on your R version) with a warning that the character variable was converted to a factor.
Factors vs character vectors when manipulating dataframes:
When manipulating dataframes, however, character vectors and factors are treated very differently. Some information on the annoyances of R & factors can be found on the Quantum Forest blog, R pitfall #3: friggin’ factors.
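A few of the differences that tend to bite when manipulating data can be sketched as follows. The vectors `chr` and `fct` are made up for illustration.

```r
chr <- c("10", "20", "20")
fct <- factor(chr)

# as.numeric() on a factor returns the internal codes, not the labels
as.numeric(chr)                  # 10 20 20
as.numeric(fct)                  # 1 2 2
as.numeric(as.character(fct))    # 10 20 20

# Assigning a value that is not an existing level yields NA (with a warning)
fct[1] <- "30"
fct                              # <NA> 20 20

# A character vector accepts any new value
chr[1] <- "30"
chr                              # "30" "20" "20"
```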
It’s useful to use stringsAsFactors = FALSE when reading data in from a .csv or .txt file using read.table or read.csv. As noted in another reply, you have to make sure that everything in your character vector is consistent, or else every typo will be designated as a different factor level. You can use the function gsub() to fix typos.

Here is a worked example showing how lm() gives you the same results with a character vector and a factor.
A random independent variable:
A random categorical variable as a character vector:
Convert the character vector to a factor variable.
factor_x <- as.factor(character_x)
Give the two categories random values:
Create a random relationship between the independent variables and a dependent variable:
Compare the output of a linear model with the factor variable and the character
vector. Note the warning that is given with the character vector.
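One way the steps above could be put together is sketched below. Only `character_x` and `factor_x` are names from the text; `independent_x`, `category_effect`, `coefficient_x`, `dependent_y`, the seed, the sample size and the effect sizes are assumptions chosen for illustration.

```r
set.seed(42)                      # assumed seed, for reproducibility
n <- 100                          # assumed sample size

# A random independent variable
independent_x <- rnorm(n)

# A random categorical variable as a character vector
character_x <- sample(c("A", "B"), n, replace = TRUE)

# Convert the character vector to a factor variable
factor_x <- as.factor(character_x)

# Give the two categories random values
category_effect <- c(A = rnorm(1), B = rnorm(1))
coefficient_x <- unname(category_effect[character_x])

# Create a random relationship between the predictors and a dependent variable
dependent_y <- 2 * independent_x + coefficient_x + rnorm(n)

# Fit the same model with the factor and with the character vector
fit_factor    <- lm(dependent_y ~ independent_x + factor_x)
fit_character <- lm(dependent_y ~ independent_x + character_x)

# The estimates agree; only the coefficient names differ
all.equal(unname(coef(fit_factor)), unname(coef(fit_character)))
```

Whether the second lm() call prints a conversion warning is likely to depend on the R version in use.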