I have language data containing over 200 languages, with some missing values coded as '' (zero-length strings).
I would like to compress this using factor, coding the main languages individually and all others as 'other language', while '' is coded as '(missing)' and shows up at the end of the table.
My plan is this:
lanfmt <- list(
  lev = c(prime <- c('English', 'Russian', 'Urdu'), diff <- setdiff(levels(lan), c(prime, '')), ''),
  lab = c(prime, rep('other language', length(diff)), '(missing)')
)
table(factor(lan, lanfmt$lev, lanfmt$lab))
but R doesn’t like many-to-one mappings from levels to labels in factor. How do I aggregate several levels into a single category?
EDIT:
I decided that a good solution, using lanfmt as described above, is the following:
table(lanfmt$lab[match(lan, lanfmt$lev)])
It’s not as elegant, but it works in a pinch.
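For illustration, here is a rough sketch of that lookup on a small made-up vector. The toy data and the names `other` and `recoded` are hypothetical, and it assumes `lab` maps every non-main language to 'other language' and '' to '(missing)'. Because the lookup returns plain character values, `table` would sort them alphabetically; wrapping the result in `factor` with explicit levels is one way to keep '(missing)' last.

```r
# Toy data standing in for the real language column (made up for this sketch)
lan <- factor(c("English", "Urdu", "", "French", "Russian", "German", "English", ""))

prime <- c("English", "Russian", "Urdu")
other <- setdiff(levels(lan), c(prime, ""))   # every level that is not a main language

lanfmt <- list(
  lev = c(prime, other, ""),
  lab = c(prime, rep("other language", length(other)), "(missing)")
)

# Many-to-one lookup: the position of each value in lev picks its label in lab
recoded <- lanfmt$lab[match(lan, lanfmt$lev)]

# Fix the display order so "(missing)" comes last instead of sorting alphabetically
recoded <- factor(recoded, levels = unique(lanfmt$lab))
table(recoded)
```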
I think you should convert your factors to character, edit them and then order them. Maybe something like this would help (lan being the language vector of your list / data frame):
After that you can order your languages.
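A minimal sketch of the convert, edit and order steps described above. This is not the code originally posted with this answer; the toy data and the names `prime`, `lan_chr` and `lan_grouped` are assumptions made for illustration.

```r
# Toy data standing in for the real language vector (made up for this sketch)
lan <- factor(c("English", "Urdu", "", "French", "Russian", "German", "English", ""))
prime <- c("English", "Russian", "Urdu")      # the main languages to keep

# 1. Convert the factor to character so values can be edited freely
lan_chr <- as.character(lan)

# 2. Edit: recode empty strings, then lump everything else that is not a main language
lan_chr[lan_chr == ""] <- "(missing)"
lan_chr[!lan_chr %in% c(prime, "(missing)")] <- "other language"

# 3. Order: rebuild the factor with the levels in the desired display order
lan_grouped <- factor(lan_chr, levels = c(prime, "other language", "(missing)"))
table(lan_grouped)
```

Setting the levels explicitly in the last step is what puts '(missing)' at the end of the table rather than wherever alphabetical sorting would place it.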