I’m dealing with a categorical variable retrieved from a database and am wanting to

Asked: May 31, 2026 (2026-05-31T13:47:45+00:00)

I’m dealing with a categorical variable retrieved from a database and want to use factors to maintain the “fullness” of the data, i.e. keep both the labels and their database IDs.

For instance, I have a table that stores colors and their associated numeric IDs:

  ID  | Color
------+-------
    1 | Black
 1805 | Red
 3704 | White

So I’d like to use a factor to store this information in a data frame such as:

Car Model | Color
----------+-------
Civic     | Black
Accord    | White
Sentra    | Red

where the color column is a factor and the underlying data stored, rather than being a string, is actually c(1, 3704, 1805): the IDs associated with each color.

So I can create a custom factor by modifying the levels attribute of an object of the factor class to achieve this effect.

Unfortunately, as you can see in the example, my IDs are not consecutive. In my application, I have ~30 levels and the maximum ID for one level is ~9,000. Because the levels of a factor are stored in a vector indexed by the integer code, that means I’m storing a levels vector of length ~9,000 with only 30 meaningful elements in it.

Is there any way to use a hash or list to accomplish this effect more efficiently? i.e. if I were to use a hash in the levels attribute of a factor, I could store all 30 elements with whatever indices I please without having to create an array of size max(ID).

Thanks in advance!

1 Answer

Editorial Team · Answer 1 · 2026-05-31T13:47:46+00:00

Well, I’m pretty sure you can’t change how factors work. A factor always has level ids that are integer numbers 1..n where n is the number of levels.

…but you can easily have a translation vector to get to your color ids:

# The translation vector...
colorIds <- c(Black=1,Red=1805,White=3704)

# Create a factor with the correct levels 
# (but with level ids that are 1,2,3...)
f <- factor(c('Red','Black','Red','White'), levels=names(colorIds))
as.integer(f) # 2 1 2 3

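# Translate level ids to your color ids
# (a factor used as an index is treated as its integer codes,
# so this relies on the levels matching names(colorIds) in order)
colorIds[f] # 1805 1 1805 3704 (returned as a named vector)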

Technically, colorIds does not need to carry the names of the colors, but keeping them in one place is convenient, since the names are also used when creating the levels for the factor. You want to specify the levels explicitly so that their numbering matches the positions in colorIds even if the levels are not in alphabetical order (as yours happen to be).
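
If the ID table comes straight from the database, the translation vector can be built from it instead of being typed by hand. A minimal sketch, assuming the lookup table arrives as a data frame with columns ID and Color (the names colors, carColorIds, carColor and backToIds are made up for illustration):

```r
# Hypothetical lookup table, as it might come back from the database
colors <- data.frame(
  ID    = c(1, 1805, 3704),
  Color = c("Black", "Red", "White"),
  stringsAsFactors = FALSE
)

# Named translation vector: one entry per level, regardless of max(ID)
colorIds <- setNames(colors$ID, colors$Color)

# IDs as fetched for the cars, turned into an ordinary factor via match()
carColorIds <- c(1, 3704, 1805)
carColor <- factor(colors$Color[match(carColorIds, colors$ID)],
                   levels = colors$Color)

# Back to database IDs; indexing by label (not by integer code) means
# the result does not depend on the order of the factor levels
backToIds <- unname(colorIds[as.character(carColor)])
```

Indexing with as.character(carColor) looks the IDs up by name, so it should keep working even if the factor's levels end up in a different order than the translation vector.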

EDIT: It is, however, possible to create a class deriving from factor that stores the codes as an attribute. Let’s call this new glorious class foo:

foo <- function(x = character(), levels, codes) {
    f <- factor(x, levels)
    attr(f, 'codes') <- codes
    class(f) <- c('foo', class(f))
    f
}

`[.foo` <- function(x, ...) {
    y <- NextMethod('[')
    attr(y, 'codes') <- attr(x, 'codes')
    y
}

as.integer.foo <- function(x, ...) attr(x,'codes')[unclass(x)]

# Try it out
set.seed(42)
f <- foo(sample(LETTERS[1:5], 10, replace=TRUE), levels=LETTERS[1:5], codes=101:105)

d <- data.frame(i=11:15, f=f)

# Try subsetting it...
d2 <- d[2:5,]

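# Gets the codes, not the level ids...
# (the values shown come from the pre-3.6.0 sample() algorithm;
# newer R versions may draw a different sample for the same seed)
as.integer(d2$f) # 105 102 105 104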

You could then also define print.foo and similar methods so that the codes show up in the output.
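
As a rough idea of what such a print method could look like, here is one possible sketch. It repeats the constructor and the as.integer method so that it runs on its own; the "label (code)" display format is just an assumption about what would be useful, not something factor provides:

```r
# Constructor and as.integer method, repeated from the answer
foo <- function(x = character(), levels, codes) {
    f <- factor(x, levels)
    attr(f, 'codes') <- codes
    class(f) <- c('foo', class(f))
    f
}

as.integer.foo <- function(x, ...) attr(x, 'codes')[unclass(x)]

# Hypothetical print method: show each value as "label (code)",
# followed by the full label-to-code mapping
print.foo <- function(x, ...) {
    labels <- as.character(x)
    shown <- ifelse(is.na(labels), NA_character_,
                    paste0(labels, " (", as.integer(x), ")"))
    print(shown, quote = FALSE)
    cat("Codes:",
        paste(paste0(levels(x), "=", attr(x, 'codes')), collapse = ", "),
        "\n")
    invisible(x)
}

f <- foo(c("B", "A"), levels = LETTERS[1:3], codes = 101:103)
print(f)
```

Other generics such as format or summary would need the same treatment if the codes should appear there too.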
