I’m dealing with a categorical variable retrieved from a database and want to use factors to maintain the “fullness” of the data.
For instance, I have a table which stores colors and their associated numerical IDs:
ID | Color
------+-------
1 | Black
1805 | Red
3704 | White
So I’d like to use a factor to store this information in a data frame such as:
Car Model | Color
----------+-------
Civic     | Black
Accord    | White
Sentra    | Red
where the color column is a factor and the underlying data stored, rather than being a string, is actually c(1, 3704, 1805), the IDs associated with each color.
So I can create a custom factor by modifying the levels attribute of an object of the factor class to achieve this effect.
Unfortunately, as you can see in the example, my IDs are not consecutive. In my application, I have ~30 levels and the maximum ID for one level is ~9,000. Because the levels of a factor are stored in a vector indexed by the underlying integer code, that means I’m storing a levels vector of length ~9,000 with only 30 meaningful elements in it.
Is there any way to use a hash or list to accomplish this effect more efficiently? i.e. if I were to use a hash in the levels attribute of a factor, I could store all 30 elements with whatever indices I please without having to create an array of size max(ID).
Thanks in advance!
Well, I’m pretty sure you can’t change how factors work. A factor always has level ids that are the integers 1..n, where n is the number of levels.
…but you can easily have a translation vector to get to your color ids:
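A minimal sketch of such a translation vector, assuming a named integer vector colorIds whose names are the color labels and whose values are the database IDs (the cars data frame and its column names are made up for illustration):

```r
# Assumed lookup: names are the color labels, values are the database IDs.
colorIds <- c(Black = 1L, Red = 1805L, White = 3704L)

# Store the color as an ordinary factor. The levels are given explicitly so
# that level i of the factor corresponds to element i of colorIds.
cars <- data.frame(
  model = c("Civic", "Accord", "Sentra"),
  color = factor(c("Black", "White", "Red"), levels = names(colorIds))
)

# The internal codes run 1..n, as for any factor.
as.integer(cars$color)   # 1 3 2

# Translate the codes to database IDs by indexing the lookup vector.
ids <- unname(colorIds[as.integer(cars$color)])
ids                      # 1 3704 1805

# Going the other way: rebuild the factor from IDs fetched from the database.
fromDb <- factor(names(colorIds)[match(ids, colorIds)],
                 levels = names(colorIds))
```

With this layout only the 30 or so labels and their IDs are stored, regardless of how large the largest ID is.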
Technically, colorIds does not need to define the names of the colors, but it makes it easier to have them in one place, since the names are used when creating the levels for the factor. You want to specify the levels explicitly so that their numbering matches even if the levels are not in alphabetical order (as yours happen to be).
EDIT: It is, however, possible to create a class deriving from factor that has the codes as an attribute. Let’s call this new glorious class foo:
You could then also fix print.foo etc…
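One way the foo class could be sketched is shown here. The constructor foo(), the accessor dbIds(), and the attribute name "codes" are illustrative assumptions, not an established API. A `[` method is included because subsetting a plain factor keeps the class but would drop the extra attribute:

```r
# Assumed lookup: names are the color labels, values are the database IDs.
colorIds <- c(Black = 1L, Red = 1805L, White = 3704L)

# Hypothetical constructor: a factor subclass that carries the database IDs
# in a "codes" attribute (one ID per level, in level order).
foo <- function(x, ids) {
  f <- factor(x, levels = names(ids))
  attr(f, "codes") <- unname(ids)
  class(f) <- c("foo", class(f))
  f
}

# Hypothetical accessor: the database ID of each element.
dbIds <- function(x, ...) UseMethod("dbIds")
dbIds.foo <- function(x, ...) attr(x, "codes")[as.integer(x)]

# Keep the "codes" attribute when subsetting; the factor method alone
# restores only the levels and the class.
`[.foo` <- function(x, ...) {
  y <- NextMethod("[")
  attr(y, "codes") <- attr(x, "codes")
  y
}

x <- foo(c("Black", "White", "Red"), colorIds)
dbIds(x)        # 1 3704 1805
dbIds(x[2:3])   # 3704 1805
```

Other generics that rebuild factors (c, droplevels, and so on) would need similar methods if the attribute has to survive them.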