Many data mining algorithms (support vector machines, for example) represent data records as vectors so that the data can be treated spatially.
My trouble comes from how to represent non-numerical features within the dataset. My first thought was to ‘alias’ each possible value for a feature with a number from 1 to n (where n is the number of possible values for that feature).
While doing some research I came across a suggestion for features that have a small number of possible values: use a bit string of length n, where each bit represents a different value and only the one bit corresponding to the stored value is set. I can see how this could theoretically save memory for features that have fewer possible values than the number of bits used to store an integer on your target system. The data set I’m working with, though, has many different values for various features, so I don’t think that solution will help me at all.
What are some of the accepted methods of representing these values in vectors and when is each strategy the best choice?
So there’s a convention to do this. It’s much easier to show by example than to explain.
Suppose you have collected, from your web analytics app, four metrics describing each visitor to a web site:
sex/gender
acquisition channel
forum participation level
account type
Each of these is a categorical variable (aka factor) rather than a continuous variable (e.g., total session time, or account age).
The convention is to expand each categorical variable into one 0/1 column per possible value, so a visitor's acquisition channel becomes three columns with exactly one of them set to 1. This transformation technique seems more sensible to me than keeping each variable as a single column. If you do the latter, you have to reconcile the three types of acquisition channels with their numerical representation: if organic search is a “1”, should affiliate be a 2 and direct_typein a 3, or vice versa?
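To illustrate the ordering problem, here is a minimal sketch. The integer codes are an arbitrary assignment made up for this example; the point is only that distances between integer codes depend on the numbering, while distances between one-column-per-value vectors do not.

```python
import numpy as np

# Hypothetical integer aliases for the three acquisition channels.
codes = {"organic_search": 1, "affiliate": 2, "direct_typein": 3}

# With integer codes, the distance between two channels depends on
# the arbitrary numbering chosen above.
d_search_affiliate = abs(codes["organic_search"] - codes["affiliate"])
d_search_direct = abs(codes["organic_search"] - codes["direct_typein"])

# With one column per value, row i of the identity matrix is the
# vector for channel i.
one_hot = np.eye(3, dtype=int)

# Pairwise Euclidean distances between the three channel vectors.
dist = np.linalg.norm(one_hot[:, None, :] - one_hot[None, :, :], axis=2)

print(d_search_affiliate, d_search_direct)
print(dist)
```

With the integer codes, organic search ends up "closer" to affiliate than to direct_typein for no reason other than the numbering. In the expanded form every pair of distinct channels is the same distance apart.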
Another significant advantage of this representation is that it is, despite the width expansion, a compact representation of the data. (Where the column expansion is substantial, e.g., one field is user state, so 1 column becomes 50, a sparse matrix representation is obviously a good idea.)
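As a rough sketch of the sparse option, assuming SciPy's `scipy.sparse.csr_matrix` and a made-up column of state indices (0 to 49) for 1000 visitors, only the single non-zero entry per row needs to be stored:

```python
import numpy as np
from scipy import sparse

n_visitors, n_states = 1000, 50

# Made-up data: one state index per visitor (an assumption for this sketch).
rng = np.random.default_rng(0)
state_idx = rng.integers(0, n_states, size=n_visitors)

# One non-zero entry per row: (row i, column state_idx[i]) = 1.
rows = np.arange(n_visitors)
data = np.ones(n_visitors, dtype=np.int8)
X_state = sparse.csr_matrix((data, (rows, state_idx)),
                            shape=(n_visitors, n_states))

print(X_state.shape)   # logical size of the expanded block
print(X_state.nnz)     # number of entries actually stored
```

The matrix behaves like a 1000 x 50 array but stores only one value per visitor, so the 50-column expansion costs about as much as the original single column.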
For this type of work I use the numerical computation libraries NumPy and SciPy.
From the Python interactive prompt:
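Something along these lines, as a minimal sketch. The value lists for each variable and the `expand` helper are assumptions made up for illustration (only the three acquisition channels are named above); substitute whatever values your data actually has.

```python
import numpy as np

# Hypothetical value lists for the four categorical variables.
categories = {
    "sex": ["male", "female"],
    "acquisition_channel": ["organic_search", "affiliate", "direct_typein"],
    "forum_participation": ["none", "low", "high"],
    "account_type": ["free", "paid"],
}

def expand(record):
    """Return a 0/1 vector with one slot per (variable, value) pair."""
    parts = []
    for name, values in categories.items():
        block = np.zeros(len(values), dtype=int)
        block[values.index(record[name])] = 1   # set the slot for this value
        parts.append(block)
    return np.concatenate(parts)

visitor = {
    "sex": "female",
    "acquisition_channel": "affiliate",
    "forum_participation": "high",
    "account_type": "free",
}
vec = expand(visitor)
print(vec)
```

Four original columns become ten (2 + 3 + 3 + 2), and each visitor's vector has exactly four slots set, one per variable.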
For similarity-based computation (e.g., k-Nearest Neighbors), there is a particular metric used for expanded data vectors composed of categorical variables, called the Tanimoto Coefficient. For the particular representation I have used here, the function would look like this:
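One way to write it, as a sketch rather than a definitive implementation: for 0/1 vectors the coefficient is the number of shared set slots divided by the number of slots set in either vector, i.e. `a.b / (a.a + b.b - a.b)`. The function name `tanimoto` and the choice to return 0.0 when both vectors are all zeros are assumptions of this example.

```python
import numpy as np

def tanimoto(a, b):
    """Tanimoto coefficient of two 0/1 vectors of equal length."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    shared = np.dot(a, b)                       # slots set in both vectors
    denom = np.dot(a, a) + np.dot(b, b) - shared  # slots set in either vector
    # Assumption: treat two all-zero vectors as having similarity 0.
    return shared / denom if denom else 0.0

# Two hypothetical expanded visitor vectors (10 slots each).
v1 = [0, 1, 0, 1, 0, 0, 0, 1, 1, 0]
v2 = [0, 1, 1, 0, 0, 0, 0, 1, 1, 0]

print(tanimoto(v1, v2))
print(tanimoto(v1, v1))
```

Here the two visitors agree on three of their four variables, so three slots are shared and five are set in at least one vector. Identical vectors score 1, and vectors with no shared slots score 0.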