Many data mining algorithms/strategies use vector representation of data records in order to simulate

Asked: May 23, 2026

Many data mining algorithms/strategies use vector representation of data records in order to simulate a spatial representation of the data (like support vector machines).

My trouble comes from how to represent non-numerical features within the dataset. My first thought was to ‘alias’ each possible value for a feature with a number from 1 to n (where n is the number of distinct values that feature can take).

While doing some research I came across a suggestion for features that have a small number of possible values: use a bit string of length n, where each bit represents a different value and only the one bit corresponding to the stored value is set. I can see how this could theoretically save memory for features that have fewer possible values than the number of bits used to store an integer on the target system. However, the data set I’m working with has many different values for various features, so I don’t think that solution will help me at all.

What are some of the accepted methods of representing these values in vectors and when is each strategy the best choice?

1 Answer

Editorial Team · Answer 1 · 2026-05-23T20:07:16+00:00

There is a convention for this, and it is much easier to show by example than to explain.

Suppose you have collected, from your web analytics app, four metrics describing each visitor to a web site:

sex/gender
acquisition channel
forum participation level
account type

Each of these is a categorical variable (aka factor) rather than a continuous variable (e.g., total session time, or account age).

# column headers of raw data--all fields are categorical ('factors')
col_headers = ['sex', 'acquisition_channel', 'forum_participation_level', 'account_type']

# a single data row represents one user
row1 = ['M', 'organic_search', 'moderator', 'premium_subscriber']

# expand data matrix width-wise by adding new fields (columns) for each factor level:
input_fields = [ 'male', 'female', 'new', 'trusted', 'active_participant', 'moderator', 
                 'direct_typein', 'organic_search', 'affiliate', 'premium_subscriber',   
                 'regular_subscriber',  'unregistered_user' ]

# now, original 'row1' above, becomes (for input to ML algorithm, etc.)
warehoused_row1 = [1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0]
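
The expansion above can be automated. Here is a minimal sketch in plain Python, not a definitive implementation. The names `LEVELS` and `one_hot_row` are hypothetical, introduced only for illustration, and treating 'M'/'F' as the raw codes behind the 'male'/'female' columns is an assumption. The columns are listed in the same order as `input_fields`, so the result lines up with the hand-built vector.

```python
# Hypothetical level table: one entry per categorical column, listed in the
# same order as `input_fields`. 'M'/'F' as raw codes for sex is an assumption.
LEVELS = [
    ('sex', ['M', 'F']),
    ('forum_participation_level',
     ['new', 'trusted', 'active_participant', 'moderator']),
    ('acquisition_channel', ['direct_typein', 'organic_search', 'affiliate']),
    ('account_type',
     ['premium_subscriber', 'regular_subscriber', 'unregistered_user']),
]


def one_hot_row(record):
    """Expand a dict of categorical values into a flat list of 0/1 flags."""
    encoded = []
    for column, levels in LEVELS:
        value = record[column]
        if value not in levels:
            # Fail loudly rather than silently emitting an all-zero group.
            raise ValueError('unknown level %r for column %r' % (value, column))
        encoded.extend(1 if value == level else 0 for level in levels)
    return encoded


col_headers = ['sex', 'acquisition_channel',
               'forum_participation_level', 'account_type']
row1 = ['M', 'organic_search', 'moderator', 'premium_subscriber']

encoded_row1 = one_hot_row(dict(zip(col_headers, row1)))
print(encoded_row1)
```

Each categorical column contributes exactly one 1, so a row built from four columns always sums to 4.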

This transformation technique seems more sensible to me than keeping each variable as a single column. For instance, if you do the latter, you have to reconcile the three acquisition channels with their numerical representation: if organic search is a “1”, should affiliate be a 2 and direct_typein a 3, or vice versa? Any such numbering implies an ordering and a spacing between the levels that the data does not actually have.
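
To make the numbering problem concrete, the sketch below compares Euclidean distances between the three acquisition channels under an arbitrary integer coding and under the expanded 0/1 coding. The integer assignments and the helper name `euclidean` are assumptions chosen for illustration.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical integer aliases for the three acquisition channels.
int_code = {'organic_search': 1, 'affiliate': 2, 'direct_typein': 3}

# The same three levels expanded into one column per level.
one_hot = {
    'organic_search': [1, 0, 0],
    'affiliate': [0, 1, 0],
    'direct_typein': [0, 0, 1],
}

pairs = [
    ('organic_search', 'affiliate'),
    ('organic_search', 'direct_typein'),
    ('affiliate', 'direct_typein'),
]

# With integer aliases the distances depend on the arbitrary numbering.
int_dists = [euclidean([int_code[a]], [int_code[b]]) for a, b in pairs]

# With expanded columns every pair of distinct levels is equally far apart.
hot_dists = [euclidean(one_hot[a], one_hot[b]) for a, b in pairs]

print(int_dists)
print(hot_dists)
```

Under the integer coding, organic search ends up twice as far from direct type-in as from affiliate, purely because of the numbers picked. The expanded coding treats all three levels symmetrically.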

Another significant advantage of this representation is that, despite the width expansion, it is a compact representation of the data. (Where the column expansion is substantial, e.g., one field is the user's state, so 1 column becomes 50, a sparse matrix representation is a good idea.)
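
The idea behind a sparse representation can be sketched in plain Python: store only the indices of the columns that are set, plus the total width. The helpers `to_sparse` and `to_dense` are hypothetical names for illustration. In practice the `scipy.sparse` module provides compressed matrix formats built on the same idea.

```python
def to_sparse(dense_row):
    """Keep only the indices of the columns whose value is 1."""
    return frozenset(i for i, bit in enumerate(dense_row) if bit)


def to_dense(active, width):
    """Rebuild the full 0/1 list from the set of active column indices."""
    return [1 if i in active else 0 for i in range(width)]


dense = [1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0]
sparse = to_sparse(dense)
restored = to_dense(sparse, len(dense))

print(sorted(sparse))     # one index per categorical column
print(restored == dense)  # the round trip loses nothing
```

However wide the expanded row gets, the sparse form holds one entry per original categorical column, which is why it pays off when a single field expands into dozens of columns.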

For this type of work I use the numerical computation libraries NumPy and SciPy.

From the Python interactive prompt:

>>> import numpy as NP

>>> # create two data rows, representing two unique visitors to a Web site:

>>> row1 = NP.array([0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0])

>>> row2 = NP.array([1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0])

>>> row1.dtype
  dtype('int64')
>>> row1.itemsize
  8

>>> # these two data arrays can be converted from int/float to boolean, substantially 
>>> # reducing their size w/ concomitant performance improvement
>>> row1 = NP.array(row1, dtype=bool)
>>> row2 = NP.array(row2, dtype=bool)

>>> row1.dtype
  dtype('bool')
>>> row1.itemsize    # compare with row1.itemsize = 8, above
  1

>>> # element-wise comparison of two data vectors (two users) is straightforward:
>>> row1 == row2  # True wherever both rows hold the same value (including 0 == 0)
  array([False, False,  True,  True, False, False, False, False,  True,
         False, False,  True])
>>> NP.sum(row1 == row2)  # number of matching positions (NumPy 2.x displays np.int64(4))
  4

For similarity-based computation (e.g., k-Nearest Neighbors), a metric commonly used for expanded data vectors composed of categorical variables is the Tanimoto coefficient (also known as the Jaccard index). For the particular representation I have used here, the function would look like this:

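def tanimoto_bool(A, B):
    # A and B are boolean NumPy arrays of equal length
    both = NP.sum(A & B)      # columns set in both vectors (intersection)
    either = NP.sum(A | B)    # columns set in at least one vector (union)
    if either == 0:           # two all-False vectors: define similarity as 0
        return 0.0
    return float(both) / float(either)

>>> tanimoto_bool(row1, row2)   # these two users share no active columns
  0.0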
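
For reference, the usual definition of the Tanimoto coefficient for two binary vectors counts only the positions set to 1, not the vector length. Writing $a_i, b_i \in \{0, 1\}$ for the entries of $A$ and $B$ (notation introduced here for illustration):

```latex
T(A, B) = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}
        = \frac{\sum_i a_i b_i}{\sum_i a_i + \sum_i b_i - \sum_i a_i b_i}
```

As a hypothetical worked case, two users with four active columns each who share two of them score $2 / (4 + 4 - 2) = 1/3$. Users with no active column in common score 0, and identical users score 1.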
