The Archive Base

Editorial Team
Asked: May 23, 2026

Many data mining algorithms/strategies use vector representation of data records in order to simulate a spatial representation of the data (like support vector machines).

My trouble comes from how to represent non-numerical features within the dataset. My first thought was to ‘alias’ each possible value of a feature with a number from 1 to n (where n is the number of possible values for that feature).

While doing some research I came across a suggestion that, for features with a small number of possible values, you should use a bit string of length n in which each bit represents a different value and only the bit corresponding to the stored value is set. I can see how this could theoretically save memory for features that have fewer possible values than the number of bits in an integer on the target system, but the data set I’m working with has many different values for various features, so I don’t think that solution will help me at all.

What are some of the accepted methods of representing these values in vectors and when is each strategy the best choice?

1 Answer

  1. Editorial Team
    Added an answer on May 23, 2026 at 8:07 pm

    So there’s a convention to do this. It’s much easier to show by example than to explain.

    Suppose you have collected, from your web analytics app, four metrics describing each visitor to a web site:

    1. sex/gender

    2. acquisition channel

    3. forum participation level

    4. account type

    Each of these is a categorical variable (aka factor) rather than a continuous variable (e.g., total session time, or account age).

    # column headers of raw data--all fields are categorical ('factors')
    col_headers = ['sex', 'acquisition_channel', 'forum_participation_level', 'account_type']
    
    # a single data row represents one user
    row1 = ['M', 'organic_search', 'moderator', 'premium_subscriber']
    
    # expand data matrix width-wise by adding new fields (columns) for each factor level:
    input_fields = [ 'male', 'female', 'new', 'trusted', 'active_participant', 'moderator', 
                     'direct_typein', 'organic_search', 'affiliate', 'premium_subscriber',   
                     'regular_subscriber',  'unregistered_user' ]
    
    # now, original 'row1' above, becomes (for input to ML algorithm, etc.)
    warehoused_row1 = [1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0]
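
The expansion above can be sketched as a small helper. This is only an illustration, not the answer's own code: the name `one_hot_encode` and the `levels` table are assumptions, and the table simply restates the column order of `input_fields` (with 'M'/'F' standing in for the 'male'/'female' columns).

```python
# Hypothetical helper: expand one raw row of categorical values into the
# 0/1 layout of `input_fields`.
col_headers = ['sex', 'acquisition_channel', 'forum_participation_level', 'account_type']

# Output blocks follow the column order of `input_fields`:
# sex, forum participation, acquisition channel, account type.
levels = [
    ('sex', ['M', 'F']),
    ('forum_participation_level', ['new', 'trusted', 'active_participant', 'moderator']),
    ('acquisition_channel', ['direct_typein', 'organic_search', 'affiliate']),
    ('account_type', ['premium_subscriber', 'regular_subscriber', 'unregistered_user']),
]

def one_hot_encode(row, headers=col_headers, factor_levels=levels):
    """Return a flat list of 0/1 flags, one per factor level."""
    raw = dict(zip(headers, row))  # e.g. {'sex': 'M', ...}
    encoded = []
    for factor, values in factor_levels:
        if raw[factor] not in values:
            raise ValueError('unknown level %r for %s' % (raw[factor], factor))
        # exactly one flag per factor is set to 1
        encoded.extend(1 if raw[factor] == v else 0 for v in values)
    return encoded

row1 = ['M', 'organic_search', 'moderator', 'premium_subscriber']
print(one_hot_encode(row1))  # [1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0]
```

Rejecting unknown levels up front is a design choice made for this sketch; another option would be a dedicated "other" column per factor.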
    

    This transformation seems more sensible to me than keeping each variable as a single column. If you do the latter, you have to map the three acquisition channels to numbers, and any choice is arbitrary: if organic search is 1, should affiliate be 2 and direct_typein 3, or vice versa? Whichever you pick implies an ordering and a spacing between channels that the data does not have.
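
To make the numbering problem concrete, here is a small sketch comparing distances under the two encodings. The integer codes 1, 2, 3 are an arbitrary assignment chosen for illustration, and the helper names are hypothetical.

```python
import math

# Arbitrary integer codes for the three acquisition channels (illustrative only).
int_code = {'organic_search': 1, 'affiliate': 2, 'direct_typein': 3}

# One column per level, as in the expanded representation.
one_hot = {
    'organic_search': (1, 0, 0),
    'affiliate':      (0, 1, 0),
    'direct_typein':  (0, 0, 1),
}

def int_distance(a, b):
    return abs(int_code[a] - int_code[b])

def one_hot_distance(a, b):
    return math.dist(one_hot[a], one_hot[b])  # Euclidean distance

# Integer codes make organic_search look twice as far from direct_typein
# as from affiliate, purely because of how the labels were numbered.
print(int_distance('organic_search', 'affiliate'))      # 1
print(int_distance('organic_search', 'direct_typein'))  # 2

# With one column per level, every pair of distinct channels is equally far apart.
print(one_hot_distance('organic_search', 'affiliate'))
print(one_hot_distance('organic_search', 'direct_typein'))
```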

    Despite the width expansion, this representation is also compact, because every entry is a 0/1 flag that can be stored as a boolean rather than a full integer. (When the column expansion is substantial, e.g. a user-state field that turns 1 column into 50, a sparse matrix representation is a good idea.)
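
A rough sketch of the sparse idea, assuming a hypothetical 50-level "user state" field and SciPy's `csr_matrix`. The state codes for the four visitors are made up for illustration.

```python
import numpy as NP
from scipy import sparse

n_levels = 50                            # hypothetical: one column per user state
state_index = NP.array([4, 31, 4, 49])   # made-up state codes for four visitors

n_rows = len(state_index)
rows = NP.arange(n_rows)
data = NP.ones(n_rows, dtype=NP.int8)

# Store only the single nonzero entry per row instead of all 50 columns.
X = sparse.csr_matrix((data, (rows, state_index)), shape=(n_rows, n_levels))

print(X.shape)  # (4, 50)
print(X.nnz)    # 4 stored values rather than 200

# Convert back to a dense array only when an algorithm requires it.
dense = X.toarray()
```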

    For this type of work I use the numerical computation libraries NumPy and SciPy.

    From the Python interactive prompt:

    >>> import numpy as NP
    >>> # create two data rows, representing two unique visitors to a Web site:
    >>> row1 = NP.array([0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0])
    >>> row2 = NP.array([1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0])
    
    >>> row1.dtype       # default integer type; int64 on most 64-bit platforms
      dtype('int64')
    >>> row1.itemsize    # bytes per element
      8
    
    >>> # these two data arrays can be converted from int/float to boolean, substantially
    >>> # reducing their size w/ concomitant performance improvement
    >>> row1 = NP.array(row1, dtype=bool)
    >>> row2 = NP.array(row2, dtype=bool)
    
    >>> row1.dtype
      dtype('bool')
    >>> row1.itemsize    # compare with row1.itemsize = 8, above
      1
    
    >>> # element-wise comparison of two data vectors (two users) is straightforward:
    >>> row1 == row2
      array([False, False,  True,  True, False, False, False, False,  True,
             False, False,  True])
    >>> int(NP.sum(row1 == row2))    # number of positions where the two vectors agree
      4
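
If memory is the main concern, the boolean vector can be packed further, to one bit per flag, which is close to the bit-string idea raised in the question. The sketch below uses NumPy's `packbits` and `unpackbits`; whether the saving is worth the extra unpacking step depends on the workload, so treat it as an option rather than a recommendation.

```python
import numpy as NP

row1 = NP.array([0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0], dtype=bool)

print(row1.nbytes)          # 12 bytes: one byte per flag

packed = NP.packbits(row1)  # 8 flags per byte, zero-padded at the end
print(packed.nbytes)        # 2 bytes

# Unpack and drop the padding bits to recover the original vector.
restored = NP.unpackbits(packed)[:row1.size].astype(bool)
print(NP.array_equal(restored, row1))  # True
```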
    

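    For similarity-based computation (e.g., k-Nearest Neighbors), a metric commonly used with expanded data vectors built from categorical variables is the Tanimoto coefficient: the number of positions set in both vectors divided by the number of positions set in at least one of them. For the representation I have used here, the function would look like this:

    def tanimoto_bool(A, B):
        # A and B are boolean vectors of equal length
        AnB = NP.sum(A & B)                    # positions set in both vectors
        denom = NP.sum(A) + NP.sum(B) - AnB    # positions set in at least one vector
        return float(AnB) / float(denom)
    
    >>> tanimoto_bool(row1, row2)    # the two visitors share no factor levels
      0.0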
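
As a rough sketch of how such a coefficient could drive a nearest-neighbour lookup, the block below uses the standard Tanimoto (Jaccard) definition for boolean vectors: shared set positions divided by positions set in either vector. The helper names and the third visitor are made up for illustration.

```python
import numpy as NP

def tanimoto(a, b):
    """Shared True positions divided by positions True in either vector."""
    either = NP.sum(a | b)
    if either == 0:
        return 0.0  # convention chosen here for two all-False vectors
    return float(NP.sum(a & b)) / float(either)

def most_similar(query, candidates):
    """Return (index, score) of the candidate with the highest coefficient."""
    scores = [tanimoto(query, c) for c in candidates]
    best = int(NP.argmax(scores))
    return best, scores[best]

row1 = NP.array([0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0], dtype=bool)
row2 = NP.array([1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0], dtype=bool)
# Hypothetical third visitor: differs from row1 only in account type.
row3 = NP.array([0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0], dtype=bool)

print(most_similar(row1, [row2, row3]))  # (1, 0.6)
```

Because every expanded row here has exactly four flags set (one per factor), the coefficient depends only on how many factor levels two visitors share: with c shared levels it equals c / (8 - c).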
    