Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 6922991
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: May 27, 20262026-05-27T10:28:42+00:00 2026-05-27T10:28:42+00:00

I have a dataset with a column that is currently being treated as a

  • 0

I have a dataset with a column that is currently being treated as a factor with 1000+ levels. These are values for the column. I would like to clean up this data.
Some values are strings like “-18 + 5 = -13” and “5 – 18 = -13”, I would like the clustering to group these differently than say “R3no4”.

Is this possible in R? I looked at the natural language processing task view http://cran.r-project.org/web/views/NaturalLanguageProcessing.html but I need to be pushed in the right direction.

the dataset is from the kdd 2010 cup
I would like to create meaningful new columns from this column to aid in creating a predictive model. for example it would be nice to know if the string contains a certain operation, or if it contains no operations and instead is describing the problem.

my data frame looks like this:

str(data1)
'data.frame':   809694 obs. of  19 variables:
 $ Row                        : int  1 2 3 4 5 6 7 8 9 10 ...
 $ Anon.Student.Id            : Factor w/ 574 levels "02i5jCrfQK","02ZjVTxC34",..: 7 7     7 7 7 7 7 7 7 7 ...
 $ Problem.Hierarchy          : Factor w/ 138 levels "Unit CTA1_01, Section CTA1_01-1",..: 80 80 80 80 80 80 80 80 80 80 ...
 $ Problem.Name               : Factor w/ 1084 levels "1PTB02","1PTB03",..: 377 377 378 378 378 378 378 378 378 378 ...
 $ Problem.View               : int  1 1 1 1 2 2 3 3 4 4 ...
 $ Step.Name                  : Factor w/ 187539 levels "-(-0.24444444-y) = -0.93333333",..: 116742 177541 104443 64186 58776 58892 153246 153078 45114 163923 ...

I’m most interested in the Step.Name feature, since it contains the greatest number of unique factor values.

and some example values for step name:

[97170] (1+7)/4 = x                                                               
[97171] (1-sqrt(1^2-4*2*-6))/4 = x                                                
[97172] (1-sqrt(1^2-(-48)))/4 = x                                                 
[97173] (1-sqrt(1-(-48)))/4 = x                                                   
[97174] (1-sqrt(49))/4 = x                                                        
[97175] (1-7)/4 = x                                                               
[97176] x^2+15x+44 = 0                                                            
[97177] a-factor-node                                                             
[97178] b-factor-node                                                             
[97179] c-factor-node                                                             
[97180] num1-factor-node                                                          
[97181] num2-factor-node                                                          
[97182] den1-factor-node                                                          
[97183] (-15?sqrt((-15)^2-4*1*44))/2 = x                                          
[97184] (-15+sqrt((-15)^2-4*1*44))/2 = x                                          
[97185] (-15+sqrt((-15)^2-176))/2 = x                                             
[97186] (-15+sqrt(225-176))/2 = x                                                 
[97187] (-15+sqrt(49))/2 = x                                                      
[97188] (-15+7)/2 = x                                                             
[97189] (-15-sqrt((-15)^2-4*1*44))/2 = x                                          
[97190] (-15-sqrt((-15)^2-176))/2 = x                                             
[97191] (-15-sqrt(225-176))/2 = x                                                 
[97192] (-15-sqrt(49))/2 = x                                                      
[97193] (-15-7)/2 = x                                                             
[97194] 2x^2+x = 0                                                                
[97195] a-factor-node                                                             
[97196] b-factor-node                                                             
[97197] c-factor-node                                                             
[97198] num1-factor-node                                                          
[97199] num2-factor-node                                                          
[97200] den1-factor-node                                                          
[97201] (-1?sqrt((-1)^2-4*2*0))/4 = x                                             
[97202] (-1+sqrt((-1)^2-4*2*0))/4 = x                                             
[97203] (-1+sqrt((-1)^2-0))/4 = x                                                 
[97204] (-1+sqrt((-1)^2))/4 = x                                                   
[97205] (-1+1)/4 = x                                                              
[97206] (-1-sqrt((-1)^2-4*2*0))/4 = x                                             
[97207] (-1-sqrt((-1)^2-0))/4 = x                                                 
[97208] (-1-sqrt((-1)^2))/4 = x                                                   
[97209] (-1-1)/4 = x                                                              
[97210] x^2-6x = 0                                                                
[97211] a-factor-node                                                             
[97212] b-factor-node                                                                
  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-05-27T10:28:42+00:00Added an answer on May 27, 2026 at 10:28 am

    Clustering is just scoring each instance in a data array according to some metric, sorting the data array according to this calculated score, then slicing into some number of segments, assigning a label each one.

    In other words, you can cluster any data for which you can formulate some meaningful function to calculate similarity of each data point w/r/t the others; this is usually referred to as a similarity metric.

    There are a lot of these, but only a small subset of them are useful to evaluate strings. Of these, perhaps the most commonly used is Levenshtein Distance (aka Edit Distance).

    This metric is expressed as an integer, and it increments one unit (+1) for each ‘edit’–inserting, deleting, or changing a letter–required to transform one word into another. Summing those individual edits (one for each letter) gives you the Levenshtein Distance.

    The R Package vwr includes an implementation:

    > library(vwr)
    > levenshtein.distance('cat', 'hat')
        hat 
        1 
    > levenshtein.distance('cat', 'catwalk')
        catwalk 
        4 
    > levenshtein.distance('catwalk', 'sidewalk')
        sidewalk 
        4
    
    > # using a data set supplied with the vmr library 
    > EW = english.words
    > ew1 = sample(EW, 20)     # random select 20 words from EW
    > # the second argument is a vector of words, returns a vector of distances
    > dx = levenshtein.distance('cat', ew1)
    > dx
    furriers       graves      crooned    cursively       gabled   caparisons   drainpipes 
        8            5            6            8            5            8            9 
    patricians     medially     beholder   chirpiness    fluttered     bobolink   lamentably 
        8            7            8            9            8            8            8 
    depredations      alights    unearthed     thimbles    supersede   dissembler 
        10            6            7            8            9           10
    

    While Levenshtein Distance can be used to cluster your data, whether it should be used for your data is a question i’ll leave to you (i.e., the primary use case for L/D is clearly pure text data).

    (Perhaps the next-most-common similarity metric that operates on strings is Hamming Distance. Hamming Distance (unlike Levenshtein) requires that the two strings be of equal length, hence it won’t work for your data.)

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

I currently have a dataset which has two columns that I'd like to compare.
I have a DataSet that contains a few columns. One of these columns is
I have a DataSet which I get a DataTable from that I am being
I have a dataset that looks like this: 0 _ _ 23.0186E-03 10 _
I have a report with a dataset that has a column with booleans. In
I have a DataGridView that's bound to a DataSet. It has columns DateCreated, Weight,
I have a textbox with the Text property bound to a dataset column with
I have saved a dataset in the sql database in an xml column using
Previously I've asked about inserting a column into a dataset . I now have
I have a dataset that I have modified into an xml document and then

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.