Editorial Team
Asked: June 8, 2026

I need to run a clustering algorithm on a set of consumer data but

I need to run a clustering algorithm on a set of consumer data, but I am not sure how to handle text-based fields (a perfect example being alphanumeric postcodes, such as SE1 8XR).

Apparently what I need is a string kernel. I understand the basic idea, but not well enough to implement one successfully.

Ideally, the new numeric vector would encode the idea that the more dissimilar two postcodes are, the further apart their data points lie. I have no idea how to do this, and I have not been able to find a single useful tutorial, guide or textbook!

Also, I am doing this in Python, in case anyone knows of a library that may be useful.

1 Answer

  1. Editorial Team
    Added an answer on June 8, 2026 at 6:08 am

    Re: postcodes

    You can’t compare postcodes as strings. ‘AL1 1AA’ is St. Albans, and ‘AB1 1AA’ is Aberdeen. They’re remarkably close in edit distance, yet CR6 7DX is geographically much closer to St. Albans 🙂
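To make the edit-distance point concrete, here is a minimal sketch of a plain Levenshtein distance. The function name `levenshtein` is my own choice for illustration, not a library call. It shows that the St. Albans and Aberdeen postcodes differ by a single character, while the geographically nearer CR6 7DX differs in almost every position.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


st_albans_vs_aberdeen = levenshtein("AL1 1AA", "AB1 1AA")
st_albans_vs_cr6 = levenshtein("AL1 1AA", "CR6 7DX")
print(st_albans_vs_aberdeen)  # 1
print(st_albans_vs_cr6)       # 6
```

So a clustering built on raw string distance would put St. Albans next to Aberdeen, which is the opposite of what you want.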

    Your best approach is probably to get a lookup table that takes a postcode, or at least a sector (‘AL1 1’) or even a district (‘AL1’), and maps it to lat/lon, which you can then use to bucket the data. I know you can buy these from Royal Mail, and there may be one at http://www.ordnancesurvey.co.uk/oswebsite/products/os-opendata.html.
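A rough sketch of the lookup idea, under some loud assumptions: `DISTRICT_COORDS` is a hypothetical three-entry stand-in for a real lookup table, its coordinates are approximate values I filled in for illustration rather than data from Royal Mail or Ordnance Survey, and the postcodes are assumed to contain a space between the outward and inward parts. The helper names are mine as well.

```python
import math

# Hypothetical lookup: postcode district -> (lat, lon) in degrees.
# Approximate, illustrative values only; a real table would replace this.
DISTRICT_COORDS = {
    "AL1": (51.75, -0.33),  # St. Albans area
    "AB1": (57.15, -2.10),  # Aberdeen area
    "CR6": (51.31, -0.06),  # south of London
}


def district(postcode: str) -> str:
    """Return the outward part of a postcode, e.g. 'AL1 1AA' -> 'AL1'."""
    return postcode.strip().upper().split()[0]


def haversine_km(p, q) -> float:
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def postcode_distance_km(a: str, b: str) -> float:
    """Distance between the districts of two postcodes, via the lookup table."""
    return haversine_km(DISTRICT_COORDS[district(a)], DISTRICT_COORDS[district(b)])


near = postcode_distance_km("AL1 1AA", "CR6 7DX")
far = postcode_distance_km("AL1 1AA", "AB1 1AA")
print(near < far)  # True
```

Once each record carries a lat/lon pair, those two numbers can go straight into the clustering as ordinary numeric features, or be rounded into grid cells if you prefer buckets.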

    Other strings

    One option is difflib.SequenceMatcher, whose ratio() method returns a score between 0 and 1 for how similar two strings are (there are plenty of other algorithms out there: google “genetic string algorithms”, “fuzzy string matching”, “string similarity algorithms”, etc.). Group all strings that are, say, 80% similar to each other, assign each group an identifier, then cluster on that identifier.
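One way the grouping step might look, as a sketch rather than a recommendation: `group_similar` is a hypothetical helper that greedily compares each string against the first member of every existing group, using `SequenceMatcher.ratio()` from the standard library. The 0.8 threshold mirrors the 80% figure above, and the shop names are made-up sample data.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Similarity score in [0, 1] from difflib's SequenceMatcher."""
    return SequenceMatcher(None, a, b).ratio()


def group_similar(strings, threshold=0.8):
    """Greedily label each string with the first group whose representative
    (that group's first member) is at least `threshold` similar to it;
    otherwise start a new group. Returns a dict of string -> group id."""
    representatives = []
    labels = {}
    for s in strings:
        for group_id, rep in enumerate(representatives):
            if similarity(s.lower(), rep.lower()) >= threshold:
                labels[s] = group_id
                break
        else:
            # No existing group was close enough, so s founds a new one.
            labels[s] = len(representatives)
            representatives.append(s)
    return labels


names = ["Tesco", "Tesco's", "Sainsbury", "Sainsburys", "Asda"]
labels = group_similar(names)
print(labels)
```

The greedy pass depends on input order and compares only against each group's first member, so borderline strings may land in different groups if the data is shuffled. The resulting group id is a categorical label, so it usually wants one-hot encoding before it goes into a distance-based clusterer.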

    You may also find Metaphone and Double Metaphone (and possibly NLTK) useful, depending on the complexity of your requirements and data.
