Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • Home
  • SEARCH
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 7017791
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: May 27, 20262026-05-27T22:57:38+00:00 2026-05-27T22:57:38+00:00

Problem: I have N (~100k-1m) strings each D (e.g. 2000) characters long and with

  • 0

Problem:

I have N (~100k-1m) strings each D (e.g. 2000) characters long and with a low alphabet (eg 3 possible characters). I would like to sort these strings such that there are as few possible changes between adjacent strings (eg hamming distance is low). Solution doesn’t have to be the best possible but closer the better.

Example

N=4
D=5
//initial strings
1. aaacb
2. bacba
3. acacb
4. cbcba

//sorted so that hamming distance between adjacent strings is low
1. aaacb
3. acacb (Hamming distance 1->3 = 1)
4. cbcba (Hamming distance 3->4 = 4)
2. bacba (Hamming distance 4->2 = 2)

Thoughts about the problem

I have a bad feeling this is a non trivial problem. If we think of each string as a node and the distances to other strings as an edge, then we are looking at a travelling salesman problem. The large number of strings means that calculating all of the pairwise distances beforehand is potentially infeasible, I think turning the problem into some more like the Canadian Traveller Problem.

At the moment my solution has been to use a VP tree to find a greedy nearest neighbour type solution to the problem

curr_string = a randomly chosen string from full set
while(tree not empty)
    found_string = find nearest string in tree
    tree.remove(found_string)
    sorted_list.add(curr_string)
    curr_string = found_string

but initial results appear to be poor. Hashing strings so that more similar ones are closer may be another option but I know little about how good a solution this will provide or how well it will scale to data of this size.

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-05-27T22:57:38+00:00Added an answer on May 27, 2026 at 10:57 pm

    Even if you consider this problem as similar to the travelling salesman problem (TSP), I believe that Hamming distances will follow the triangle inequality (Hamming(A,B) + Hamming(B,C) ≤ Hamming(A,C)), so you’re only really dealing with ∆TSP (the metric travelling salesman problem), for which there are a number of algorithms which give good approximations at an ideal result. In particular, the Christofides algorithm will always give you a path of at most 1.5x the minimum possible length.

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

Problem: I have two spreadsheets that each serve different purposes but contain one particular
Problem -- I have something like the following entries, 1000 of them: args1=msg args2=flow
If I have large number of files (n x 100K individual files) what would
Problem: I have an address field from an Access database which has been converted
Problem I have timestamped data, which I need to search based on the timestamp
Problem: We have a web app that calls some web services asynchronously (from the
Problem: I have to support users who need to edit web pages. Some of
Problem: I have a python script that I have running as a service. It's
Problem : I have a hand held device that scans those graphic color barcodes
Problem: I have a WCF service setup to be an endpoint for a call

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.