The Archive Base

Editorial Team
Asked: May 23, 2026 (2026-05-23T08:17:54+00:00)

Problem:

Given a large (~100 million) list of unsigned 32-bit integers, an unsigned 32-bit integer input value, and a maximum Hamming Distance, return all list members that are within the specified Hamming Distance of the input value.

The data structure used to hold the list is open. Performance requirements dictate an in-memory solution. The cost of building the data structure is secondary; a low cost to query it is critical.

Example:

For a maximum Hamming Distance of 1 (values typically will be quite small)

And input: 
00001000100000000000000001111101

The values:
01001000100000000000000001111101 
00001000100000000010000001111101 

should match because there is only 1 position in which the bits are different.

11001000100000000010000001111101

should not match because 3 bit positions are different.

My thoughts so far:

For the degenerate case of a Hamming Distance of 0, just use a sorted list and do a binary search for the specific input value.

If the Hamming Distance would only ever be 1, I could flip each bit in the original input and repeat the above 32 times.
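As a rough sketch of the two ideas above (binary search for distance 0, plus 32 single-bit flips for distance 1), assuming the list has already been sorted with qsort. The names cmp_u32, contains, and within_distance_1 are made up for this illustration, and duplicates in the list are reported only once:

```c
#include <stdint.h>
#include <stdlib.h>

/* Comparison callback for qsort/bsearch on uint32_t values. */
static int cmp_u32(const void *a, const void *b)
{
    uint32_t x = *(const uint32_t *)a, y = *(const uint32_t *)b;
    return (x > y) - (x < y);
}

/* Returns 1 if key is present in the sorted array, else 0. */
static int contains(const uint32_t *sorted, size_t n, uint32_t key)
{
    return bsearch(&key, sorted, n, sizeof *sorted, cmp_u32) != NULL;
}

/* Writes every distinct member of sorted[] within Hamming distance 1 of
   query into out[] (capacity must be at least 33) and returns the count. */
static size_t within_distance_1(const uint32_t *sorted, size_t n,
                                uint32_t query, uint32_t *out)
{
    size_t count = 0;
    if (contains(sorted, n, query))
        out[count++] = query;                 /* distance 0 */
    for (int bit = 0; bit < 32; bit++) {
        uint32_t candidate = query ^ (UINT32_C(1) << bit);
        if (contains(sorted, n, candidate))
            out[count++] = candidate;         /* distance 1 */
    }
    return count;
}
```

Each query costs 33 binary searches. The same trick for distance 2 would need 1 + 32 + 496 = 529 searches, which is why it stops scaling as the distance grows.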

How can I efficiently (without scanning the entire list) find list members when the maximum Hamming Distance is greater than 1?

1 Answer

  1. Editorial Team
    Added an answer on May 23, 2026 at 8:17 am

    Question: What do we know about the Hamming distance d(x,y)?

    Answer:

    1. It is non-negative: d(x,y) ≥ 0
    2. It is only zero for identical inputs: d(x,y) = 0 ⇔ x = y
    3. It is symmetric: d(x,y) = d(y,x)
    4. It obeys the triangle inequality, d(x,z) ≤ d(x,y) + d(y,z)

    Question: Why do we care?

    Answer: Because these four properties make the Hamming distance a metric, so the integers under it form a metric space. There are algorithms for indexing metric spaces.

    • Metric tree (Wikipedia)
    • BK-tree (Wikipedia)
    • M-tree (Wikipedia)
    • VP-tree (Wikipedia)
    • Cover tree (Wikipedia)

    You can also look up algorithms for “spatial indexing” in general, armed with the knowledge that your space is not Euclidean but it is a metric space. Many books on this subject cover string indexing using a metric such as the Hamming distance.

    Footnote: If you are comparing the Hamming distance of fixed width strings, you may be able to get a significant performance improvement by using assembly or processor intrinsics. For example, with GCC (manual) you do this:

    #include <stdint.h>

    /* Hamming distance: number of bit positions in which x and y differ. */
    static inline int distance(uint32_t x, uint32_t y)
    {
        return __builtin_popcount(x ^ y);
    }
    

    If you then tell GCC that you are compiling for a CPU with the POPCNT instruction (introduced alongside SSE4a on AMD and SSE4.2 on Intel; for example via -mpopcnt), then I believe this should reduce to just a couple of opcodes.

    Edit: According to a number of sources, this is sometimes/often slower than the usual mask/shift/add code. Benchmarking shows that on my system, a C version outperforms GCC’s __builtin_popcount by about 160%.
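For reference, here is one common form of the mask/shift/add bit count mentioned above. It is a sketch of the general technique, not necessarily the exact version that was benchmarked; the names popcount32 and hamming32 are made up for this example.

```c
#include <stdint.h>

/* Portable population count using mask/shift/add steps. */
static inline int popcount32(uint32_t v)
{
    /* Sum adjacent bits into 2-bit fields. */
    v = v - ((v >> 1) & UINT32_C(0x55555555));
    /* Sum adjacent 2-bit fields into 4-bit fields. */
    v = (v & UINT32_C(0x33333333)) + ((v >> 2) & UINT32_C(0x33333333));
    /* Sum adjacent 4-bit fields into bytes. */
    v = (v + (v >> 4)) & UINT32_C(0x0F0F0F0F);
    /* Add the four bytes; the total ends up in the top byte. */
    return (int)((v * UINT32_C(0x01010101)) >> 24);
}

/* Hamming distance built on the portable bit count. */
static inline int hamming32(uint32_t x, uint32_t y)
{
    return popcount32(x ^ y);
}
```

Which variant is faster depends on the compiler flags and CPU, so it is worth benchmarking both on the target machine.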

    Addendum: I was curious about the problem myself, so I profiled three implementations: linear search, BK tree, and VP tree. Note that VP and BK trees are very similar. The children of a node in a BK tree are “shells” of trees containing points that are each a fixed distance from the tree’s center. A node in a VP tree has two children, one containing all the points within a sphere centered on the node’s center and the other child containing all the points outside. So you can think of a VP node as a BK node with two very thick “shells” instead of many finer ones.
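To make the BK tree "shells" concrete, here is a minimal sketch, not the code that was profiled. The type bk_node and the functions bk_insert and bk_query are names invented for this example. It relies on the triangle inequality: if the query is at distance d from a node, any match within radius r must lie in a child shell k with d - r ≤ k ≤ d + r, so the other shells can be skipped.

```c
#include <stdint.h>
#include <stdlib.h>

#define MAX_DIST 32  /* largest Hamming distance between two 32-bit values */

typedef struct bk_node {
    uint32_t value;
    struct bk_node *child[MAX_DIST + 1]; /* child[k]: points at distance exactly k */
} bk_node;

static int hamming(uint32_t x, uint32_t y)
{
    return __builtin_popcount(x ^ y);    /* GCC/Clang builtin */
}

/* Inserts value into the tree rooted at *root. Duplicates are ignored.
   Returns 0 on success, -1 if allocation fails. */
static int bk_insert(bk_node **root, uint32_t value)
{
    bk_node **slot = root;
    while (*slot != NULL) {
        int d = hamming((*slot)->value, value);
        if (d == 0)
            return 0;                    /* already present */
        slot = &(*slot)->child[d];       /* descend into the shell at distance d */
    }
    *slot = calloc(1, sizeof **slot);    /* assumes zeroed memory reads as NULL pointers */
    if (*slot == NULL)
        return -1;
    (*slot)->value = value;
    return 0;
}

/* Copies every stored value within radius of query into out[] (at most cap
   entries) and returns the running match count. Call with found = 0. */
static size_t bk_query(const bk_node *node, uint32_t query, int radius,
                       uint32_t *out, size_t cap, size_t found)
{
    if (node == NULL)
        return found;
    int d = hamming(node->value, query);
    if (d <= radius) {
        if (found < cap)
            out[found] = node->value;
        found++;
    }
    /* Triangle inequality: only shells d-radius .. d+radius can hold matches. */
    int lo = d - radius < 1 ? 1 : d - radius;
    int hi = d + radius > MAX_DIST ? MAX_DIST : d + radius;
    for (int k = lo; k <= hi; k++)
        found = bk_query(node->child[k], query, radius, out, cap, found);
    return found;
}
```

This layout spends 33 pointers per node, which is far too much memory for 100M entries. A real implementation would want a more compact child representation and flat-array leaves.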

    The results were captured on my 3.2 GHz PC, and the algorithms do not attempt to utilize multiple cores (which should be easy). I chose a database size of 100M pseudorandom integers. Results are the average of 1000 queries for distance 1..5, and 100 queries for 6..10 and the linear search.

    • Database: 100M pseudorandom integers
    • Number of tests: 1000 for distance 1..5, 100 for distance 6..10 and linear
    • Results: Average # of query hits (very approximate)
    • Speed: Number of queries per second
    • Coverage: Average percentage of database examined per query
                    -- BK Tree --   -- VP Tree --   -- Linear --
    Dist    Results Speed   Cov     Speed   Cov     Speed   Cov
    1          0.90 3800     0.048% 4200     0.048%
    2         11     300     0.68%   330     0.65%
    3        130      56     3.8%     63     3.4%
    4        970      18    12%       22    10%
    5       5700       8.5  26%       10    22%
    6       2.6e4      5.2  42%        6.0  37%
    7       1.1e5      3.7  60%        4.1  54%
    8       3.5e5      3.0  74%        3.2  70%
    9       1.0e6      2.6  85%        2.7  82%
    10      2.5e6      2.3  91%        2.4  90%
    any                                             2.2     100%
    

    In your comment, you mentioned:

    I think BK-trees could be improved by generating a bunch of BK-trees with different root nodes, and spreading them out.

    I think this is exactly the reason why the VP tree performs (slightly) better than the BK tree. Being “deeper” rather than “shallower”, it compares against more points rather than using finer-grained comparisons against fewer points. I suspect that the differences are more extreme in higher-dimensional spaces.

    A final tip: leaf nodes in the tree should just be flat arrays of integers for a linear scan. For small sets (maybe 1000 points or fewer) this will be faster and more memory efficient.
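A leaf scan of that kind could look roughly like this. The function name leaf_scan and its signature are assumptions for this sketch; the same loop run over the whole database is the linear-search baseline.

```c
#include <stddef.h>
#include <stdint.h>

/* Scans a flat leaf bucket and copies every value within radius of query
   into out[] (at most cap entries). Returns the total number of matches,
   which may exceed cap. */
static size_t leaf_scan(const uint32_t *bucket, size_t n, uint32_t query,
                        int radius, uint32_t *out, size_t cap)
{
    size_t found = 0;
    for (size_t i = 0; i < n; i++) {
        /* GCC/Clang builtin: counts the differing bits. */
        if (__builtin_popcount(bucket[i] ^ query) <= radius) {
            if (found < cap)
                out[found] = bucket[i];
            found++;
        }
    }
    return found;
}
```

A contiguous array keeps the scan sequential in memory and avoids per-node pointer overhead, which is where the speed and memory savings for small sets would come from.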
