Here is the main problem. I have very large database (25,000 or so) of

Question

0

Asked: May 15, 20262026-05-15T19:11:47+00:00 2026-05-15T19:11:47+00:00

Here is the main problem. I have very large database (25,000 or so) of

0

Here is the main problem. I have very large database (25,000 or so) of 48 dimensional vectors, each populated with values ranging from 0-255. The specifics are not so important but I figure it might help give context.

I don’t need a nearest neighbor, so approximate neighbor searches that are within a degree of accuracy are acceptable. I’ve been toying around with Locality Sensitivity Hashing but I’m very very lost.

I’ve written a hash function as described in the article under “Stable Distributions” as best I can. Here is the code.

def lsh(vector, mean, stdev, r = 1.0, a = None, b = None):
 if not a:
  a = [normalvariate(mean, stdev) for i in range(48)]
 if not b:
  b = uniform(0, r)
 hashVal = (sum([a[i]*vectorA[i] for i in range(48)]) + b)/r
 return hashVal

The hashing function is ‘working’ at least some. If I order a list of points by hash value and compute average distance between a point and it’s neighbor in the list, the average distance is about 400, compared to an average distance of about 530 for any two randomly selected points.

My biggest questions are these.

A: Any suggestions on where I can read more about this. My searching hasn’t produced a lot of results.

B: The method suggests it outputs an integer value (which mine does not). And then you’re supposed to try to find matches for this integer value and a match denotes a likely nearest neighbor. I understand I’m supposed to compute some set of tables of hash values for all my points and then check said tables for hash matches, but the values I’m returning don’t seem to be fine enough that I’ll end up with matches at all. More testing is needed on my part.

C: Instructions on how to construct hash functions based on the other hashing methods?

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-05-15T19:11:48+00:00

Maby this is a little off topic but you can try using PCA http://en.wikipedia.org/wiki/Principal_component_analysis for reducing the dimensionality of the dataset. There should be plenty of PCA modules designed for numPy ( for example: http://folk.uio.no/henninri/pca_module/).
The method is rather simple and with a ready to use modules it will be a snap.

Basicly what it does is reduce the number of dimensions ( you should be able to specify desired number) by maximizing variance within the given number of dimensions.

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

Here is the main problem. I have very large database (25,000 or so) of

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply