I am trying to write a function that filters a list of tuples (mimicking an in-memory database) using a “nearest neighbour” or “nearest match” type of algorithm.
I want to know the best (i.e. most Pythonic) way to go about doing this. The sample code below hopefully illustrates what I am trying to do.
datarows = [(10, 2.0, 3.4, 100),
            (11, 2.0, 5.4, 120),
            (17, 12.9, 42, 123)]
filter_record = (9,1.9,2.9,99) # record that we are seeking to retrieve from 'database' (or nearest match)
weights = (1, 1, 1, 1) # weights to apportion to each field in the filter
def get_nearest_neighbour(data, criteria, weights):
    for row in data:
        # calculate 'distance metric' (e.g. simple differencing) and multiply by relevant weight
        # determine the row which was either an exact match or was 'least dissimilar'
        # return the match (or nearest match)
        pass

if __name__ == '__main__':
    result = get_nearest_neighbour(datarows, filter_record, weights)
    print(result)
For the snippet above, the output should be:
(10,2.0,3.4,100)
since it is the ‘nearest’ to the sample data passed to the function get_nearest_neighbour().
My question, then, is: what is the best way to implement get_nearest_neighbour()? For brevity, assume that we are only dealing with numeric values, and that the ‘distance metric’ we use is simply an arithmetic subtraction of the input data from the current row.
Simple out-of-the-box solution:
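A minimal sketch of what such a solution could look like, using only the standard library. It assumes the distance is the weighted sum of absolute per-field differences (absolute values are my assumption, so that positive and negative differences do not cancel), and the helper name weighted_distance is illustrative rather than taken from the question:

```python
def weighted_distance(row, criteria, weights):
    # Assumed metric: sum of absolute per-field differences, each scaled by its weight.
    return sum(w * abs(r - c) for r, c, w in zip(row, criteria, weights))


def get_nearest_neighbour(data, criteria, weights):
    # min() with a key function returns the row with the smallest distance;
    # an exact match has distance 0, so it always wins.
    return min(data, key=lambda row: weighted_distance(row, criteria, weights))


datarows = [(10, 2.0, 3.4, 100),
            (11, 2.0, 5.4, 120),
            (17, 12.9, 42, 123)]
filter_record = (9, 1.9, 2.9, 99)
weights = (1, 1, 1, 1)

result = get_nearest_neighbour(datarows, filter_record, weights)
print(result)  # (10, 2.0, 3.4, 100)
```

This scans every row once, which is fine for small lists. If the fields have very different scales, the weights are where you would compensate for that.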
If you need to work with huge datasets, you should consider switching to NumPy and SciPy and using SciPy’s KDTree (scipy.spatial.KDTree) to find nearest neighbours. The advantage is not only that it uses a more advanced algorithm (a k-d tree does not have to compare the query against every row), but also that the heavy lifting runs in optimized compiled code on top of NumPy arrays rather than in a Python loop.
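As a rough sketch of the k-d tree approach, assuming SciPy and NumPy are installed: the weights are applied by scaling each column before building the tree, and p=1 selects the Manhattan (sum of absolute differences) distance so that the result matches a weighted "simple differencing" metric. This scaling trick assumes non-negative weights.

```python
import numpy as np
from scipy.spatial import KDTree

datarows = [(10, 2.0, 3.4, 100),
            (11, 2.0, 5.4, 120),
            (17, 12.9, 42, 123)]
filter_record = (9, 1.9, 2.9, 99)
weights = (1, 1, 1, 1)

# Scale every column by its weight so that the tree's distance is the weighted one.
w = np.asarray(weights, dtype=float)
points = np.asarray(datarows, dtype=float) * w

# Build the tree once; it can then be queried many times.
tree = KDTree(points)

# query() returns the distance to the nearest point and its index in `points`.
distance, index = tree.query(np.asarray(filter_record, dtype=float) * w, k=1, p=1)

nearest = datarows[index]
print(nearest)  # (10, 2.0, 3.4, 100)
```

Building the tree has an up-front cost, so this pays off mainly when the same dataset is queried repeatedly.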