X is a text file that contains 100000 bit vectors of equal size (500 elements each), i.e. each row is a vector of 500 elements. I am generating an adjacency matrix (100000 x 100000) using the code below, but it's not optimized and is very time-consuming. How can I improve it?
import numpy as np
import scipy.spatial.distance

readFrom = "vector.txt"
fout = open("adjacencymatrix.txt", "a")
X = np.genfromtxt(readFrom, dtype=None)

for outer in range(0, 100000):
    tmp = ""  # start a fresh output line for each row of the matrix
    for inner in range(0, 100000):
        # Euclidean distance between row vectors `outer` and `inner`
        dis = scipy.spatial.distance.euclidean(X[outer], X[inner])
        tmp += str(dis) + " "
    tmp += "\n"
    fout.write(tmp)  # one text line per matrix row
fout.close()
Thank you.
Edit: Completely rewrote this after understanding the question better. Given the size of the data, this one is tricky. The best speedup I have gotten so far is with the following:
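A minimal sketch of a chunked computation of this kind, not necessarily the exact code behind the timing quoted here. It assumes `scipy.spatial.distance.cdist` for the block-wise distances and `np.savetxt` for the output; the function name `write_distance_matrix`, the chunk size, the `%.6g` output format, and the random demo data are all illustrative choices:

```python
import os
import tempfile

import numpy as np
from scipy.spatial.distance import cdist


def write_distance_matrix(X, out_path, chunk_size=1000):
    """Write the pairwise Euclidean distance matrix of X one row block at a time."""
    X = np.asarray(X, dtype=np.float64)
    with open(out_path, "w") as fout:
        for start in range(0, X.shape[0], chunk_size):
            # Distances from rows [start, start + chunk_size) to every row of X.
            block = cdist(X[start:start + chunk_size], X, metric="euclidean")
            # One text line per matrix row, values separated by spaces.
            np.savetxt(fout, block, fmt="%.6g", delimiter=" ")


if __name__ == "__main__":
    # Small random bit vectors standing in for vector.txt (hypothetical data).
    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=(200, 500))
    out_path = os.path.join(tempfile.mkdtemp(), "adjacencymatrix_demo.txt")
    write_distance_matrix(X, out_path, chunk_size=64)
```

Only one block of `chunk_size` rows is held in memory at a time. With 100000 vectors and `chunk_size=1000`, each block is 1000 x 100000 float64 values, about 800 MB, whereas the full matrix would be about 80 GB as float64.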
So I tried balancing the size of each chunk of the dataset against the memory overhead. This got me down to an estimated 6,600 seconds to finish, or ~110 minutes. You can see I also started looking into whether I could parallelize using the multiprocessing pool. My strategy would have been to process each chunk asynchronously, save each one to a different text file, and then concatenate the files afterwards, but I have to get back to work.
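The parallel strategy described above could look roughly like the sketch below. This is untested at full scale, and every name in it (`process_chunk`, `make_tasks`, `concatenate`, `run_parallel`, the `part_*.txt` file names) is hypothetical. It assumes the vectors have first been saved once as a float64 `.npy` file with `np.save`, so that each worker can memory-map them instead of receiving a pickled copy:

```python
import os
import shutil
from multiprocessing import Pool

import numpy as np
from scipy.spatial.distance import cdist


def process_chunk(task):
    """Compute one row block of the distance matrix and save it to its own file."""
    npy_path, start, stop, part_path = task
    # Memory-map the vectors so each worker reads them straight from disk.
    X = np.load(npy_path, mmap_mode="r")
    block = cdist(X[start:stop], X, metric="euclidean")
    np.savetxt(part_path, block, fmt="%.6g", delimiter=" ")
    return part_path


def make_tasks(npy_path, n_rows, chunk_size, out_dir):
    """Build one (npy_path, start, stop, part_path) task per row block, in row order."""
    tasks = []
    for i, start in enumerate(range(0, n_rows, chunk_size)):
        stop = min(start + chunk_size, n_rows)
        part_path = os.path.join(out_dir, f"part_{i:06d}.txt")
        tasks.append((npy_path, start, stop, part_path))
    return tasks


def concatenate(part_paths, out_path):
    """Append the per-chunk files, in the given order, into one output file."""
    with open(out_path, "wb") as fout:
        for part_path in part_paths:
            with open(part_path, "rb") as fin:
                shutil.copyfileobj(fin, fout)


def run_parallel(npy_path, n_rows, chunk_size, out_dir, out_path, processes=None):
    """Process all chunks in a worker pool, then stitch the part files together."""
    tasks = make_tasks(npy_path, n_rows, chunk_size, out_dir)
    with Pool(processes=processes) as pool:
        # map() returns results in task order, so the parts stay in row order.
        part_paths = pool.map(process_chunk, tasks)
    concatenate(part_paths, out_path)
```

A call such as `run_parallel("vector.npy", 100000, 1000, "parts", "adjacencymatrix.txt")` should sit under an `if __name__ == "__main__":` guard, and the `parts` directory must already exist. Keep in mind that every worker holds its own block in memory, so the chunk size has to shrink as the number of processes grows, and that writing the text output may turn out to be the bottleneck rather than the distance computation.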