I am writing a function to randomly select elements stored in a dictionary:
import random
from liblas import file as lasfile
from collections import defaultdict

def point_random_selection(list,k):
    try:
        sample_point = random.sample(list,k)
    except ValueError:
        sample_point = list
    return(sample_point)

def world2Pixel_Id(x,y,X_Min,Y_Max,xDist,yDist):
    col = int((x - X_Min)/xDist)
    row = int((Y_Max - y)/yDist)
    return("{0}_{1}".format(col,row))

def point_GridGroups(inFile,X_Min,Y_Max,xDist,yDist):
    Groups = defaultdict(list)
    for p in lasfile.File(inFile,None,'r'):
        id = world2Pixel_Id(p.x,p.y,X_Min,Y_Max,xDist,yDist)
        Groups[id].append(p)
    return(Groups)
where k is the number of elements to select and Groups is the dictionary.
file_out = lasfile.File("outPut",mode='w',header= h)
for m in Groups.iteritems():
    # select k point for each dictionary key
    point_selected = point_random_selection(m[1],k)
    for l in xrange(len(point_selected)):
        # save the data
        file_out.write(point_selected[l])
file_out.close()
My problem is that this approach is extremely slow (around 4 days for a file of ~800 MB).
You could try and update your samples as you read the coordinates. This at least saves you from having to store everything in memory before running your sample. This is not guaranteed to make things faster.
The following is based on BlkKnght’s excellent answer on building a random sample from file input without retaining all the lines. This just expands it to keep multiple samples instead.
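A minimal sketch of that idea, using per-group reservoir sampling. To keep it runnable without liblas, it assumes the function receives an iterable of points (any objects with x and y attributes) rather than opening inFile itself; passing lasfile.File(inFile, None, 'r') as that iterable should correspond to the question's setup. The name random_grouped_samples and the seen counter are illustrative choices, not necessarily the original code.

```python
import random
from collections import defaultdict


def world2Pixel_Id(x, y, X_Min, Y_Max, xDist, yDist):
    # Grid cell of the point, returned as a (col, row) tuple instead of a string.
    col = int((x - X_Min) / xDist)
    row = int((Y_Max - y) / yDist)
    return (col, row)


def random_grouped_samples(points, n, X_Min, Y_Max, xDist, yDist):
    # `points` is assumed to be any iterable of objects with .x and .y,
    # for example lasfile.File(inFile, None, 'r') from the question.
    samples = defaultdict(list)  # grid cell -> at most n sampled points
    seen = defaultdict(int)      # grid cell -> number of points seen so far

    for p in points:
        key = world2Pixel_Id(p.x, p.y, X_Min, Y_Max, xDist, yDist)
        seen[key] += 1
        group = samples[key]
        if len(group) < n:
            # The reservoir for this cell is not full yet: keep the point.
            group.append(p)
        else:
            # Keep the new point with probability n / seen[key] by
            # overwriting a uniformly chosen slot of the reservoir.
            j = random.randrange(seen[key])
            if j < n:
                group[j] = p
    return samples
```

Only n points per grid cell are held in memory at any time, so memory use depends on the number of cells and n, not on the size of the input file.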
The above function takes inFile and your boundary coordinates, as well as the sample size n, and returns grouped samples that have at most n items in each group, picked uniformly.
Because all you use the id for is as a group key, I reduced it to only calculating the (col, row) tuple; there is no need to make it a string.
You can write these out to a file with:
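A minimal sketch of that write-out step. The helper name write_samples is hypothetical, and out is assumed to be any object with a write() method; in the question's setup that would be the lasfile.File opened for writing with the header h.

```python
def write_samples(samples, out):
    # `samples` maps each group key to its list of sampled points;
    # `out` is assumed to expose write(point), like the question's file_out.
    written = 0
    for group in samples.values():
        for p in group:
            out.write(p)
            written += 1
    return written


# With liblas, as in the question (h is the header used there):
# file_out = lasfile.File("outPut", mode='w', header=h)
# write_samples(samples, file_out)
# file_out.close()
```

Iterating over the values directly avoids indexing each list by position, which the xrange(len(...)) loop in the question did.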