I’m having trouble using vq.whiten from scipy.cluster to normalise my data. I’m passing in a numpy array which has had missing feature values filled in with the average for each feature.
The line it gets stuck on is:
data = scipy.cluster.vq.whiten(self.imputed)
This is the code I’m using to replace the missing data.
imputed = np.array([self.masked[:, i].filled(self.masked[:, i].mean())
                    for i in range(np.shape(self.masked)[1])])
self.imputed = np.transpose(imputed)
I’m sure there’s a better way of doing this part too, quite apart from the fact it seems to be breaking my code. It seems an ugly way of going about it and that normally means there’s a better way with Python.
I’ve tried slicing down how much of the array I send to whiten, but no matter what I get the following traceback:
Traceback (most recent call last):
  File "C:\Users\jamie.bull\workspace\Metadata\src\draft_workflow.py", line 87, in <module>
    dataset.cluster()
  File "C:\Users\jamie.bull\workspace\Metadata\src\draft_workflow.py", line 59, in cluster
    data = scipy.cluster.vq.whiten(self.imputed)
  File "C:\Enthought\Python27\lib\site-packages\scipy\cluster\vq.py", line 131, in whiten
    std_dev = std(obs, axis=0)
  File "C:\Enthought\Python27\lib\site-packages\numpy\core\fromnumeric.py", line 2467, in std
    return std(axis, dtype, out, ddof)
AttributeError: sqrt
The clustering works fine with the same dataset without any missing data so I’m at a loss for what to try next.
Edit:
I tried printing out the type of each item in imputed for both the full data set and the one with missing data using:
for item in imputed:
    print type(item)
The difference between the two is that the version which hasn’t had the mean substitution and transpose applied has one numpy.ndarray for each row, while the one which has been mean substituted has one for each column.
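Checking the array's dtype is usually more telling than the type of each row. The sketch below uses two hypothetical arrays with the same values and assumes the imputed array ended up with dtype object, which the AttributeError on sqrt suggests but the post does not confirm. With an object array, np.std fails at the square-root step.

```python
import numpy as np

# Hypothetical arrays: same values, different dtypes
as_float = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
as_object = as_float.astype(object)

print(as_float.dtype)
print(as_object.dtype)

# whiten calls std(obs, axis=0); try the same call on the object array
try:
    np.std(as_object, axis=0)
    failed = False
except (AttributeError, TypeError) as exc:
    # The exception type and message vary between numpy versions
    failed = True
    print(type(exc).__name__)
```

If self.imputed.dtype prints object rather than float64, the array needs converting before it goes to whiten.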
I’ve solved this one now, so I’ll put the answer here for future lost souls. The problem was that my mean replacement was replacing the missing values with floats when the original data was stored as numpy.float64. The solution is to run the list comprehension and follow it by setting the dtype to np.float64. It seems that whiten doesn’t like to receive mixed data types.
Also, solving the ugliness problem of having to transpose after the list comprehension, I rediscovered np.column_stack(). The working function is now:
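A minimal sketch of that approach rather than the original function: the name impute_column_means and the sample data are hypothetical, and it assumes a 2-D numpy masked array. Each column is filled with its own mean, the columns are put back together with np.column_stack(), and the result is cast to np.float64 before whitening.

```python
import numpy as np
from scipy.cluster.vq import whiten


def impute_column_means(masked):
    """Fill masked entries with their column mean; return a float64 array."""
    columns = [masked[:, i].filled(masked[:, i].mean())
               for i in range(masked.shape[1])]
    # column_stack places each 1-D array as a column, so no transpose is needed
    return np.column_stack(columns).astype(np.float64)


# Hypothetical sample data; NaN marks the missing value
raw = np.array([[1.0, 10.0],
                [2.0, np.nan],
                [3.0, 30.0]])
masked = np.ma.masked_invalid(raw)

imputed = impute_column_means(masked)
data = whiten(imputed)
```

The explicit astype(np.float64) does nothing when the columns are already float64, but it guards against the stacked array picking up another dtype.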
A long time ago now, but I thought I’d update here. I would now use pandas for data handling, with fillna() for this situation. The offending line in the OP could be replaced with:
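A minimal sketch of the pandas version, not the original replacement line: it assumes the data is held in a DataFrame with NaN for missing values, and the df below is hypothetical sample data.

```python
import numpy as np
import pandas as pd
from scipy.cluster.vq import whiten

# Hypothetical sample data; NaN marks the missing value
df = pd.DataFrame({"a": [1.0, 2.0, 3.0],
                   "b": [10.0, np.nan, 30.0]})

# df.mean() gives one mean per column (NaNs are skipped);
# fillna matches those means to the columns by label
imputed = df.fillna(df.mean())

# Hand whiten a plain float64 array
data = whiten(imputed.to_numpy(dtype=np.float64))
```

This does the imputation in one call, with no list comprehension and no transpose.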