I’m having trouble using vq.whiten from scipy.cluster to normalise my data. I’m passing in

Asked: June 14, 2026 (2026-06-14T22:08:26+00:00)

I’m having trouble using vq.whiten from scipy.cluster to normalise my data. I’m passing in a numpy array which has had missing feature values filled in with the average for each feature.

The line it gets stuck on is:

data = scipy.cluster.vq.whiten(self.imputed)

This is the code I’m using to replace the missing data.

imputed = np.array([self.masked[:,i].filled(self.masked[:,i].mean()) 
                   for i in range(np.shape(self.masked)[1])])
self.imputed = np.transpose(imputed)

I’m sure there’s a better way of doing this part too, quite apart from the fact that it seems to be breaking my code. It feels like an ugly way of going about it, and with Python that normally means there’s a better way.

I’ve tried slicing down how much of the array I send to whiten, but no matter what I get the following traceback.

Traceback (most recent call last):
  File "C:\Users\jamie.bull\workspace\Metadata\src\draft_workflow.py", line 87, in <module>
    dataset.cluster()
  File "C:\Users\jamie.bull\workspace\Metadata\src\draft_workflow.py", line 59, in cluster
    data = scipy.cluster.vq.whiten(self.imputed)
  File "C:\Enthought\Python27\lib\site-packages\scipy\cluster\vq.py", line 131, in whiten
    std_dev = std(obs, axis=0)
  File "C:\Enthought\Python27\lib\site-packages\numpy\core\fromnumeric.py", line 2467, in std
    return std(axis, dtype, out, ddof)
AttributeError: sqrt

The clustering works fine with the same dataset without any missing data so I’m at a loss for what to try next.

Edit:
I tried printing out the type of each item in imputed for both the full data set and the one with missing data using:

for item in imputed:
    print(type(item))

The difference between the two is that the version which hasn’t had the mean substitution and transpose applied has one numpy.ndarray for each row, while the mean-substituted version has one for each column.

1 Answer

Editorial Team · Answer 1 · 2026-06-14T22:08:27+00:00

I’ve solved this one now, so I’ll put the answer here for future lost souls. The problem was that my mean replacement filled the missing values with plain floats, whereas the original data was stored as numpy.float64.

The solution is to run the list comprehension and then set the dtype of the result to np.float64. It seems that whiten doesn’t like to receive mixed data types.
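As a rough sketch of how to check for this, the snippet below inspects the dtype and casts to float64 before normalising. The array is made up for illustration, and it assumes the imputed array ended up with dtype object, which the original post doesn't confirm. The last line is only an approximation of whiten, which divides each feature by its standard deviation.

```python
import numpy as np

# Hypothetical stand-in for self.imputed: floats stored in an object-dtype array.
imputed = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 9.0]], dtype=object)
print(imputed.dtype)  # object

# Cast to a uniform float64 array before normalising.
clean = np.asarray(imputed, dtype=np.float64)

# Roughly what whiten computes: scale each column by its standard deviation.
whitened = clean / clean.std(axis=0)
```

If printing the dtype shows float64 already, the cause is probably something else.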

Also, to solve the ugliness of having to transpose after the list comprehension, I rediscovered np.column_stack(). The working function is now:

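def mean_impute(self):
    # Fill each column's masked entries with that column's mean, then stack
    # the filled columns side by side (no transpose needed). A list is passed
    # because newer NumPy versions reject a bare generator here.
    imputed = np.column_stack([self.masked[:, i].filled(self.masked[:, i].mean())
                               for i in range(self.masked.shape[1])])
    # Force a uniform float64 dtype before passing the array to whiten.
    self.imputed = np.array(imputed, dtype=np.float64)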
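The per-column loop can probably be avoided altogether. Here is a minimal sketch, assuming the input is a 2-D numpy masked array of numeric values. It is written as a standalone function rather than a method, and the sample data is invented for illustration.

```python
import numpy as np

def mean_impute(masked):
    """Return a float64 ndarray with masked entries set to their column mean."""
    masked = np.ma.asarray(masked).astype(np.float64)
    # Masked-aware mean of each column; masked values are ignored.
    col_means = np.ma.filled(masked.mean(axis=0), np.nan)
    # Use the column mean where a value is masked, the original value elsewhere.
    return np.where(np.ma.getmaskarray(masked), col_means, np.ma.getdata(masked))

# Made-up example data: NaN marks the missing entries.
raw = np.array([[1.0, 2.0], [np.nan, 4.0], [3.0, np.nan]])
imputed = mean_impute(np.ma.masked_invalid(raw))
```

A column that is entirely masked has no mean, so in this sketch it comes back as NaN and would need handling before clustering.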

Edited to add

This was a long time ago now, but I thought I’d update here. I would now use pandas for data handling, and its fillna() method for this situation.

The offending line in the OP could be replaced with the following, assuming self.masked is a pandas DataFrame with NaN for the missing values:

imputed = self.masked.fillna(self.masked.mean())
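For a self-contained illustration of the pandas approach, here is a small sketch. The DataFrame and its column names are invented, and the conversion to a float64 array at the end is my assumption about what you would pass on to whiten.

```python
import numpy as np
import pandas as pd

# Made-up data: NaN marks the missing feature values.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [2.0, 4.0, np.nan]})

# df.mean() gives one mean per column; fillna matches them up by column name.
imputed = df.fillna(df.mean())

# Plain float64 array, ready to hand to the clustering code.
data = imputed.to_numpy(dtype=np.float64)
```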
