Okay, if you would like to know how this numpy array is created, look at these questions:
- How can I add items to collection.Counter? and then sort them into ASC?
- Porter Stemmer Algorithm Not returning the expected output? when modified into def
Let’s presume I have a numpy array that looks like the one below (created from this after preprocessing; the array has also been shuffled by numpy, so the result is random):
[[ 3 2 2 ..., 0 0 0]
[14 1 0 ..., 0 0 0]
[ 3 2 1 ..., 0 0 0]
...,
[ 1 1 1 ..., 0 0 0]
[ 2 2 2 ..., 0 0 0]
[ 1 1 0 ..., 0 0 0]]
I understand the data is large: it consists of 600 emails (each about 2000 words long), with statistics on 196 common spam words from around the internet for each email.
I would like to use it with Stephen Marsland’s K-Means Neural Network, which states in the comments "won't work if (0,0,...0) is in data", but I’m unsure what this comment is referring to. (I thought the ... symbol meant the same as it does in math, e.g. 1...n, where there is stuff between 1 and n.) If it means something else, how should I tackle the problem with the invalid divide? Thanks!
I’m not sure, but something in my data set is causing this error:
RuntimeWarning: invalid value encountered in divide
data = transpose(transpose(data)/normalisers)
The ..., just indicates that there’s too much data to print to the console at once. I think the comment in the code you link to means that the preprocessing won’t work if there are any all-zero vectors in your data.
You can test for this with a check that returns True if you have any all-zero feature vectors, assuming that your rows are data examples and your columns are features.
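A minimal sketch of such a check, assuming rows are examples and columns are features. The small `data` array here is made up for illustration, and the names `zero_rows`, `has_zero_row` and `zero_row_indices` are mine, not from the linked code:

```python
import numpy as np

# Hypothetical toy data: rows are examples, columns are features.
data = np.array([[3, 2, 2, 0],
                 [0, 0, 0, 0],
                 [1, 1, 0, 0]])

# A row is "all zero" when none of its entries is non-zero.
zero_rows = ~data.any(axis=1)

# True if at least one example is an all-zero vector.
has_zero_row = bool(zero_rows.any())

# Indices of the offending rows, handy for inspecting or dropping them.
zero_row_indices = np.flatnonzero(zero_rows)

print(has_zero_row)
print(zero_row_indices)
```

If the check comes back True, `zero_row_indices` tells you which emails contain none of the counted words.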
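Here’s what’s happening: judging by the line in your warning, the preprocessing computes a normaliser for each data vector and divides by it, and for an all-zero vector that normaliser presumably comes out as zero.
And then you’re dividing by zero.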
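To make that concrete, here is a small sketch that reproduces the problem. It assumes the normaliser is the Euclidean length of each row, which is my guess at what the linked code does and not something confirmed here; only the division line is modelled on the one in the warning:

```python
import numpy as np

# Hypothetical toy data: the second row is an all-zero vector.
data = np.array([[3.0, 4.0],
                 [0.0, 0.0]])

# Assumed normaliser: the Euclidean length of each row.
normalisers = np.sqrt((data ** 2).sum(axis=1))

# Same shape of expression as in the warning. The zero row gives 0/0,
# which NumPy reports as an "invalid value" RuntimeWarning; errstate
# silences it here so the result can be inspected quietly.
with np.errstate(invalid="ignore"):
    normalised = (data.T / normalisers).T

print(normalisers)
print(normalised)
```

The first row is scaled to unit length, while the all-zero row turns into NaNs, which is what then breaks the rest of the algorithm.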
One option might be to simply add one to every feature prior to the preprocessing step, so that no vector is all zeros.
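A rough sketch of that idea, again assuming a Euclidean-length normaliser (an assumption on my part) and using made-up toy data:

```python
import numpy as np

# Hypothetical toy data: the second row is an all-zero vector.
data = np.array([[3.0, 4.0],
                 [0.0, 0.0]])

# Add one to every feature so no row can be all zeros.
shifted = data + 1

# Assumed normaliser: the Euclidean length of each row (now never zero).
normalisers = np.sqrt((shifted ** 2).sum(axis=1))

# Same division as in the warning, now without any 0/0.
normalised = (shifted.T / normalisers).T

print(normalised)
```

Bear in mind that shifting every count changes the direction of each vector, so the clusters may differ somewhat from what you would get on the raw counts. Dropping the all-zero rows instead is another possibility if those emails carry no information anyway.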