I am experimenting with different kinds of non-linear kernels and am trying to interpret the learned models, which led me to the following question: is there a generic method for getting the primal weights of a non-linear support vector machine, similar to how this is possible for linear SVMs (see related question)?
Say you have three features a, b, c and a model trained with an all-subsets or polynomial kernel. Is there a way to extract the primal weights of the induced features, e.g., a * b and a^2?
I’ve tried extending the method for linear kernels, where you generate output for the following samples:
a, b, c
[0, 0, 0]
[1, 0, 0]
[0, 1, 0]
[0, 0, 1]
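A minimal sketch of that probing idea for the linear case, assuming the trained model is available as a black-box function returning the raw decision value. The names `probe_linear_weights` and `toy_linear_model` are hypothetical and stand in for a real trained SVM:

```python
def probe_linear_weights(decision_function, n_features):
    """Recover bias and per-feature weights of a linear model by probing it."""
    zero = [0.0] * n_features
    bias = decision_function(zero)  # f(0) = b
    weights = []
    for i in range(n_features):
        unit = zero.copy()
        unit[i] = 1.0  # unit vector e_i
        weights.append(decision_function(unit) - bias)  # f(e_i) - f(0) = w_i
    return bias, weights


# Hypothetical linear model standing in for a trained linear SVM.
def toy_linear_model(x):
    w, b = [2.0, -1.0, 0.5], 0.25
    return sum(wi * xi for wi, xi in zip(w, x)) + b


bias, weights = probe_linear_weights(toy_linear_model, 3)
print(bias, weights)
```

The all-zeros sample yields the bias, and each unit vector yields the bias plus one weight, so n + 1 predictions suffice for n features.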
If I use the same approach for the all-subsets kernel, I can generate some more samples:
a, b, c
[1, 1, 0]
[1, 0, 1]
...
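Next, to calculate the primal weight for a * b, I combine the predictions as follows: [1, 1, 0] - [1, 0, 0] - [0, 1, 0] + [0, 0, 0], where each vector stands for the model’s prediction on that sample. The prediction for [0, 0, 0] (the bias) is added back because it is subtracted twice along with the two single-feature predictions.
The problem I see with this is that it requires a prohibitive number of samples, doesn’t address terms such as a^2, and doesn’t generalise to other non-linear kernels.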
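As a sketch, the same probe can be written for any subset of features using inclusion-exclusion over 0/1 inputs; for a pair, the all-zeros prediction enters with a plus sign. This assumes the model is multilinear in its inputs (a sum of products of distinct features), and `probe_subset_weight` and `toy_model` are hypothetical names used only for illustration:

```python
from itertools import combinations


def probe_subset_weight(decision_function, subset, n_features):
    """Weight of the product term over `subset`, via inclusion-exclusion."""
    total = 0.0
    for r in range(len(subset) + 1):
        for chosen in combinations(subset, r):
            x = [0.0] * n_features
            for i in chosen:
                x[i] = 1.0  # switch on only the chosen features
            sign = (-1) ** (len(subset) - r)  # alternate sign by subset size
            total += sign * decision_function(x)
    return total


# Hypothetical multilinear model with bias, linear and pairwise terms.
def toy_model(x):
    a, b, c = x
    return 0.5 + 2.0 * a - 1.0 * b + 0.25 * c + 3.0 * a * b - 4.0 * a * c


w_ab = probe_subset_weight(toy_model, (0, 1), 3)
w_ac = probe_subset_weight(toy_model, (0, 2), 3)
print(w_ab, w_ac)
```

A subset of size k needs 2^k predictions, and on 0/1 inputs a^2 is indistinguishable from a, so squared terms cannot be separated this way.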
No. I don’t claim to be the end-all-be-all expert on this, but I’ve done a lot of reading and research on SVMs, and I do not think what you are asking for is possible in general. Sure, in the case of the second-degree polynomial kernel you can enumerate the feature space induced by the kernel, if the number of attributes is very small. For higher-order polynomial kernels and larger numbers of attributes this quickly becomes intractable: with n attributes and degree d, the induced space has C(n + d, d) dimensions.
The power of the non-linear SVM is that it is able to induce feature spaces without having to do computation in that space, and in fact without actually knowing what that feature space is. Some kernels, such as the Gaussian (RBF) kernel, even induce an infinite-dimensional feature space.
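For the small second-degree case, the enumeration could look like the sketch below. It assumes the kernel is exactly (x . z + 1)^2 and that the support vectors, the dual coefficients (alpha_i * y_i) and the bias can be read off the trained model; the values here are made up for illustration:

```python
from itertools import combinations
from math import sqrt


def poly2_kernel(x, z):
    """Inhomogeneous degree-2 polynomial kernel (x . z + 1)^2."""
    return (sum(xi * zi for xi, zi in zip(x, z)) + 1.0) ** 2


def poly2_feature_map(x):
    """Explicit map phi with phi(x) . phi(z) == poly2_kernel(x, z)."""
    n = len(x)
    feats = [1.0]                                  # constant term
    feats += [sqrt(2.0) * xi for xi in x]          # linear terms
    feats += [xi * xi for xi in x]                 # squared terms
    feats += [sqrt(2.0) * x[i] * x[j]              # pairwise products
              for i, j in combinations(range(n), 2)]
    return feats


def primal_weights(support_vectors, dual_coefs):
    """w = sum_i coef_i * phi(sv_i), where coef_i = alpha_i * y_i."""
    w = [0.0] * len(poly2_feature_map(support_vectors[0]))
    for sv, coef in zip(support_vectors, dual_coefs):
        for k, value in enumerate(poly2_feature_map(sv)):
            w[k] += coef * value
    return w


# Hypothetical support vectors, dual coefficients and bias.
support_vectors = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0], [2.0, 1.0, 0.0]]
dual_coefs = [0.5, -1.0, 0.25]
bias = 0.1

w = primal_weights(support_vectors, dual_coefs)

# The primal and dual forms should agree on any input.
x = [0.3, -1.2, 0.7]
dual_value = sum(c * poly2_kernel(sv, x)
                 for sv, c in zip(support_vectors, dual_coefs)) + bias
primal_value = sum(wk * fk for wk, fk in zip(w, poly2_feature_map(x))) + bias
print(len(w), abs(dual_value - primal_value) < 1e-9)
```

With n attributes this map has (n + 1)(n + 2) / 2 coordinates, which is manageable for three features but grows quickly with the number of attributes and the degree.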
If you look back at your question, you can see part of the issue: you are looking for the primal weights. However, the kernel is something that is introduced in the dual form, where the data shows up only as dot products. Mathematically reversing this process would involve breaking the kernel function apart, that is, knowing the mapping function from input space to feature space. Kernel functions are powerful precisely because we do not need to know this mapping. Of course it can be done for linear kernels, because there the mapping is simply the identity.
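In symbols, a sketch of this relationship, where the notation is an assumption on my part: \alpha_i are the dual variables, y_i the labels, x_i the training points, b the bias and \phi the feature map:

```latex
\begin{align*}
f(x) &= \sum_i \alpha_i y_i \, K(x_i, x) + b,
  & K(x_i, x) &= \langle \phi(x_i), \phi(x) \rangle, \\
w &= \sum_i \alpha_i y_i \, \phi(x_i),
  & f(x) &= \langle w, \phi(x) \rangle + b .
\end{align*}
```

The primal weight vector w can only be written down when \phi is known explicitly. For the linear kernel \phi(x) = x, so w = \sum_i \alpha_i y_i x_i.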