I am using libsvm to predict sentiment. I would like to know what format the input has to be in,
assuming I am using word counts as features.
[label] [index]:[value] [index]:[value]
That is the format libsvm requires. Does that mean I have just two labels (one for positive and one for negative), that the index would be each word under that label, and that the value would be the frequency of each word?
Does this also mean I need to store the word-to-index mapping so I can use it on my test set?
LIBSVM uses the so-called “sparse” format, in which zero values do not need to be stored. Hence a sample with the attribute values
5 0 2 0
is represented as
1:5 3:2
Therefore, you only need to specify the index and the value of each nonzero attribute. Indices start at 1.
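As a rough illustration of the sparse conversion described above, here is a minimal Python sketch. The function name to_sparse is made up for this example and is not part of LIBSVM.

```python
def to_sparse(values):
    """Format a dense list of attribute values as LIBSVM index:value pairs."""
    # Indices are 1-based; attributes whose value is zero are skipped.
    return " ".join(f"{i}:{v}" for i, v in enumerate(values, start=1) if v != 0)


print(to_sparse([5, 0, 2, 0]))  # prints 1:5 3:2
```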
The label goes in the first column of each line. For binary problems you can use +1 for positive and -1 for negative samples. You are also not limited to two labels; you can use other numbers (e.g. 1, 2, 3, 4, 5, …).
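For the word-count case in the question, each distinct word gets its own index and the value is how often that word occurs in the document. The same word-to-index mapping then has to be reused for the test set, so that an index means the same word in both files. Below is a minimal sketch of one way to do this, assuming simple lowercase whitespace tokenization and that test words missing from the training vocabulary are dropped. The helper names build_vocab and to_libsvm_line are hypothetical and not part of LIBSVM.

```python
from collections import Counter


def build_vocab(docs):
    """Assign each distinct training word a 1-based index (hypothetical helper)."""
    vocab = {}
    for text in docs:
        for word in text.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab) + 1
    return vocab


def to_libsvm_line(label, text, vocab):
    """Turn one labelled document into a '[label] [index]:[value] ...' line."""
    # Count only words that exist in the vocabulary (assumption: unseen words are ignored).
    counts = Counter(w for w in text.lower().split() if w in vocab)
    # LIBSVM expects the indices on each line in ascending order.
    pairs = sorted((vocab[w], c) for w, c in counts.items())
    return " ".join([f"{label:+d}"] + [f"{i}:{c}" for i, c in pairs])


train = [(+1, "good good movie"), (-1, "bad movie")]
vocab = build_vocab(text for _, text in train)

for label, text in train:
    print(to_libsvm_line(label, text, vocab))
# prints:
# +1 1:2 2:1
# -1 2:1 3:1

# The test set reuses the vocabulary built from the training set.
print(to_libsvm_line(+1, "good bad unseen", vocab))  # prints +1 1:1 3:1
```

In practice the vocab dictionary would be saved to disk (for example as JSON) alongside the trained model, so the test data can be encoded with the same indices later.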