I am having trouble generating the input training vector set for LIBSVM. I have 3 categories and their relevant training documents with term weights as follows (the values are only assumed).
(Label/Category):1
Term frequency Vector(TF*IDF)
Document1-> 1:0.25 2:1.056 3:2.356
Document2-> 2:1.25 3:0.145 4:1.543
Document3-> 1:1.00 2:2.145 5:3.543
(Label/Category):2
Term frequency Vector(TF*IDF)
Document4-> 1:0.25 2:1.056 3:2.356
Document5-> 2:1.25 3:0.145 4:1.543
Document6-> 1:1.00 2:2.145 5:3.543
(Label/Category):3
Term frequency Vector(TF*IDF)
Document7-> 1:0.25 2:1.056 3:2.356
Document8-> 2:1.25 3:0.145 4:1.543
Document9-> 1:1.00 2:2.145 5:3.543
Can anyone tell me how to convert this into a set of training vectors for LIBSVM? Here, in 1:0.25 2:1.056 3:2.356, each pair is a term index and its weight. Term indices are maintained manually in a global dictionary.
Also, how do I convert a test document into a term vector?
Thanks in advance.
Hi Qnan, I have prepared a sample training vector space as you suggested. Can you please tell me whether my vector formation is correct?
(Label/Category):1
1 1:0.25 2:1.056 3:2.356 ->(training instance 1-for Document1)
1 2:1.25 3:0.145 4:1.543 ->(training instance 2-for Document2)
1 1:1.00 2:2.145 5:3.543 ->(training instance 3-for Document3)
(Label/Category):2
2 1:0.25 2:1.056 3:2.356 ->(training instance 4-for Document4)
2 2:1.25 3:0.145 4:1.543 ->(training instance 5-for Document5)
2 1:1.00 2:2.145 5:3.543 ->(training instance 6-for Document6)
(Label/Category):3
3 1:0.25 2:1.056 3:2.356 ->(training instance 7-for Document7)
3 2:1.25 3:0.145 4:1.543 ->(training instance 8-for Document8)
3 1:1.00 2:2.145 5:3.543 ->(training instance 9-for Document9)
The format is described in the README file of the LIBSVM distribution. Basically, it is <label> <index1>:<value1> <index2>:<value2> ... with one line per training instance. The feature indices should be in ascending order, too.
The test set looks exactly the same, except that the first column may contain some fixed number, e.g. 0, if you do not know the true labels for that set.
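A minimal sketch of how such lines could be produced from Python, assuming each document is already available as a mapping from term index (taken from the global dictionary) to its TF*IDF weight. The helper name to_libsvm_line and the sample data are hypothetical and not part of LIBSVM.

```python
def to_libsvm_line(label, weights):
    """Format one document as '<label> <index>:<value> ...'.

    `weights` maps a 1-based term index to its weight (assumed input shape).
    """
    # LIBSVM expects the feature indices in ascending order.
    pairs = sorted(weights.items())
    features = " ".join(f"{index}:{value:g}" for index, value in pairs)
    return f"{label} {features}"


# Hypothetical training documents: (label, {term index: weight}).
training = [
    (1, {3: 2.356, 1: 0.25, 2: 1.056}),
    (2, {2: 1.25, 4: 1.543, 3: 0.145}),
]
train_lines = [to_libsvm_line(label, weights) for label, weights in training]

# A test document whose true label is unknown gets a placeholder label, e.g. 0.
test_line = to_libsvm_line(0, {5: 3.543, 1: 1.0, 2: 2.145})

print("\n".join(train_lines))
print(test_line)
```

A test document would go through the same global dictionary as the training documents, so that a given index refers to the same term in both files.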
As for your data, I don’t quite see how you can have all those different weight vectors for the same Document1 and the same set of terms. Could you clarify that?
EDIT:
The format is OK; if you remove the comments, LIBSVM runs just fine. Assuming you’re running Windows and the file test.txt holds the data in that form, you can use ./libsvm-3.12/windows/svm-train.exe test.txt for training and ./libsvm-3.12/windows/svm-predict.exe test.txt test.txt.model test.txt.out for prediction. On other systems the command is similar.
Note that with this data the accuracy won’t be higher than 1/3, since the same weight vectors are present in the dataset with each of the labels.
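One way to see the accuracy limit is to strip the comments and then look for identical feature vectors that carry different labels. The sketch below is only an illustration under assumptions: it uses a subset of the annotated lines from the question, treats everything after "->" as a comment, and assumes labels are non-negative integers. The name parse_line is hypothetical.

```python
from collections import defaultdict

# Annotated lines in the style of the question (sample subset).
raw = """\
(Label/Category):1
1 1:0.25 2:1.056 3:2.356 ->(training instance 1-for Document1)
(Label/Category):2
2 1:0.25 2:1.056 3:2.356 ->(training instance 4-for Document4)
(Label/Category):3
3 1:0.25 2:1.056 3:2.356 ->(training instance 7-for Document7)
"""


def parse_line(line):
    """Return (label, features) for a data line, or None for a header or blank line."""
    line = line.split("->", 1)[0].strip()  # drop the trailing comment
    # Assumption: data lines start with a non-negative integer label.
    if not line or not line[0].isdigit():
        return None
    label, *features = line.split()
    return label, " ".join(features)


labels_by_vector = defaultdict(set)
clean_lines = []
for raw_line in raw.splitlines():
    parsed = parse_line(raw_line)
    if parsed is None:
        continue
    label, features = parsed
    clean_lines.append(f"{label} {features}")
    labels_by_vector[features].add(label)

# Vectors listed under more than one label cannot all be classified correctly.
conflicts = {vec: sorted(labels) for vec, labels in labels_by_vector.items() if len(labels) > 1}
print(conflicts)
```

Here clean_lines holds the comment-free lines that could be written to the training file, and any entry in conflicts marks a vector for which a classifier can match at most one of its labels.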