I have a question about k-means clustering in R. I'm following this article. Everything is based on the examples within the tm package, so no data import is required. acq contains 50 documents and crude contains 20 documents.
library(tm)
data("acq")
data("crude")
ws <- c(acq, crude)
wsTDM <- Data(TermDocumentMatrix(ws)) #First problem here
wsKMeans <- kmeans(wsTDM, 2)
wsReutersCluster <- c(rep("acq", 50), rep("crude", 20))
cl_agreement(wsKMeans, as.cl_partition(wsReutersCluster), "diag")
Error in lapply(X, FUN, ...) :
(list) object cannot be coerced to type 'integer'
I actually want to create a cross-agreement matrix. But this article was written in 2008, and a lot has changed since then. The Data function is now only available in the RSurvey package, and I doubt it is the same function. I think the main problem is that TermDocumentMatrix used to be an S4 class and is now S3. I know it is possible to do this with the text only, but I want to do it this way because with a TDM it is possible to remove stopwords, punctuation, etc. for better results. So if someone has a solution, that would be terrific.
The TDM is stored as a sparse matrix, as described in ?TermDocumentMatrix. This can also be seen by inspecting the object with str(wsTDM), where wsTDM <- TermDocumentMatrix(ws). The old Data() function was just a way to access the contents as a regular matrix. It is not needed anymore. Just do kmeans(wsTDM, 2) and you'll see that the output is as expected, with clusters identified for 2775 observations (terms) on 70 features (documents). Good luck!
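For what it's worth, here is a minimal sketch of the whole thing without Data(). Two things in it are my own assumptions rather than anything from the article: I transpose the matrix so that the 70 documents (not the terms) are what gets clustered, since that seems to be what the cross-agreement matrix needs, and I use base R's table() for the cross-tabulation in place of cl_agreement(). The names wsDocs and agreement are only for illustration.

```r
library(tm)

data("acq")
data("crude")
ws <- c(acq, crude)   # 50 acq documents followed by 20 crude documents

# Build the term-document matrix directly; no Data() wrapper is needed.
wsTDM <- TermDocumentMatrix(ws)

# Convert the sparse matrix to a dense one and transpose it, so that
# rows are documents and columns are terms (assumption: we want to
# cluster documents, not terms).
wsDocs <- t(as.matrix(wsTDM))

set.seed(1)           # kmeans picks random starting centers
wsKMeans <- kmeans(wsDocs, 2)

# Known origin of each document, in the same order as in ws.
wsReutersCluster <- c(rep("acq", 50), rep("crude", 20))

# Cross-tabulate cluster assignment against the known labels.
agreement <- table(cluster = wsKMeans$cluster, source = wsReutersCluster)
print(agreement)
```

The cluster numbers themselves are arbitrary, so read the table by which source dominates each row. If you prefer to stay with cl_agreement(), both partitions would need to cover the same 70 objects, which is another reason to cluster the transposed matrix.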