I am trying to filter stop-words from the following documents using the tm package.
library(tm)
documents <- c("the quick brown fox jumps over the lazy dog", "i am the walrus")
corpus <- Corpus(VectorSource(documents))
matrix <- DocumentTermMatrix(corpus,control=list(stopwords=TRUE))
However, when I run this code I still get the following in the DocumentTermMatrix.
colnames(matrix)
[1] "brown" "dog" "fox" "jumps" "lazy" "over" "quick" "the" "walrus"
“The” is listed as a stop-word in the list that package tm uses. Am I doing something wrong regarding the stopwords parameter, or is this a bug in the tm package?
EDIT: I contacted Ingo Feinerer and he noted that it is technically not a bug:
User-provided options are processed first, and then all remaining
options. Hence stopword removal is done before tokenization (as
already written by Vincent Zoonekynd on stackoverflow.com) which gives
exactly your result.
Therefore, the solution is to explicitly list the default tokenizing option prior to the stopwords parameter, for example:
library(tm)
documents <- c("the quick brown fox jumps over the lazy dog", "i am the walrus")
corpus <- Corpus(VectorSource(documents))
matrix <- DocumentTermMatrix(corpus,control=list(tokenize=scan_tokenizer,stopwords=TRUE))
colnames(matrix)
It is a bug: you may want to report it to the package author(s). The termFreq function applies various filters to the texts, but not always in the right order. In your example, the code attempts to remove the stopwords before tokenization, i.e., before the text is cut into words. It should happen after tokenization, once we know what the words are.
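To see why the order matters, here is a minimal base-R sketch. It is not tm's actual implementation: the `stop_words` vector is a made-up short list, and the `tokenize` helper is a hypothetical whitespace splitter standing in for a real tokenizer. It only illustrates that a stop-word filter applied to an untokenized document matches nothing, because the whole string is never equal to a single stop word.

```r
# Hypothetical mini stop-word list for illustration; tm's real list is much longer.
stop_words <- c("the", "i", "am", "over")
doc <- "the quick brown fox jumps over the lazy dog"

# Simple whitespace tokenizer (an assumed stand-in, not tm's own tokenizer).
tokenize <- function(x) unlist(strsplit(x, "[[:space:]]+"))

# Wrong order: filter first, then tokenize.
# The document is still one long string, so it never matches a stop word
# and nothing is removed.
filtered_first <- tokenize(doc[!doc %in% stop_words])

# Right order: tokenize first, then filter the individual words.
tokens <- tokenize(doc)
tokenized_first <- tokens[!tokens %in% stop_words]

print(filtered_first)   # still contains "the" and "over"
print(tokenized_first)  # stop words are gone
```

In the first case every word survives, which matches the symptom in the question; in the second case only the non-stop-words remain.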