I am trying to filter stop-words from the following documents using package tm.

Asked: May 28, 2026

I am trying to filter stop-words from the following documents using package tm.

library(tm)
documents <- c("the quick brown fox jumps over the lazy dog", "i am the walrus")
corpus <- Corpus(VectorSource(documents))
matrix <- DocumentTermMatrix(corpus,control=list(stopwords=TRUE))

However, when I run this code I still get the following in the DocumentTermMatrix.

colnames(matrix)
[1] "brown"  "dog"    "fox"    "jumps"  "lazy"   "over"   "quick"  "the"    "walrus"

“The” is listed as a stop-word in the list that package tm uses. Am I doing something wrong regarding the stopwords parameter, or is this a bug in the tm package?

EDIT: I contacted Ingo Feinerer and he noted that it is technically not a bug:

User-provided options are processed first, and then all remaining
options. Hence stopword removal is done before tokenization (as
already written by Vincent Zoonekynd on stackoverflow.com) which gives
exactly your result.

Therefore, the solution is to explicitly list the default tokenizing option prior to the stopwords parameter, for example:

library(tm)
documents <- c("the quick brown fox jumps over the lazy dog", "i am the walrus")
corpus <- Corpus(VectorSource(documents))
matrix <- DocumentTermMatrix(corpus,control=list(tokenize=scan_tokenizer,stopwords=TRUE))
colnames(matrix)

1 Answer

Editorial Team · Answer 1 · 2026-05-28T13:49:33+00:00

It is a bug; you may want to report it to the package author(s). The termFreq function applies various filters to the text, but not always in the right order. In your example, the code attempts to remove the stopwords before tokenization, i.e., before the text is split into words. The removal should happen afterwards, once we know what the words are.
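
To see why the order matters, here is a minimal sketch in base R. It is only an illustration of the ordering problem, not the actual code inside termFreq: the two-word stop list and the whitespace tokenize helper are assumptions made up for this example.

```r
# Illustration only: base R, not the actual tm internals.
doc <- "the quick brown fox jumps over the lazy dog"
stop_words <- c("the", "over")  # hypothetical two-word stop list

# A simple whitespace tokenizer (an assumption; tm's tokenizers may differ).
tokenize <- function(x) unlist(strsplit(x, "[[:space:]]+"))

# Wrong order: filter first, tokenize second.
# The only element is the whole sentence, which never equals a stop word,
# so nothing is removed.
filtered_first <- tokenize(doc[!doc %in% stop_words])

# Right order: tokenize first, then filter the individual words.
tokens <- tokenize(doc)
tokenized_first <- tokens[!tokens %in% stop_words]

print(filtered_first)
print(tokenized_first)
```

With the filter applied first, "the" and "over" survive, because the stop list is compared against the whole sentence rather than against single words. With tokenization first, both are dropped.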
