I have some 100.000+ text documents. I’d like to find a way to answer

Question

Editorial Team

Asked: May 25, 2026

I have more than 100,000 text documents, and I’d like to find a way to answer this (somewhat ambiguous) question:

For a given subset of documents, what are the n most frequent words – relative to the full set of documents?

I’d like to present trends, e.g. a word cloud showing something like “these are the topics that are especially hot in the given date range”. (Yes, I know that this is an oversimplification: words != topics, etc.)

It seems that I could possibly calculate something like tf-idf values for all words in all documents, and then do some number crunching, but I don’t want to reinvent any wheels here.
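To make the "number crunching" concrete, here is a minimal sketch in plain Python of one way such a score could be computed: count each word in the subset, then weight it by a smoothed inverse document frequency taken over the full set. The names `tokenize` and `top_distinctive_words` are hypothetical, the regex tokenizer is deliberately naive, and the smoothing formula is just one common choice, not necessarily what Lucene or Solr use internally.

```python
import math
import re
from collections import Counter

# Naive tokenizer (an assumption for illustration): lowercase alphanumeric runs.
TOKEN_RE = re.compile(r"[a-z0-9']+")


def tokenize(text):
    return TOKEN_RE.findall(text.lower())


def top_distinctive_words(subset_docs, all_docs, n=10):
    """Return the n words with the highest subset-tf * corpus-idf score."""
    # Document frequency: in how many documents of the full set each word occurs.
    doc_freq = Counter()
    for doc in all_docs:
        doc_freq.update(set(tokenize(doc)))

    # Term frequency: how often each word occurs within the subset.
    term_freq = Counter()
    for doc in subset_docs:
        term_freq.update(tokenize(doc))

    total_docs = len(all_docs)
    scores = {}
    for word, tf in term_freq.items():
        # Smoothed idf; a word present in every document scores 0.
        idf = math.log((1 + total_docs) / (1 + doc_freq[word]))
        scores[word] = tf * idf

    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


if __name__ == "__main__":
    corpus = [
        "the cat sat",
        "the dog ran",
        "the election results",
        "the election debate",
    ]
    for word, score in top_distinctive_words(corpus[2:], corpus, n=3):
        print(word, round(score, 3))
```

For 100,000+ documents the document-frequency table would be computed once and reused for every subset; only the subset counts change per date range.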

I’m planning on possibly using Lucene or Solr for indexing the documents. Would they help me with this question – how? Or would you recommend some other tools in addition / instead?