I’m doing some research and I’m playing with Apache Mahout 0.6 My purpose is

Question


Asked: June 5, 2026 (2026-06-05T18:24:08+00:00)


I’m doing some research and playing with Apache Mahout 0.6.

My purpose is to build a system that names different categories of documents based on user input. The documents are not known in advance, and while collecting them I also don’t know which categories I have. I do know, however, that every document in the model belongs to one of a set of predefined categories.

For example:
Let’s say I’ve collected N documents that belong to 3 different groups:

Politics
Madonna (pop-star)
Science fiction

I don’t know which document belongs to which category, but I know that each of my N documents belongs to one of those categories (e.g. there are no documents about, say, basketball among these N docs).

So, I came up with the following idea:

Apply Mahout clustering to these documents (for example, k-means with k=3).
This should divide the N documents into 3 groups, which will serve as the model to learn from. I still don’t know which document really belongs to which group, but at least the documents are now clustered by group.
Ask the user to find any document on the web that is about ‘Madonna’ (I can’t show the user any of my N documents; that’s a restriction). Then I want to measure the ‘similarity’ between this document and each of the 3 groups.
I expect the similarity between user_doc and the documents in the Madonna group to be higher than the similarity between user_doc and the documents about politics.
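The similarity measurement described in the steps above could be sketched roughly as follows. This is plain Java rather than the Mahout API: it assumes the user’s document has already been turned into a term vector with the same dictionary and weighting as the clustered documents, and that each group is represented by its k-means centroid. The class name, the 4-term vocabulary and all the numbers are made up for illustration.

```java
import java.util.Arrays;

/** Sketch: score one document vector against k cluster centroids by cosine similarity. */
final class CentroidSimilarity {

    /** Cosine similarity of two equal-length vectors; 0.0 if either vector is all zeros. */
    static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Similarity of the document to every centroid, in centroid order. */
    static double[] scores(double[] doc, double[][] centroids) {
        double[] result = new double[centroids.length];
        for (int c = 0; c < centroids.length; c++) {
            result[c] = cosine(doc, centroids[c]);
        }
        return result;
    }

    /** Index of the centroid with the highest similarity to the document. */
    static int mostSimilar(double[] doc, double[][] centroids) {
        double[] s = scores(doc, centroids);
        int best = 0;
        for (int c = 1; c < s.length; c++) {
            if (s[c] > s[best]) {
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical vocabulary: [election, album, spaceship, tour]
        double[][] centroids = {
            {0.9, 0.0, 0.1, 0.0},  // cluster 0
            {0.0, 0.8, 0.0, 0.6},  // cluster 1
            {0.1, 0.0, 0.9, 0.0},  // cluster 2
        };
        double[] userDoc = {0.0, 0.5, 0.0, 0.5};  // hypothetical user-supplied document
        System.out.println(Arrays.toString(scores(userDoc, centroids)));
        System.out.println("closest cluster: " + mostSimilar(userDoc, centroids));
    }
}
```

The cluster with the highest score would then be the one to name ‘Madonna’. Mahout ships its own distance measures (for instance `CosineDistanceMeasure`), which could presumably take the place of the hand-written `cosine` method once the centroids are read back as Mahout vectors.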

I’ve managed to cluster the documents by following the ‘Mahout in Action’ book.
But I don’t understand how to use Mahout to measure the similarity between a ‘ready’ cluster of documents and one given document.

I thought about rerunning the clustering with k=3 on N+1 documents with the same centroids (in terms of k-means clustering) to see where the new document falls, but maybe there is another way to do that?

Is this possible with Mahout, or is my idea conceptually wrong? (An example in terms of the Mahout API would be really good.)

Thanks a lot, and sorry for the long question (I couldn’t describe it better).

Any help is highly appreciated.

P.S. This is not a homework project 🙂


1 Answer

Editorial Team · Answer 1 · 2026-06-05T18:24:10+00:00


This might be possible, but a much easier solution (IMHO) would be to hand-label a few documents from each category, then use those to bootstrap k-means. I.e., compute the centroids of the hand-labeled politics/Madonna/scifi documents and let k-means take it from there.

(In fancy terms, you would be doing semi-supervised nearest-centroid classification.)
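A minimal sketch of that bootstrapping idea, in plain Java rather than the Mahout API: compute one seed centroid per category from the hand-labeled documents, then run ordinary k-means (Lloyd’s) iterations from those seeds, so that cluster i keeps the label of seed i. The class name, method names and toy vectors are hypothetical, and squared Euclidean distance is an assumption; any other distance measure could be swapped in.

```java
import java.util.Arrays;

/** Sketch: k-means started from centroids of hand-labeled documents. */
final class SeededKMeans {

    /** Mean of a non-empty set of equal-length vectors. */
    static double[] centroid(double[][] docs) {
        double[] mean = new double[docs[0].length];
        for (double[] d : docs) {
            for (int i = 0; i < mean.length; i++) {
                mean[i] += d[i];
            }
        }
        for (int i = 0; i < mean.length; i++) {
            mean[i] /= docs.length;
        }
        return mean;
    }

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    /** Index of the centroid closest to the document. */
    static int nearest(double[] doc, double[][] centroids) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++) {
            if (squaredDistance(doc, centroids[c]) < squaredDistance(doc, centroids[best])) {
                best = c;
            }
        }
        return best;
    }

    /** Runs k-means from the given seeds; returns the cluster index of each document. */
    static int[] cluster(double[][] docs, double[][] seeds, int maxIterations) {
        int k = seeds.length;
        int dim = seeds[0].length;
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = seeds[c].clone();
        }
        int[] assignment = new int[docs.length];
        Arrays.fill(assignment, -1);
        for (int iter = 0; iter < maxIterations; iter++) {
            // Assignment step: attach every document to its nearest centroid.
            boolean changed = false;
            for (int d = 0; d < docs.length; d++) {
                int c = nearest(docs[d], centroids);
                if (c != assignment[d]) {
                    assignment[d] = c;
                    changed = true;
                }
            }
            if (!changed) {
                break;  // converged
            }
            // Update step: move each centroid to the mean of its members.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int d = 0; d < docs.length; d++) {
                counts[assignment[d]]++;
                for (int i = 0; i < dim; i++) {
                    sums[assignment[d]][i] += docs[d][i];
                }
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue;  // an empty cluster keeps its previous centroid
                }
                for (int i = 0; i < dim; i++) {
                    centroids[c][i] = sums[c][i] / counts[c];
                }
            }
        }
        return assignment;
    }
}
```

With seeds built as `centroid(labeledPolitics)`, `centroid(labeledMadonna)` and `centroid(labeledSciFi)` (hypothetical arrays of hand-labeled vectors), index 0 in the result means politics, 1 means Madonna and 2 means sci-fi. Mahout’s k-means job accepts a set of initial clusters as input, which is presumably where such seeds would go.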
