The Archive Base

Editorial Team
Asked: June 9, 2026 (2026-06-09T01:56:54+00:00)

I am stuck between a decision to apply classification or clustering on the data

I can't decide whether to apply classification or clustering to the data set I have. The more I think about it, the more confused I get. Here's what I am facing.

I have news documents (around 3000 and continuously increasing) containing news about companies, investment, stocks, the economy, quarterly income etc. My goal is to sort the news so that I know which news item corresponds to which company. E.g. for the news item “Apple launches new iphone”, I need to associate the company Apple with it. A news item/document only contains a ‘title’ and a ‘description’, so I have to analyze the text to find out which company the news refers to. It could be multiple companies too.

To solve this, I turned to Mahout.

I started with clustering. I was hoping to get ‘Apple’, ‘Google’, ‘Intel’ etc. as the top terms in my clusters, so that I would know the news in a cluster corresponds to its cluster label, but things turned out differently. I got ‘investment’, ‘stocks’, ‘correspondence’, ‘green energy’, ‘terminal’, ‘shares’, ‘street’, ‘olympics’ and lots of other terms as the top ones (which makes sense, as clustering algorithms look for common terms). There were some ‘Apple’ clusters, but very few news items were associated with them. I thought maybe clustering is not suited to this kind of problem, as much of the company news goes into more general clusters (investment, profit) instead of the specific company cluster (Apple).

I then started reading about classification, which requires training data. The name was convincing too, as I actually want to ‘classify’ my news items into ‘company names’. As I read on, I got the impression that the name classification is a bit deceiving and that the technique is used more for prediction than for classification. My other confusion was: how can I prepare training data for news documents? Let's assume I have a list of companies that I am interested in, and I write a program to produce training data for the classifier. The program checks whether the news title or description contains the company name ‘Apple’, and if so labels it as a news story about Apple. Is this how I can prepare training data? (Of course, I read that training data is actually a set of predictors and target variables.) If so, then why should I use Mahout classification in the first place? I should ditch Mahout and instead use the little program that I wrote for the training data (which actually does the classification).

You can see how confused I am about how to address this issue. Another thing I wonder is whether it is possible to make a system intelligent enough that, if the news says ‘iphone sales at a record high’ without using the word ‘Apple’, it can still classify it as news related to Apple.

Thank you in advance for pointing me in the right direction.

1 Answer

  1. Editorial Team
    Added an answer on June 9, 2026 at 1:56 am (2026-06-09T01:56:55+00:00)

    Copying my reply from the mailing list:

    Classifiers are supervised learning algorithms, so you need to provide
    a bunch of examples of positive and negative classes. In your example,
    it would be fine to label a bunch of articles as “about Apple” or not,
    then use feature vectors derived from TF-IDF as input, with these
    labels, to train a classifier that can tell when an article is “about
    Apple”.
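
A minimal sketch of that idea in plain Python, assuming a handful of hand-labeled titles. The sample data and the names `fit_idf`, `tfidf`, `train` and `is_about_apple` are made up for illustration, and the nearest-centroid rule stands in for whichever classifier you would actually train (Mahout's or otherwise):

```python
import math
from collections import Counter

def tokenize(text):
    # Lowercase and strip surrounding punctuation; a deliberately crude tokenizer.
    words = (w.strip(".,!?:;\"'()").lower() for w in text.split())
    return [w for w in words if w]

def fit_idf(docs):
    # Smoothed inverse document frequency for every term seen in training.
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    n = len(docs)
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}

def tfidf(doc, idf):
    # L2-normalised TF-IDF vector as a sparse dict; unseen terms are ignored.
    counts = Counter(t for t in tokenize(doc) if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
    return {t: w / norm for t, w in vec.items()}

def centroid(vectors):
    # Component-wise mean of a list of sparse vectors.
    total = Counter()
    for vec in vectors:
        for t, w in vec.items():
            total[t] += w
    return {t: w / len(vectors) for t, w in total.items()}

def dot(a, b):
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def train(labeled):
    # labeled: list of (text, is_about_apple) pairs, labeled by hand.
    idf = fit_idf([text for text, _ in labeled])
    pos = centroid([tfidf(text, idf) for text, y in labeled if y])
    neg = centroid([tfidf(text, idf) for text, y in labeled if not y])
    return idf, pos, neg

def is_about_apple(model, text):
    # Assign the class whose centroid is more similar to the document.
    idf, pos, neg = model
    vec = tfidf(text, idf)
    return dot(vec, pos) > dot(vec, neg)

# Hypothetical hand-labeled examples (True means "about Apple the company").
labeled = [
    ("Apple launches new iPhone", True),
    ("iPhone sales lift Apple quarterly income", True),
    ("Apple unveils iPad and iPhone updates", True),
    ("Apple juice shown to reduce risk of dementia", False),
    ("Google shares rise on investment news", False),
    ("Intel quarterly income beats forecasts", False),
]
model = train(labeled)
print(is_about_apple(model, "iPhone sales at a record high"))
print(is_about_apple(model, "Apple juice shown to reduce dementia risk"))
```

With labels like these, a title that never mentions ‘Apple’ can still land in the positive class because ‘iphone’ only occurs in positive examples. How well that generalizes depends entirely on how many labeled examples you provide.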

    I don’t think it will work to generate the training set
    automatically by labeling with a simple rule, such as “it is about
    Apple if ‘Apple’ is in the title”. If you do that, there is no point
    in training a classifier: you can make a trivial classifier that
    achieves 100% accuracy on a test set labeled the same way by just
    checking whether ‘Apple’ is in the title. So yes, you are right, this
    gains you nothing.

    Clearly you want to learn something subtler from the classifier, so
    that an article titled “Apple juice shown to reduce risk of dementia”
    isn’t classified as about the company. You’d really need to feed it
    hand-classified documents.
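
To make the problem with rule-generated labels concrete, here is a small sketch. The function name `keyword_rule` and the sample titles are illustrative assumptions, not anything from Mahout:

```python
def keyword_rule(title, company="Apple"):
    # The "little program" from the question: label by substring match.
    return company.lower() in title.lower()

titles = [
    "Apple launches new iPhone",
    "Apple juice shown to reduce risk of dementia",
    "iPhone sales at a record high",
    "Intel quarterly income beats forecasts",
]

# Labels produced by the rule itself ...
rule_labels = [keyword_rule(t) for t in titles]

# ... make the rule look perfect when it is scored against them.
accuracy = sum(keyword_rule(t) == y for t, y in zip(titles, rule_labels)) / len(titles)
print("accuracy against its own labels:", accuracy)

# Against human judgment the same rule makes both kinds of mistake.
print(keyword_rule("Apple juice shown to reduce risk of dementia"))  # matched, yet not about the company
print(keyword_rule("iPhone sales at a record high"))                 # missed, yet about the company
```

The perfect score only says that the rule agrees with itself. It takes hand-assigned labels to expose the juice article and the iPhone article.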

    That’s the bad news. But yes, you can certainly train N classifiers
    for N topics this way, one per company.
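
A rough sketch of the "N classifiers for N topics" setup, assuming each hand-labeled item carries a set of company tags so that one article can belong to several companies. The tiny multinomial naive Bayes model is my choice for illustration only, and every name and data item here is hypothetical:

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def train_binary(docs, labels):
    # Multinomial naive Bayes for a single yes/no question.
    counts = {True: Counter(), False: Counter()}
    for doc, y in zip(docs, labels):
        counts[y].update(tokenize(doc))
    vocab = set(counts[True]) | set(counts[False])
    return counts, Counter(labels), vocab

def log_score(model, text, y):
    counts, n_docs, vocab = model
    total = sum(counts[y].values())
    # Class prior plus Laplace-smoothed log-likelihood of each known token.
    score = math.log(n_docs[y] / sum(n_docs.values()))
    for tok in tokenize(text):
        if tok in vocab:
            score += math.log((counts[y][tok] + 1) / (total + len(vocab)))
    return score

def train_all(labeled, companies):
    # One independent binary classifier per company.
    docs = [text for text, _ in labeled]
    return {c: train_binary(docs, [c in tags for _, tags in labeled])
            for c in companies}

def predict_companies(models, text):
    # A document gets every company whose classifier says "yes".
    return {c for c, m in models.items()
            if log_score(m, text, True) > log_score(m, text, False)}

# Hypothetical hand-labeled titles; an item may carry several tags.
labeled = [
    ("apple launches new iphone", {"Apple"}),
    ("iphone sales lift apple income", {"Apple"}),
    ("google unveils android phone", {"Google"}),
    ("google search revenue grows", {"Google"}),
    ("apple and google settle patent dispute", {"Apple", "Google"}),
    ("intel ships new server chips", {"Intel"}),
]
models = train_all(labeled, ["Apple", "Google", "Intel"])
print(predict_companies(models, "apple and google discuss patent licence"))
print(predict_companies(models, "intel ships new chips"))
```

Because each classifier answers independently, the result can be empty, a single company, or several, which matches the "could be multiple companies" requirement in the question.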

    Classifiers put items into a class or not. They are not the same as
    regression techniques which predict a continuous value for an input.
    They’re related but distinct.

    Clustering has the advantage of being unsupervised. You don’t need
    labels. However the resulting clusters are not guaranteed to match up
    to your notion of article topics. You may see a cluster that has a lot
    of Apple articles, some about the iPod, but also some about Samsung
    and laptops in general. I don’t think this is the best tool for your
    problem.
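
For comparison, a toy sketch of the unsupervised route: k-means with cosine similarity over plain term-frequency vectors. The titles, the helper names and the deterministic choice of starting centroids are all assumptions made for this example:

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def unit(vec):
    # Scale a sparse vector to length 1 so that dot() is cosine similarity.
    norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
    return {t: w / norm for t, w in vec.items()}

def dot(a, b):
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def mean_direction(vectors):
    total = Counter()
    for vec in vectors:
        for t, w in vec.items():
            total[t] += w
    return unit(total)

def kmeans(vectors, k, iterations=10):
    # Deterministic start: take k evenly spaced documents as initial centroids.
    centroids = [vectors[i * len(vectors) // k] for i in range(k)]
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        # Assign each document to its most similar centroid.
        assignment = [max(range(k), key=lambda c: dot(v, centroids[c]))
                      for v in vectors]
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:  # keep the old centroid if a cluster goes empty
                centroids[c] = mean_direction(members)
    return assignment, centroids

def top_terms(centroid, n=3):
    return sorted(centroid, key=lambda t: (-centroid[t], t))[:n]

# Hypothetical news titles; no labels are used anywhere.
docs = [
    "apple shares rise after iphone launch",
    "google shares rise on ad revenue",
    "intel shares fall on weak chip demand",
    "apple iphone sales hit record",
    "google invests in green energy",
    "intel invests in new chip plant",
]
vectors = [unit(Counter(tokenize(d))) for d in docs]
assignment, centroids = kmeans(vectors, k=2)
for c, cen in enumerate(centroids):
    print(c, top_terms(cen), [d for d, a in zip(docs, assignment) if a == c])
```

Nothing in this procedure knows about companies. Whether a group ends up organized around ‘apple’ or around a shared word such as ‘shares’ or ‘invests’ depends only on which terms co-occur in the data.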
