The Archive Base


Editorial Team
Asked: May 10, 2026

Without getting a degree in information retrieval, I’d like to know if there exists

Without getting a degree in information retrieval, I’d like to know whether there are any algorithms for counting how often words occur in a given body of text. The goal is to get a ‘general feel’ for what people are saying across a set of textual comments, along the lines of Wordle.

What I’d like:

  • ignore articles, pronouns, etc (‘a’, ‘an’, ‘the’, ‘him’, ‘them’ etc)
  • preserve proper nouns
  • ignore hyphenation, except for the soft kind

Reaching for the stars, these would be peachy:

  • handling stemming & plurals (e.g. like, likes, liked, liking match the same result)
  • grouping of adjectives (adverbs, etc) with their subjects (‘great service’ as opposed to ‘great’, ‘service’)

I’ve attempted some basic stuff using Wordnet but I’m just tweaking things blindly and hoping it works for my specific data. Something more generic would be great.

1 Answer

  1. Added an answer on May 10, 2026 at 3:17 pm

    You’ll need not one, but several nice algorithms, along the lines of the following.

    • ignoring pronouns is done via a stoplist.
    • preserving proper nouns? You mean detecting named entities, like Hoover Dam, and treating them as ‘one word’, or compound nouns, like programming language? I’ll give you a hint: that’s a tough one, but there exist libraries for both. Look for NER (named entity recognition) and lexical chunking. OpenNLP is a Java toolkit that does both.
    • ignoring hyphenation? You mean, like at line breaks? Use regular expressions and verify the resulting word via dictionary lookup.
    • handling plurals/stemming: you can look into the Snowball stemmer. It does the trick nicely.
    • ‘grouping’ adjectives with their nouns is generally a task of shallow parsing. But if you are looking specifically for qualitative adjectives (good, bad, shitty, amazing…) you may be interested in sentiment analysis. LingPipe does this, and a lot more.
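The stoplist and line-break hyphenation steps from the list above can be sketched with nothing but the Python standard library. This is a minimal illustration, not a production pipeline: `STOPLIST` and `DICTIONARY` are hypothetical, deliberately tiny word sets, and lowercasing every token is a simplification that throws away the capitalization you would need for proper nouns.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stoplist; a real one would be much longer.
STOPLIST = {"a", "an", "the", "him", "them", "it", "is", "was", "and", "of", "to"}

# Hypothetical mini-dictionary used to verify words re-joined across line breaks.
DICTIONARY = {"service", "excellent", "waiter"}


def join_linebreak_hyphens(text, dictionary=DICTIONARY):
    """Join 'ser-<newline>vice' into 'service', but only if the result is a known word."""
    def repl(match):
        joined = match.group(1) + match.group(2)
        # Keep the original text when the joined form is not in the dictionary.
        return joined if joined.lower() in dictionary else match.group(0)

    return re.sub(r"(\w+)-\s*\n\s*(\w+)", repl, text)


def word_frequencies(text, stoplist=STOPLIST):
    """Count lowercased tokens, skipping anything on the stoplist."""
    text = join_linebreak_hyphens(text)
    # Letters, optionally joined by an inner hyphen or apostrophe ("well-known", "don't").
    tokens = re.findall(r"[A-Za-z]+(?:[-'][A-Za-z]+)*", text)
    return Counter(t.lower() for t in tokens if t.lower() not in stoplist)


if __name__ == "__main__":
    sample = "The ser-\nvice was excellent. The waiter was nice, and the service was fast."
    for word, count in word_frequencies(sample).most_common(3):
        print(word, count)
```

Calling `most_common(n)` on the resulting `Counter` gives the top terms, which is roughly what a Wordle-style cloud needs as input.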

    I’m sorry, I know you said you wanted to KISS, but unfortunately, your demands aren’t that easy to meet. Nevertheless, there exist tools for all of this, and you should be able to just tie them together without having to implement any of it yourself, if you don’t want to. If you do want to implement one of the tasks yourself, I suggest you look at stemming; it’s the easiest of all.
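To give a feel for why stemming is the approachable one, here is a toy suffix stripper. It is explicitly not the Snowball or Porter algorithm: the `SUFFIXES` list and the `min_stem` cutoff are assumptions made up for this sketch, and it will mis-stem plenty of real English words. It only shows the basic idea of collapsing inflected forms onto one key before counting.

```python
from collections import Counter

# Toy rule set, checked in order (longest first). NOT the Snowball/Porter rules.
SUFFIXES = ("ing", "ed", "es", "s", "e")


def naive_stem(word, min_stem=3):
    """Strip the first matching suffix, as long as at least min_stem letters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def stemmed_frequencies(words):
    """Count words after collapsing them onto their naive stems."""
    return Counter(naive_stem(w) for w in words)


if __name__ == "__main__":
    # All four forms from the question collapse onto the same stem.
    print(stemmed_frequencies(["like", "likes", "liked", "liking"]))
```

The stems are not dictionary words (here the shared key is `lik`), which is normal for stemmers; if you need readable output, keep one original surface form per stem for display.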

    If you go with Java, combine Lucene with the OpenNLP toolkit. You will get very good results, as Lucene already has a stemmer built in and plenty of tutorials. The OpenNLP toolkit, on the other hand, is poorly documented, but you won’t need too much out of it. You might also be interested in NLTK, which is written in Python.

    I would say you should drop your last requirement, as it involves shallow parsing and will definitely not improve your results.

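    Ah, btw, the exact term for that document-term-frequency thing you were looking for is tf-idf (term frequency-inverse document frequency). It weighs how often a term occurs in a document against how many documents contain it, and it’s pretty much the standard way to score terms per document. To do it properly, you won’t get around building a term-document matrix, i.e. one high-dimensional vector per document.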
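As a rough sketch of what tf-idf computes, the function below uses one common textbook variant: term frequency is the count divided by the document length, and inverse document frequency is the unsmoothed `log(N / df)`. Libraries usually apply smoothing and normalization on top of this, so treat the exact numbers as illustrative. The function name `tf_idf` and the input shape (a list of already-tokenized documents) are assumptions for this example.

```python
import math
from collections import Counter


def tf_idf(documents):
    """documents: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(documents)

    # Document frequency: in how many documents does each term appear at least once?
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            # tf = count / document length, idf = log(N / df), unsmoothed.
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


if __name__ == "__main__":
    comments = [
        ["great", "service"],
        ["slow", "service"],
        ["great", "food"],
    ]
    for doc_weights in tf_idf(comments):
        print(doc_weights)
```

A term that appears in every document gets a weight of zero under this variant, which is the property that lets tf-idf push common filler words down without an explicit stoplist.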

    … Yes, I know. After taking a seminar on IR, my respect for Google was even greater. After doing some stuff in IR, my respect for them fell just as quickly, though.
