The Archive Base

Editorial Team
Asked: May 22, 2026 (2026-05-22T14:40:26+00:00)

I am working with some really large collections of newspaper articles. I have them in a MySQL database, and I can query them all.

I am now searching for ways to help me tag these articles with somewhat descriptive tags.

All these articles are accessible from a URL that looks like this:

http://web.site/CATEGORY/this-is-the-title-slug

So at least I can use the category to figure out what type of content we are working with. However, I also want to tag based on the article text.
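
A minimal sketch of pulling the category out of such a URL, assuming the category is always the first path segment as in the pattern above. The function name `category_from_url` and the example category `sports` are made up for illustration.

```python
from urllib.parse import urlparse

def category_from_url(url):
    """Return the first path segment of an article URL, or None if absent."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    # Expect at least a category and a title slug, as in the URL pattern above.
    return segments[0] if len(segments) >= 2 else None

# "sports" is a hypothetical category used only for this example.
print(category_from_url("http://web.site/sports/this-is-the-title-slug"))
```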

My initial approach was doing this:

  1. Get all articles
  2. Get all words, remove all punctuation, split by space, and count them by occurrence
  3. Analyze them, and filter out common non-descriptive words like “them”, “I”, “this”, “these”, “their” etc.
  4. When all the common words are filtered out, the only words left are the ones that are tag-worthy.
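
The steps above can be sketched roughly as follows. This assumes the article texts have already been fetched from MySQL into a list of strings; the stop list and the name `candidate_tags` are hypothetical and only for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop list; a real one would be much longer.
STOP_WORDS = {"them", "i", "this", "these", "their", "the", "a", "of", "in", "and"}

def candidate_tags(articles, top_n=10):
    """Count words across all articles, skipping common non-descriptive ones."""
    counts = Counter()
    for text in articles:
        # Lowercase, strip punctuation, then split on whitespace.
        words = re.sub(r"[^\w\s]", "", text.lower()).split()
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts.most_common(top_n)

sample = [
    "The election results are in.",
    "Election night: the results, and the reaction.",
]
print(candidate_tags(sample, top_n=2))
```

Even in this toy example a word like "are" slips through unless it is added to the stop list by hand.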

But this turned out to be a rather manual task, and not a very pretty or helpful approach.

This also suffered from the problem of words or names that are split by a space. For example, if 1,000 articles contain the name “John Doe” and 1,000 articles contain the name “John Hanson”, I would only get the word “John” out of it, not the first name together with the last name.

1 Answer

  1. Editorial Team
    Added an answer on May 22, 2026 at 2:40 pm

    Automatically tagging articles is really a research problem and you can spend a lot of time re-inventing the wheel when others have already done much of the work. I’d advise using one of the existing natural language processing toolkits like NLTK.

    To get started, I would suggest implementing a proper tokeniser (much better than splitting on whitespace), and then looking at chunking and stemming algorithms.
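
To show why tokenising beats a plain whitespace split, here is a small regex-based sketch. It is not NLTK's tokeniser and makes no attempt to match its output; the pattern and the name `tokenize` are assumptions made for this example, and a toolkit tokeniser handles far more cases.

```python
import re

# A word (allowing internal apostrophes or hyphens) or a single punctuation mark.
TOKEN_RE = re.compile(r"\w+(?:['-]\w+)*|[^\w\s]")

def tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_RE.findall(text)

sentence = "John Doe's well-known article, published today."
print(sentence.split())    # punctuation stays glued to "article," and "today."
print(tokenize(sentence))  # punctuation becomes separate tokens
```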

    You might also want to count frequencies for n-grams, i.e. sequences of words, instead of individual words. This would take care of “words split by a space”. Toolkits like NLTK have built-in functions for this.
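
A rough sketch of counting n-grams with only the standard library, using the names from the question. The helper names `ngrams` and `count_ngrams` are made up here; a toolkit's built-in helpers would replace them in practice.

```python
from collections import Counter

def ngrams(tokens, n):
    """Yield every run of n consecutive tokens as a tuple."""
    return zip(*(tokens[i:] for i in range(n)))

def count_ngrams(token_lists, n=2):
    """Count n-grams across a collection of already tokenised articles."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(ngrams(tokens, n))
    return counts

# Toy "articles", split on whitespace only to keep the example short.
docs = [
    "John Doe spoke today".split(),
    "John Hanson replied".split(),
    "Later John Doe left".split(),
]
bigrams = count_ngrams(docs)
print(bigrams[("John", "Doe")], bigrams[("John", "Hanson")])
```

With bigrams, “John Doe” and “John Hanson” are counted as separate candidates instead of collapsing into “John”.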

    Finally, as you iteratively improve your algorithm, you might want to train on a random subset of the database and then test how the algorithm tags the remaining articles to see how well it works.
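
One possible way to make that random split, assuming the articles can be referred to by numeric IDs. The function name, the 20% fraction and the fixed seed are arbitrary choices for illustration.

```python
import random

def split_articles(article_ids, train_fraction=0.2, seed=0):
    """Shuffle the IDs and split them into a training subset and the rest."""
    ids = list(article_ids)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(ids)
    cut = int(len(ids) * train_fraction)
    return ids[:cut], ids[cut:]

# Hypothetical article IDs 1..100.
train_ids, remaining_ids = split_articles(range(1, 101))
print(len(train_ids), len(remaining_ids))
```

Develop the tagging rules on `train_ids` only, then check the tags produced for `remaining_ids`.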

© 2021 The Archive Base. All Rights Reserved