The Archive Base

Editorial Team
Asked: June 5, 2026

I need advice or directions on how to write an algorithm which will find

I need advice or directions on how to write an algorithm which will find keywords or keyphrases in a string.

The string contains:

  • Technical information written in English (GB)
  • Words are mostly separated by spaces
  • A keyword does not contain a space but it may contain a hyphen, apostrophe, colon etc.
  • A keyphrase may contain a space, a comma or other punctuation
  • If two or more keywords appear together then it is likely a keyphrase e.g. “inverter drive”
  • The text also contains HTML but this can be removed beforehand if necessary
  • Non-keywords would be words like “and”, “the”, “we”, “see”, “look” etc.
  • Keywords are case-insensitive e.g. “Inverter” and “inverter” are the same keyword

The algorithm has the following requirements:

  1. Operate in a batch-processing scenario e.g. run once or twice a day
  2. Process strings varying in length from roughly 200 to 7000 characters
  3. Process 1000 strings in less than 1 hour
  4. Will execute on a server with moderately good power
  5. Written in one of the following: C#, VB.NET or T-SQL, or maybe even F#, Python or Lua
  6. Does not rely on a list of predefined keywords or keyphrases
  7. But can rely on a list of keyword exclusions e.g. “and”, “the”, “go” etc.
  8. Ideally transferable to other languages, i.e. doesn’t rely on language-specific features such as metaprogramming
  9. Output a list of keyphrases (descending order of frequency) followed by a list of keywords (descending order of frequency)

It would be extra cool if it could process up to 8000 characters in a matter of seconds, so that it could be run in real-time, but I’m already asking enough!

Just looking for advice and directions:

  • Should this be regarded as two separate algorithms?
  • Are there any established algorithms which I could follow?
  • Are my requirements feasible?

Many thanks.

P.S. The strings will be retrieved from a SQL Server 2008 R2 database, so ideally the language would have support for this, if not then it must be able to read/write to STDOUT, a pipe, a stream or a file etc.

1 Answer

  1. Editorial Team
    Added an answer on June 5, 2026 at 4:09 pm

    The logic involved is too complicated to program comfortably in T-SQL, so choose a language like C#. Start with a simple desktop application. If loading all the records into this application later proves too slow, you could write a C# stored procedure that executes on the SQL Server itself. Depending on the server’s security policy, the assembly may need to be signed with a strong-name key.


    Now to the algorithm. A list of excluded words is commonly called a stop word list. Searching the web for that term should turn up stop word lists you can start with. Add these stop words to a HashSet<T> (I’ll be using C# here):

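    // Requires: using System; using System.Collections.Generic; using System.IO;
    // Assuming that each line contains one stop word.
    // The verbatim string (@"...") keeps the backslash from being read as an escape.
    HashSet<string> stopWords =
        new HashSet<string>(File.ReadLines(@"C:\stopwords.txt"), StringComparer.OrdinalIgnoreCase);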
    

    Later you can check whether a keyword candidate is in the stop word list with

    if (!stopWords.Contains(candidate)) {
        // We have a keyword
    }
    

    HashSets are fast. A lookup takes O(1) time on average, meaning that the time required does not depend on the number of items the set contains.

    Finding the keywords can easily be done with Regex (in the System.Text.RegularExpressions namespace).

    string text = ...; // Load text from DB
    MatchCollection matches = Regex.Matches(text, "[a-z]([:']?[a-z])*",
                                            RegexOptions.IgnoreCase);
    foreach (Match match in matches) {
        if (!stopWords.Contains(match.Value)) {
            ProcessKeyword(match.Value); // Do whatever you need to do here
        }
    }
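
    The same idea, regex matching plus stop word filtering, can also be sketched in Python, which the question lists as an acceptable language. This is only a minimal sketch under assumptions: the function name `count_keywords`, the tiny inline stop word set and the sample text are made up for illustration, and the pattern additionally allows a hyphen inside a word because the question mentions hyphenated keywords. It counts keywords case-insensitively and returns them in descending order of frequency, as the question asks.

```python
import re
from collections import Counter

# Hypothetical, tiny stop word list; a real one would be loaded from a file.
STOP_WORDS = {"and", "the", "we", "see", "look", "a", "is", "to", "of", "for"}

# A letter, then more letters, each optionally preceded by one of - : '
WORD_PATTERN = re.compile(r"[a-z](?:[-:']?[a-z])*", re.IGNORECASE)


def count_keywords(text, stop_words=STOP_WORDS):
    """Return (keyword, count) pairs, most frequent first."""
    counts = Counter()
    for match in WORD_PATTERN.finditer(text):
        word = match.group(0).lower()  # keywords are case-insensitive
        if word not in stop_words:
            counts[word] += 1
    return counts.most_common()


sample = "The inverter drives the motor. See the Inverter manual for the three-phase motor."
keywords = count_keywords(sample)
```

    Here `keywords` is a list of pairs, so writing it back to the database or to STDOUT is a simple loop. Counting with a dictionary keeps the whole pass linear in the length of the text.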
    

    If you find that a-z is too restrictive and you need accented letters, you can change the pattern to @"\p{L}([:']?\p{L})*". The Unicode category \p{L} matches all letters, including modifier letters.
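
    If you try this in Python instead of C#, note that the standard `re` module does not support `\p{L}`. A common workaround, shown here as a sketch rather than an exact equivalent, is the class `[^\W\d_]`, which only approximates the set of Unicode letters. The names `LETTER` and `UNICODE_WORD` and the sample string are assumptions for illustration.

```python
import re

# [^\W\d_] means "a word character that is neither a digit nor an underscore",
# which approximates the Unicode letter category \p{L}.
LETTER = r"[^\W\d_]"
UNICODE_WORD = re.compile(LETTER + r"(?:[:']?" + LETTER + r")*")

# The colon after "Umrichters" is followed by a space, so it is not part of the word.
words = UNICODE_WORD.findall("Größe des Umrichters: naïve café, l'été")
```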

    The phrases are more complicated. You could try to split the text into phrases first and then apply the keyword search on these phrases instead of searching the keywords in the whole text. This would give you the number of keywords in a phrase at the same time.

    Splitting the text into phrases involves searching for sentences ending with “.” or “?” or “!” or “:”. You should exclude dots and colons that appear within a word.

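    // A non-capturing group (?:...) is used because Regex.Split would otherwise
    // include the captured whitespace in the resulting array.
    string[] phrases = Regex.Split(text, @"[\.\?!:](?:\s|$)");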
    

    This looks for a punctuation mark followed by either whitespace or the end of the text. I must admit that it is not perfect: it might wrongly treat the dot after an abbreviation as the end of a sentence. You will have to experiment in order to refine the splitting mechanism.
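
    Putting the pieces together, one possible heuristic for the keyphrases is to split the text into sentences and treat every run of two or more consecutive non-stop words as a keyphrase candidate. This is similar in spirit to stop-word-delimited approaches such as RAKE, but it is only a sketch under assumptions, not an established implementation: the function name `count_keyphrases`, the stop word set and the sample text are hypothetical, and commas inside a sentence are ignored, so words on either side of a comma can end up in the same phrase. The sketch is in Python, one of the languages the question allows.

```python
import re
from collections import Counter

# Hypothetical, tiny stop word list; a real one would be loaded from a file.
STOP_WORDS = {"and", "the", "we", "see", "look", "a", "is", "to", "of", "for"}

# Sentence boundary: . ? ! or : followed by whitespace or the end of the text.
SENTENCE_END = re.compile(r"[.?!:](?:\s|$)")
WORD_PATTERN = re.compile(r"[a-z](?:[-:']?[a-z])*", re.IGNORECASE)


def count_keyphrases(text, stop_words=STOP_WORDS):
    """Return (keyphrase, count) pairs, most frequent first."""
    counts = Counter()
    for sentence in SENTENCE_END.split(text):
        words = [m.group(0).lower() for m in WORD_PATTERN.finditer(sentence)]
        run = []  # current run of consecutive non-stop words
        for word in words + [None]:  # None is a sentinel that flushes the last run
            if word is not None and word not in stop_words:
                run.append(word)
            else:
                if len(run) >= 2:  # two or more keywords together form a phrase
                    counts[" ".join(run)] += 1
                run = []
    return counts.most_common()


sample = "The inverter drive is faulty. See the Inverter Drive and the motor speed."
keyphrases = count_keyphrases(sample)
```

    Because a phrase never crosses a sentence boundary or a stop word, each text is processed in a single linear pass, which should be comfortably within the stated limit of 1000 strings per hour.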

© 2021 The Archive Base. All Rights Reserved