Editorial Team
Asked: June 5, 2026 (2026-06-05T23:18:46+00:00)

I have a vector of objects (objects are term nodes that amongst other fields

I have a vector of objects (the objects are term nodes that, amongst other fields, contain a string field with the term string):

class TermNode {
private:
    std::wstring term;
    double weight;
    ...
public:
    ...
};

After some processing and score calculation, these objects are finally stored in a vector of TermNode pointers such as

std::vector<TermNode *> termlist;

A resulting list of this vector, containing up to 400 entries, looks like this:

DEBUG: 'knowledge' term weight=13.5921
DEBUG: 'discovery' term weight=12.3437
DEBUG: 'applications' term weight=11.9476
DEBUG: 'process' term weight=11.4553
DEBUG: 'knowledge discovery' term weight=11.4509
DEBUG: 'information' term weight=10.952
DEBUG: 'techniques' term weight=10.4139
DEBUG: 'web' term weight=10.3733
...

What I am trying to do is clean up that final list by removing single terms that are also contained in phrases in the terms list. For example, in the snippet above there is the phrase ‘knowledge discovery’, so I would like to remove the single terms ‘knowledge’ and ‘discovery’, because they are also in the list and redundant in this context. I want to keep the phrases containing the single terms. I am also thinking about removing all strings of 3 characters or fewer, but that is just a thought for now.
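The length-based part of that idea can be sketched with the erase-remove idiom. This is only an illustration under assumptions: it works on a plain vector of strings rather than on TermNode pointers, the helper name removeShortTerms is hypothetical, and the lambda requires a compiler with C++11 support.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not from the question): drops every term whose length
// is at most maxLen characters, using the erase-remove idiom.
void removeShortTerms(std::vector<std::wstring>& terms, std::size_t maxLen) {
    terms.erase(
        // remove_if moves the terms to keep to the front and returns the
        // start of the leftover tail, which erase then cuts off.
        std::remove_if(terms.begin(), terms.end(),
                       [maxLen](const std::wstring& term) {
                           return term.length() <= maxLen;
                       }),
        terms.end());
}
```

Calling removeShortTerms(terms, 3) would drop a term such as 'web' while keeping 'knowledge' and 'knowledge discovery'; the relative order of the remaining terms is preserved.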

For this cleanup process I would like to code a class using remove_if / find_if (using the new C++ lambdas) and it would be nice to have that code in a compact class.

I am not really sure how to solve this. The problem is that I would first have to identify which strings to remove, probably by setting a flag as a delete marker. That would mean I would have to pre-process the list: I would have to find the single terms and the phrases that contain one of those single terms. I think that is not an easy task and would need some advanced algorithm. Perhaps a suffix tree to identify substrings?

Another loop over the vector, and maybe a copy of the same vector, could do the cleanup. I am looking for the most time-efficient approach.

I have been playing with the ideas shown in “std::list erase incompatible iterator” (using remove_if / find_if) and in “Erasing multiple objects from a std::vector?”.

So the question is basically: is there a smart way to do this that avoids multiple loops, and how could I identify the single terms for deletion? Maybe I am missing something, but hopefully someone out there can give me a good hint.

Thanks for your thoughts!

Update

I implemented the removal of redundant single terms the way Scrubbins recommended as follows:

/**
 * Functor that gets the term of each TermNode object, checks whether the term
 * string contains whitespace (i.e. the term is a phrase), splits the phrase at
 * whitespace and finally stores these term tokens in a set. Only terms with a
 * weight of at least 'skipAtWeight' are taken into account.
 */
struct findPhrasesAndSplitIntoTokens {
private:
    set<wstring> tokens;
    double skipAtWeight;

public:
    findPhrasesAndSplitIntoTokens(const double skipAtWeight)
    : skipAtWeight(skipAtWeight) {
    }

    /**
     * Implements operator()
     */
    void operator()(const TermNode * tn) {
        // --- skip all terms with a weight below skipAtWeight
        if (tn->getWeight() < skipAtWeight)
            return;

        // --- get term
        wstring term = tn->getTerm();
        // --- scan the term for whitespace (i.e. check if this term is a phrase)
        for (unsigned int i = 0; i < term.length(); i++) {
            // --- iswspace (from <cwctype>) is the wide-character counterpart
            // --- of isspace, which is only defined for unsigned char values
            if (iswspace(term.at(i))) {
                if (0) { // --- debug output, disabled
                    wcout << "input term=" << term << endl;
                }
                // --- simply tokenize term by space and store tokens into
                // --- the tokens set
                // --- TODO: check if this really is UTF-8 aware, esp. for
                // --- strings containing umlauts, etc  !!
                wistringstream iss(term);
                copy(istream_iterator<wstring,
                        wchar_t, std::char_traits<wchar_t> >(iss),
                    istream_iterator<wstring,
                        wchar_t, std::char_traits<wchar_t> >(),
                    inserter(tokens, tokens.begin()));
                if (0) { // --- debug output, disabled
                    wcout << "size of token set=" << tokens.size() << endl;
                    for_each(tokens.begin(), tokens.end(), printSingleToken());
                }
                // --- the whole phrase is tokenized at the first whitespace
                // --- found, so there is no need to scan any further
                break;
            }
        }
    }

    /**
     * return set of extracted tokens
     */
    set<wstring> getTokens() const {
        return tokens;
    }
};

/**
 * Functor to find terms in tokens set
 */
class removeTermIfInPhraseTokensSet {
private:
    set<wstring> tokens;

public:
    removeTermIfInPhraseTokensSet(const set<wstring>& termTokens)
    : tokens(termTokens) {
    }

    /**
     * Implements operator()
     */
    bool operator()(const TermNode * tn) const {
        // --- true if the term is one of the extracted phrase tokens
        return tokens.find(tn->getTerm()) != tokens.end();
    }
};

...

findPhrasesAndSplitIntoTokens objPhraseTokens(6.5);
objPhraseTokens = std::for_each(
    termList.begin(), termList.end(), objPhraseTokens);
set<wstring> tokens = objPhraseTokens.getTokens();
wcout << "size of tokens set=" << tokens.size() << endl;
for_each(tokens.begin(), tokens.end(), printSingleToken());

// --- remove all extracted single tokens from the final terms list
// --- of similar search terms 
removeTermIfInPhraseTokensSet removeTermIfFound(tokens);
termList.erase(
    remove_if(
        termList.begin(), termList.end(), removeTermIfFound),
    termList.end()
);

for (vector<TermNode *>::const_iterator tl_iter = termList.begin();
      tl_iter != termList.end(); tl_iter++) {
    wcout << "DEBUG: '" << (*tl_iter)->getTerm() << "' term weight=" << (*tl_iter)->getNormalizedWeight() << endl;
    if ((*tl_iter)->getNormalizedWeight() <= 6.5) break;
}

...

I couldn’t use the C++11 lambda syntax, because my Ubuntu servers currently have g++ 4.4.1 installed. Anyway, it does the job for now.
The way to go is to check the quality of the resulting weighted terms against other search result sets and see how I can improve the quality and find a way to boost the more relevant terms in conjunction with the original query term. It might not be an easy task; I wish there were some “simple heuristics”.
But that might be another new question once I have progressed a little further 🙂

So thanks to all for this rich contribution of thoughts!
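For readers whose compiler does support C++11, the two functors from the update could be condensed roughly as follows. This is a sketch, not the author's code: the TermNode class here is a minimal stand-in (the real class is elided in the question), and the function names collectPhraseTokens and removeRedundantTerms are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cwctype>
#include <iterator>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Minimal stand-in for the TermNode class from the question; the real class
// has further members that are elided there.
class TermNode {
    std::wstring term;
    double weight;
public:
    TermNode(const std::wstring& t, double w) : term(t), weight(w) {}
    const std::wstring& getTerm() const { return term; }
    double getWeight() const { return weight; }
};

// Collects the words of every phrase (a term containing whitespace) whose
// weight is at least skipAtWeight.
std::set<std::wstring> collectPhraseTokens(
        const std::vector<TermNode*>& termList, double skipAtWeight) {
    std::set<std::wstring> tokens;
    std::for_each(termList.begin(), termList.end(),
        [&tokens, skipAtWeight](const TermNode* tn) {
            assert(tn != nullptr);  // assumption: the list holds no null pointers
            if (tn->getWeight() < skipAtWeight) return;
            const std::wstring& term = tn->getTerm();
            const bool isPhrase = std::any_of(term.begin(), term.end(),
                [](wchar_t c) { return std::iswspace(c) != 0; });
            if (!isPhrase) return;
            // Stream extraction splits the phrase at whitespace.
            std::wistringstream iss(term);
            std::copy(std::istream_iterator<std::wstring, wchar_t>(iss),
                      std::istream_iterator<std::wstring, wchar_t>(),
                      std::inserter(tokens, tokens.begin()));
        });
    return tokens;
}

// Erases every single term that also occurs as a word of a phrase. Only the
// pointers are removed from the vector; the TermNode objects are not deleted.
void removeRedundantTerms(std::vector<TermNode*>& termList, double skipAtWeight) {
    const std::set<std::wstring> tokens =
        collectPhraseTokens(termList, skipAtWeight);
    termList.erase(
        std::remove_if(termList.begin(), termList.end(),
            [&tokens](const TermNode* tn) {
                return tokens.count(tn->getTerm()) > 0;
            }),
        termList.end());
}
```

A call such as removeRedundantTerms(termList, 6.5) would then replace the for_each / remove_if sequence shown in the update. Phrases survive because a term containing whitespace can never equal a single token.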

1 Answer

Editorial Team
Added an answer on June 5, 2026 at 11:18 pm

What you need to do first is iterate through the list and split all the multi-word values into single words. If you’re allowing Unicode, this means you will need something akin to ICU’s BreakIterators; otherwise you can go with a simple punctuation/whitespace split. Once each string is split into its constituent words, use a hash map to keep a list of all the current words. When you reach a multi-word value, you can then check whether its words have already been found. This should be the simplest way to identify duplicates.
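The splitting and lookup steps described above might look like the following sketch. It assumes the simple punctuation/whitespace split rather than ICU, works on plain strings instead of TermNode pointers, and uses std::unordered_set as the hash container; the names splitWords and wordsOfPhrases are hypothetical.

```cpp
#include <cwctype>
#include <string>
#include <unordered_set>
#include <vector>

// Splits a term at whitespace and punctuation. This is only the "simple
// split" variant; it is not a substitute for ICU's BreakIterator.
std::vector<std::wstring> splitWords(const std::wstring& term) {
    std::vector<std::wstring> words;
    std::wstring current;
    for (wchar_t ch : term) {
        if (std::iswspace(ch) || std::iswpunct(ch)) {
            // A separator ends the current word, if there is one.
            if (!current.empty()) {
                words.push_back(current);
                current.clear();
            }
        } else {
            current += ch;
        }
    }
    if (!current.empty()) words.push_back(current);
    return words;
}

// Gathers the words of all multi-word values in a hash set, so that a
// single-word value can later be looked up in constant time on average.
std::unordered_set<std::wstring> wordsOfPhrases(
        const std::vector<std::wstring>& terms) {
    std::unordered_set<std::wstring> seen;
    for (const std::wstring& term : terms) {
        const std::vector<std::wstring> words = splitWords(term);
        if (words.size() > 1) seen.insert(words.begin(), words.end());
    }
    return seen;
}
```

A single-word term is then redundant whenever the returned set contains it, which makes the whole cleanup two linear passes over the list.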
