The Archive Base

Editorial Team
Asked: June 7, 2026

Advice please. I have a collection of documents that all share a common attribute

Advice please. I have a collection of documents that all share a common attribute (e.g. the word “French” appears). Some of these documents have been marked as not pertinent to this collection (e.g. “French kiss” appears), but not all such documents are guaranteed to have been identified. What is the best method for figuring out which other documents don’t belong?

1 Answer

  1. Editorial Team
    Added an answer on June 7, 2026 at 2:03 pm

    Assumptions

    Given your example “French”, I will work under the assumption that the feature is a word that appears in the document. Also, since you mention that “French kiss” is not relevant, I will further assume that in your case, a feature is a word used in a particular sense. For example, if “pool” is a feature, you may say that documents mentioning swimming pools are relevant, but those talking about pool (the sport, like snooker or billiards) are not relevant.

    • Note: Although word sense disambiguation (WSD) methods would work, they require too much effort and are overkill for this purpose.

    Suggestion: localized language model + bootstrapping

    Think of it this way: you don’t have an incomplete training set, you have a small one. The idea is to use this small training set to build a bigger one. This is bootstrapping (also known as self-training).

    For each occurrence of your feature in the training data, build a language model based only on the words surrounding it. You don’t need to build a model for the entire document. Ideally, just the sentences containing the feature should suffice. This is what I am calling a localized language model (LLM).
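As a rough illustration of the idea above, here is a minimal Python sketch that collects the words around each occurrence of the feature word and counts them. The function names, the regex tokenizer, the fixed token window (instead of whole sentences), and the choice of a plain unigram count model are all my assumptions for illustration, not a prescribed implementation.

```python
import re
from collections import Counter

def context_windows(text, feature, w=3):
    """Return one token list per occurrence of `feature`: up to w tokens on each side."""
    tokens = re.findall(r"[a-z']+", text.lower())
    feature = feature.lower()
    found = []
    for i, tok in enumerate(tokens):
        if tok == feature:
            # Assumption: leave out the feature word itself, since it occurs
            # in every window and says nothing about the sense.
            found.append(tokens[max(0, i - w):i] + tokens[i + 1:i + 1 + w])
    return found

def build_local_model(documents, feature, w=3):
    """Count context words over all windows in `documents` (a unigram model)."""
    counts = Counter()
    for doc in documents:
        for window in context_windows(doc, feature, w):
            counts.update(window)
    return counts

# Toy data, made up for this example.
relevant = ["She studied French literature in Paris."]
irrelevant = ["The film ends with a long French kiss."]
M1 = build_local_model(relevant, "French")
M0 = build_local_model(irrelevant, "French")
```

A unigram model is about the simplest thing that could work here; with more labelled data, a smoothed bigram model over the same windows would be a natural next step.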

    Build two such LLMs from your training data (call it T_0): M1 for pertinent documents and M0 for irrelevant ones. Then, to build a bigger training set, classify new documents using M1 and M0. For every new document d, if d does not contain the feature word, add it automatically as a “bad” (irrelevant) document. If d contains the feature word, take a local window around this word in d (the same window size that you used to build the LLMs), and compute the perplexity of this word sequence under M0 and under M1. Assign the document to the class whose model gives the lower perplexity.
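To make the perplexity comparison concrete, here is a small sketch. It assumes each model is just a `Counter` of context-word counts and uses add-one smoothing; both choices, and the function names, are my own assumptions. Perplexity is computed as exp(-(1/n) Σ log p(word)) over the n words in the window.

```python
import math
from collections import Counter

def perplexity(window, counts, vocab_size):
    """Perplexity of `window` under an add-one-smoothed unigram model."""
    if not window:
        raise ValueError("window must contain at least one token")
    total = sum(counts.values())
    log_prob = 0.0
    for tok in window:
        # Add-one (Laplace) smoothing keeps unseen words from getting probability 0.
        log_prob += math.log((counts[tok] + 1) / (total + vocab_size))
    return math.exp(-log_prob / len(window))

def classify(window, m0, m1):
    """Return 1 (relevant) if M1 gives strictly lower perplexity, else 0 (irrelevant)."""
    vocab_size = len(set(m0) | set(m1)) + 1  # +1 slot for unseen words
    p0 = perplexity(window, m0, vocab_size)
    p1 = perplexity(window, m1, vocab_size)
    return 1 if p1 < p0 else 0

# Toy models built from made-up context words.
m1 = Counter("she studied literature in paris he speaks with a accent".split())
m0 = Counter("a long kiss with a passionate kiss in the film".split())
label = classify(["a", "kiss", "in", "the"], m0, m1)
```

Note that this sketch sends ties to the irrelevant class; the pseudo-code that follows handles near-ties more carefully with a margin delta.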

    To formalize, the pseudo-code is:

    T_0 := initial training set (relevant and irrelevant documents)
    D0 := additional data to be bootstrapped
    N := number of bootstrapping iterations
    
    for i = 0 to N-1
      T_{i+1} := empty training set
      Build M0 and M1 from T_i as discussed above, using window size w
      for d in D0
        if feature-word not in d
        then add d to irrelevant documents of T_{i+1}
        else
          compute perplexity scores P0 and P1 under M0 and M1, using
          window size w around the feature-word in d
          if P0 < P1 - delta
            add d to irrelevant documents of T_{i+1}
          else if P1 < P0 - delta
            add d to relevant documents of T_{i+1}
          else
            do not use d in T_{i+1}
          end
        end
      end
      Select a small random sample from the relevant and irrelevant documents
      in T_{i+1}, and (re)classify them manually if required.
    end
    
    • T_N is your final training set. In the bootstrapping above, the parameter delta needs to be tuned experimentally on held-out data (also called development data).
    • The manual reclassification of a small sample keeps the noise introduced during bootstrapping from accumulating over the N iterations.
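The loop above could be sketched in Python roughly as follows. This is only an illustration under several assumptions of mine: a regex tokenizer, add-one-smoothed unigram models, scoring only the first occurrence of the feature word in each document, and hypothetical function names and default values for `w` and `delta`. The manual spot-check is left as a comment because it is a human step.

```python
import math
import re
from collections import Counter

def windows(text, feature, w):
    """Context tokens (feature word excluded) around each occurrence of `feature`."""
    toks = re.findall(r"[a-z']+", text.lower())
    return [toks[max(0, i - w):i] + toks[i + 1:i + 1 + w]
            for i, t in enumerate(toks) if t == feature]

def build_model(docs, feature, w):
    counts = Counter()
    for d in docs:
        for win in windows(d, feature, w):
            counts.update(win)
    return counts

def perplexity(win, counts, vocab_size):
    total = sum(counts.values())
    logp = sum(math.log((counts[t] + 1) / (total + vocab_size)) for t in win)
    return math.exp(-logp / max(len(win), 1))

def bootstrap(relevant, irrelevant, pool, feature, n_iter=2, w=3, delta=0.5):
    """Return the (relevant, irrelevant) lists of T_N, following the pseudo-code."""
    feature = feature.lower()
    for _ in range(n_iter):
        m1 = build_model(relevant, feature, w)    # built from T_i
        m0 = build_model(irrelevant, feature, w)
        vocab_size = len(set(m0) | set(m1)) + 1   # +1 slot for unseen words
        new_rel, new_irr = [], []
        for d in pool:
            wins = windows(d, feature, w)
            if not wins:
                new_irr.append(d)  # no feature word: irrelevant
                continue
            win = wins[0]  # assumption: score only the first occurrence
            p0 = perplexity(win, m0, vocab_size)
            p1 = perplexity(win, m1, vocab_size)
            if p0 < p1 - delta:
                new_irr.append(d)
            elif p1 < p0 - delta:
                new_rel.append(d)
            # otherwise the margin is too small: leave d out of T_{i+1}
        # A manual check of a small random sample of new_rel/new_irr goes here.
        relevant, irrelevant = new_rel, new_irr
    return relevant, irrelevant

# Toy data, made up for this example.
seed_rel = ["She studied French literature in Paris.",
            "He speaks French with a slight accent."]
seed_irr = ["The film ends with a long French kiss.",
            "A passionate French kiss closed the scene."]
pool = ["They shared a French kiss in the film.",
        "I studied French literature at school.",
        "Nothing to see here."]
rel, irr = bootstrap(seed_rel, seed_irr, pool, "French")
```

Like the pseudo-code, this sketch rebuilds each T_{i+1} from scratch, so the hand-labelled seed documents only influence the first iteration. Carrying them over into every iteration would be a small change, but that is a variation on what is described above.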
© 2021 The Archive Base. All Rights Reserved