The Archive Base

Editorial Team
Asked: May 26, 2026 (2026-05-26T16:12:25+00:00)

This came up in another question but I figured it is best to ask

This came up in another question, but I figured it is best to ask it as a separate question. Given a large list of sentences (on the order of hundreds of thousands):

[
"This is sentence 1 as an example",
"This is sentence 1 as another example",
"This is sentence 2",
"This is sentence 3 as another example ",
"This is sentence 4"
]

What is the best way to code the following function?

def GetSentences(word1, word2, position):
    return ""

Given two words, word1 and word2, and a position, the function should return all sentences in which word2 appears exactly position words after word1. For example:

GetSentences("sentence", "another", 3)

should return 1 and 3, the (zero-based) indices of the matching sentences. My current approach uses a nested dictionary like this:

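from collections import defaultdict

# Index[word][word2][distance] -> list of sentence indices
Index = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))

for sentenceIndex, sentence in enumerate(sentences):
    words = sentence.split()
    for index, word in enumerate(words):
        # Pair each word with every later word; i + 1 is the distance between them.
        for i, word2 in enumerate(words[index + 1:]):
            Index[word][word2][i + 1].append(sentenceIndex)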

But this quickly gets out of hand: on a dataset of about 130 MB, my 48 GB of RAM is exhausted in less than 5 minutes. I have a feeling this is a common problem, but I can't find any references on how to solve it efficiently. Any suggestions on how to approach this?
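To see why memory runs out, it can help to count how many entries such a pairwise index stores. The following is only a rough sketch, assuming whitespace tokenization and that each word is paired with every later word in its sentence; `count_index_entries` is a hypothetical helper name, not part of the code above. A sentence of n words then contributes n(n-1)/2 stored sentence indices, each of which also carries the overhead of the nested dictionaries and lists around it:

```python
def count_index_entries(sentences):
    """Count the sentence indices a pairwise index would store (hypothetical helper)."""
    total = 0
    for sentence in sentences:
        n = len(sentence.split())
        # One stored entry per (earlier word, later word) pair in the sentence.
        total += n * (n - 1) // 2
    return total


sentences = [
    "This is sentence 1 as an example",
    "This is sentence 1 as another example",
    "This is sentence 2",
    "This is sentence 3 as another example ",
    "This is sentence 4",
]

print(count_index_entries(sentences))
```

The count grows quadratically with sentence length, so long sentences dominate the memory use.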

1 Answer

  1. Editorial Team
    Added an answer on May 26, 2026 at 4:12 pm (2026-05-26T16:12:26+00:00)

    Use a database to store the values.

    1. First, add all the sentences to one table (each should have an ID). You might call it, e.g., sentences.
    2. Second, create a table of the distinct words contained in the sentences (call it, e.g., words, and give each word an ID). Store the connection between records of the sentences table and records of the words table in a separate table (call it, e.g., sentences_words; it should have two columns, preferably word_id and sentence_id).
    3. Searching for sentences that contain all the given words then becomes simpler:

      1. First, find the records in the words table whose words are exactly the ones you are searching for. The query could look like this:

        SELECT `id` FROM `words` WHERE `word` IN ('word1', 'word2', 'word3');
        
      2. Second, find the sentence_id values in the sentences_words table that have the required word_id values (corresponding to the words from the words table). The initial query could look like this:

        SELECT `sentence_id`, `word_id` FROM `sentences_words`
        WHERE `word_id` IN ([here goes list of words' ids]);
        

        which can be combined into a single query:

        SELECT `sentence_id`, `word_id` FROM `sentences_words`
        WHERE `word_id` IN (
            SELECT `id` FROM `words` WHERE `word` IN ('word1', 'word2', 'word3')
        );
        
      3. Filter the result in Python to return only the sentence_id values that have all the required word_id values.

    This is basically a solution based on storing a large amount of data in the form best suited for it: a database.
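The three steps above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration under stated assumptions, not a definitive implementation: the answer does not name a DBMS, so SQLite, the in-memory database, the index name `idx_sw_word`, and the helper `sentences_with_all` are all choices made here for the example. Table and column names follow the answer.

```python
import sqlite3
from collections import defaultdict

sentences = [
    "This is sentence 1 as an example",
    "This is sentence 1 as another example",
    "This is sentence 2",
    "This is sentence 3 as another example ",
    "This is sentence 4",
]

# Assumption: SQLite in memory; any relational DBMS would do.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sentences (id INTEGER PRIMARY KEY, text TEXT NOT NULL);
    CREATE TABLE words (id INTEGER PRIMARY KEY, word TEXT NOT NULL UNIQUE);
    CREATE TABLE sentences_words (
        sentence_id INTEGER NOT NULL REFERENCES sentences(id),
        word_id INTEGER NOT NULL REFERENCES words(id)
    );
    CREATE INDEX idx_sw_word ON sentences_words (word_id);
""")

# Steps 1 and 2: fill the three tables.
for sentence_id, text in enumerate(sentences):
    conn.execute("INSERT INTO sentences (id, text) VALUES (?, ?)",
                 (sentence_id, text))
    for word in set(text.split()):
        conn.execute("INSERT OR IGNORE INTO words (word) VALUES (?)", (word,))
        (word_id,) = conn.execute(
            "SELECT id FROM words WHERE word = ?", (word,)).fetchone()
        conn.execute(
            "INSERT INTO sentences_words (sentence_id, word_id) VALUES (?, ?)",
            (sentence_id, word_id))


def sentences_with_all(conn, wanted):
    """Return ids of sentences containing every word in `wanted` (hypothetical helper)."""
    wanted = set(wanted)
    placeholders = ", ".join("?" for _ in wanted)
    # Step 3.2: the combined query from the answer.
    rows = conn.execute(
        "SELECT sentence_id, word_id FROM sentences_words "
        f"WHERE word_id IN (SELECT id FROM words WHERE word IN ({placeholders}))",
        tuple(wanted),
    )
    # Step 3.3: keep only sentences that matched every requested word.
    found = defaultdict(set)
    for sentence_id, word_id in rows:
        found[sentence_id].add(word_id)
    return sorted(sid for sid, ids in found.items() if len(ids) == len(wanted))


print(sentences_with_all(conn, ["sentence", "another"]))  # [1, 3]
```

Only the rows for the requested words are ever loaded into Python, so memory use depends on how common those words are rather than on the number of word pairs in the whole dataset.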

    EDIT:

    1. If you only ever search for two words, you can do even more (almost everything) on the DBMS side.
    2. Since you also need the position difference, store the position of each word in a third column of the sentences_words table (let's call it position). When searching for the two words, calculate the difference between the position values associated with them.
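A possible way to push the two-word, fixed-distance search onto the DBMS side is a self-join on sentences_words. The sketch below again assumes SQLite via Python's sqlite3 module; the function name `get_sentences`, the index name, and the reading of the position argument as "word2 occurs exactly that many words after word1" are assumptions made for illustration.

```python
import sqlite3

sentences = [
    "This is sentence 1 as an example",
    "This is sentence 1 as another example",
    "This is sentence 2",
    "This is sentence 3 as another example ",
    "This is sentence 4",
]

conn = sqlite3.connect(":memory:")  # assumption: SQLite, in memory
conn.executescript("""
    CREATE TABLE words (id INTEGER PRIMARY KEY, word TEXT NOT NULL UNIQUE);
    CREATE TABLE sentences_words (
        sentence_id INTEGER NOT NULL,
        word_id INTEGER NOT NULL REFERENCES words(id),
        position INTEGER NOT NULL
    );
    CREATE INDEX idx_sw_lookup
        ON sentences_words (word_id, sentence_id, position);
""")

# One row per word occurrence, with its zero-based position in the sentence.
for sentence_id, text in enumerate(sentences):
    for position, word in enumerate(text.split()):
        conn.execute("INSERT OR IGNORE INTO words (word) VALUES (?)", (word,))
        (word_id,) = conn.execute(
            "SELECT id FROM words WHERE word = ?", (word,)).fetchone()
        conn.execute(
            "INSERT INTO sentences_words (sentence_id, word_id, position) "
            "VALUES (?, ?, ?)",
            (sentence_id, word_id, position))


def get_sentences(conn, word1, word2, distance):
    """Ids of sentences where word2 occurs `distance` words after word1."""
    rows = conn.execute(
        """
        SELECT DISTINCT a.sentence_id
        FROM sentences_words AS a
        JOIN sentences_words AS b
          ON b.sentence_id = a.sentence_id
         AND b.position - a.position = ?
        WHERE a.word_id = (SELECT id FROM words WHERE word = ?)
          AND b.word_id = (SELECT id FROM words WHERE word = ?)
        ORDER BY a.sentence_id
        """,
        (distance, word1, word2),
    )
    return [row[0] for row in rows]


print(get_sentences(conn, "sentence", "another", 3))  # [1, 3]
```

Storing one row per word occurrence keeps the table size linear in the total number of words, and the distance is computed only for the two words actually queried.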