The Archive Base

Editorial Team
Asked: May 13, 2026 (2026-05-13T06:58:54+00:00)

I’m iterating through 3 words, each about 5 million characters long, and I want to find the 20-character sequences that identify each word. That is, I want to find all sequences of length 20 in one word that are unique to that word. My problem is that the code I’ve written takes an extremely long time to run: even running overnight, it has never finished a single word.

The function below takes a list of dictionaries, where each dictionary holds every possible 20-character sequence, together with its location, from one of the 5-million-character words.

If anybody has an idea how to optimize this I would be really thankful; I don’t have a clue how to continue…

Here’s a sample of my code:

def findUnique(list):
    # Takes a list of dictionaries, compares each element in each dictionary
    # with the others, puts all unique elements in new dictionaries, and finally
    # puts the new dictionaries in a list.
    # The result is a list with (in this case) 3 dictionaries containing all unique
    # sequences and their locations from each string.
    dicList=[]
    listlength=len(list)
    s=0
    valuelist=[]
    for i in list:
        j=i.values()
        valuelist.append(j)
    while s<listlength:
        currdic=list[s]
        dic={}
        for key in currdic:
            currval=currdic[key]
            test=True
            n=0
            while n<listlength:
                if n!=s:
                    if currval in valuelist[n]: # this is where it takes too much time
                        n=listlength
                        test=False
                    else:
                        n+=1
                else:
                    n+=1
            if test:
                dic[key]=currval
        dicList.append(dic)
        s+=1
    return dicList
1 Answer

  1. Editorial Team
    Added an answer on May 13, 2026 at 6:58 am
    def slices(seq, length, prefer_last=False):
      unique = {}
      if prefer_last: # this doesn't have to be a parameter, just choose one
        for start in xrange(len(seq) - length + 1):
          unique[seq[start:start+length]] = start
      else: # prefer first
        for start in xrange(len(seq) - length, -1, -1):
          unique[seq[start:start+length]] = start
      return unique
    
    # or find all locations for each slice:
    import collections
    def slices(seq, length):
      unique = collections.defaultdict(list)
      for start in xrange(len(seq) - length + 1):
        unique[seq[start:start+length]].append(start)
      return unique
    

    This function (currently in my iter_util module) is O(n), where n is the length of each word and the slice length is treated as a constant. Use set(slices(..)) with set operations such as difference to get the slices unique across all words (example below). If you don’t need to track locations, you could write the function to return a set instead. Memory usage will be high (still O(n), but with a large constant factor). A special “lazy slice” class that stores the base sequence (the string) plus start and stop (or start and length) could mitigate that, though not by much when the length is only 20.
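As a rough illustration of the set-returning variant mentioned above, here is a minimal sketch written for Python 3 (hence `range` rather than the `xrange` used in this answer). The names `slice_set` and `unique_to_first` are made up for this example and are not part of iter_util.

```python
def slice_set(seq, length):
    """Return the set of distinct substrings of seq that are `length` long."""
    # range() is empty when seq is shorter than length, giving an empty set.
    return {seq[start:start + length] for start in range(len(seq) - length + 1)}


def unique_to_first(target, others, length):
    """Return the slices of `target` that occur in none of the `others`."""
    unique = slice_set(target, length)
    for other in others:
        # Subtract one word at a time, so only two slice sets exist at once.
        unique -= slice_set(other, length)
    return unique


print(unique_to_first("aab", ["abb", "abc"], 2))
```

Building the other words' sets one at a time is a design choice made here to keep fewer sets alive at once; it recomputes those sets if you repeat the call for every word.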

    Printing unique slices:

    a = set(slices("aab", 2)) # {"aa", "ab"}
    b = set(slices("abb", 2)) # {"ab", "bb"}
    c = set(slices("abc", 2)) # {"ab", "bc"}
    all = [a, b, c]
    import operator
    a_unique = reduce(operator.sub, (x for x in all if x is not a), a)
    print a_unique # {"aa"}
    

    Including locations:

    a = slices("aab", 2)
    b = slices("abb", 2)
    c = slices("abc", 2)
    all = [a, b, c]
    import operator
    a_unique = reduce(operator.sub, (set(x) for x in all if x is not a), set(a))
    # a_unique is only the keys so far
    a_unique = dict((k, a[k]) for k in a_unique)
    # now it's a dict of slice -> location(s)
    print a_unique # {"aa": 0} or {"aa": [0]}
                   # (depending on which slices function used)
    

    In a test script closer to your conditions, using randomly generated words of 5 million characters and a slice length of 20, memory usage was so high that the script quickly hit my 1 GB main memory limit and started thrashing virtual memory. At that point Python spent very little time on the CPU and I killed it. After I reduced either the slice length or the word length so that everything fit within main memory, it ran in under a minute. (Completely random words have few duplicate slices, which increases memory use.) Memory pressure like this, combined with the O(n**2) behavior of your original code, will take forever, which is why algorithmic time and space complexity are both important.
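The quadratic behavior comes from the `currval in valuelist[n]` test in the question, which scans a whole list of values for every entry. The sketch below (Python 3) keeps the question's data layout but tests membership against sets instead. It assumes each dictionary maps a location to its sequence, which is what the `in` test on the values suggests; `find_unique` is a hypothetical rewrite, not the asker's code.

```python
def find_unique(dicts):
    """Return one dict per input dict, keeping only the entries whose value
    appears in none of the other dicts."""
    # Build each value set once. Set membership is O(1) on average,
    # whereas `x in some_list` scans the list.
    value_sets = [set(d.values()) for d in dicts]
    result = []
    for i, current in enumerate(dicts):
        # The value sets of every dict except the current one.
        others = [s for j, s in enumerate(value_sets) if j != i]
        result.append({key: value for key, value in current.items()
                       if not any(value in s for s in others)})
    return result


sample = [{0: "aa", 1: "ab"}, {0: "ab", 1: "bb"}, {0: "ab", 1: "bc"}]
print(find_unique(sample))
```

This removes the linear scan but not the memory cost described above, since every sequence is still held in both a dictionary and a set.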

    import operator
    import random
    import string
    
    def slices(seq, length):
      unique = {}
      for start in xrange(len(seq) - length, -1, -1):
        unique[seq[start:start+length]] = start
      return unique
    
    def sample_with_repeat(population, length, choice=random.choice):
      return "".join(choice(population) for _ in xrange(length))
    
    word_length = 5*1000*1000
    words = [sample_with_repeat(string.lowercase, word_length) for _ in xrange(3)]
    slice_length = 20
    words_slices_sets = [set(slices(x, slice_length)) for x in words]
    unique_words_slices = [reduce(operator.sub,
                                  (x for x in words_slices_sets if x is not n),
                                  n)
                           for n in words_slices_sets]
    print [len(x) for x in unique_words_slices]
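One possible way to hold less in memory than the script above is to build a single dictionary over all words, mapping each slice to the index of the only word that contains it, or to a sentinel once a second word contains it. This is a sketch in Python 3 under that assumption, not something measured on 5-million-character inputs; `SHARED`, `owners`, and `count_unique` are names invented for the example.

```python
SHARED = -1  # sentinel: the slice occurs in more than one word


def owners(words, length):
    """Map each slice to the index of the single word containing it,
    or to SHARED if two or more words contain it."""
    owner = {}
    for index, word in enumerate(words):
        for start in range(len(word) - length + 1):
            piece = word[start:start + length]
            # setdefault stores `index` the first time a slice is seen
            # and otherwise returns the value recorded earlier.
            previous = owner.setdefault(piece, index)
            if previous != index:
                owner[piece] = SHARED
    return owner


def count_unique(words, length):
    """Count, for each word, the slices that occur in that word only."""
    counts = [0] * len(words)
    for index in owners(words, length).values():
        if index != SHARED:
            counts[index] += 1
    return counts


print(count_unique(["aab", "abb", "abc"], 2))
```

The dictionary still has one entry per distinct slice across all words, so this reduces the number of containers rather than the order of growth.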
    
© 2021 The Archive Base. All Rights Reserved