The Archive Base

Editorial Team
Asked: May 13, 2026 (2026-05-13T16:00:13+00:00)

I have a list of lists in Python:

k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]

I want to remove the duplicate elements from it. If it were a normal list rather than a list of lists, I could use set. Unfortunately, lists are not hashable, so I can't make a set of lists, only of tuples. I could convert all the lists to tuples, use set, and convert back to lists, but this isn't fast.

How can this be done in the most efficient way?

The result for the list above should be:

k = [[5, 6, 2], [1, 2], [3], [4]]

I don't care about preserving order.

Note: this question is similar but not quite what I need. I searched SO but didn't find an exact duplicate.
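
The tuple round trip described above can be sketched as follows. This is only a minimal illustration written for Python 3; the function name dedupe_via_set is hypothetical, and the result order is arbitrary because sets are unordered.

```python
def dedupe_via_set(list_of_lists):
    """Remove duplicate inner lists; the order of the result is arbitrary."""
    # Lists are unhashable, so convert each inner list to a tuple first.
    unique_tuples = set(tuple(inner) for inner in list_of_lists)
    # Convert back so callers still receive a list of lists.
    return [list(t) for t in unique_tuples]


k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]
result = dedupe_via_set(k)
# Sort only for display, since set iteration order is not guaranteed.
print(sorted(result))  # [[1, 2], [3], [4], [5, 6, 2]]
```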


Benchmarking:

import itertools, time


class Timer(object):
    def __init__(self, name=None):
        self.name = name

    def __enter__(self):
        self.tstart = time.time()

    def __exit__(self, type, value, traceback):
        if self.name:
            print '[%s]' % self.name,
        print 'Elapsed: %s' % (time.time() - self.tstart)


# The "* 5" gives the longer list; drop it to time the short one.
k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [5, 2], [6], [8], [9]] * 5
N = 100000

print len(k)

with Timer('set'):
    for i in xrange(N):
        kt = [tuple(i) for i in k]
        skt = set(kt)
        kk = [list(i) for i in skt]


with Timer('sort'):
    for i in xrange(N):
        ks = sorted(k)
        dedup = [ks[i] for i in xrange(len(ks)) if i == 0 or ks[i] != ks[i-1]]


with Timer('groupby'):
    for i in xrange(N):
        # Note: this rebinds k to its sorted copy, so the 'loop in' run below sees sorted input.
        k = sorted(k)
        dedup = list(k for k, _ in itertools.groupby(k))

with Timer('loop in'):
    for i in xrange(N):
        new_k = []
        for elem in k:
            if elem not in new_k:
                new_k.append(elem)

The “loop in” (quadratic) method is the fastest of all for short lists. For long lists it is faster than every method except groupby. Does this make sense?

For the short list (the one in the code), 100000 iterations:

[set] Elapsed: 1.3900001049
[sort] Elapsed: 0.891000032425
[groupby] Elapsed: 0.780999898911
[loop in] Elapsed: 0.578000068665

For the longer list (the one in the code duplicated 5 times):

[set] Elapsed: 3.68700003624
[sort] Elapsed: 3.43799996376
[groupby] Elapsed: 1.03099989891
[loop in] Elapsed: 1.85900020599
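
For readers on Python 3, the Timer helper from the benchmark above might be ported roughly like this. It is a sketch, not the original harness: the use of time.perf_counter, the elapsed attribute, and the reduced iteration count of 1000 are assumptions made here for illustration.

```python
import time


class Timer:
    """Context manager that records and prints the wall-clock time of its block."""

    def __init__(self, name=None):
        self.name = name
        self.elapsed = None

    def __enter__(self):
        self.tstart = time.perf_counter()
        return self  # lets callers write "with Timer(...) as t"

    def __exit__(self, exc_type, exc_value, traceback):
        self.elapsed = time.perf_counter() - self.tstart
        prefix = '[%s] ' % self.name if self.name else ''
        print('%sElapsed: %.6f' % (prefix, self.elapsed))


k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [5, 2], [6], [8], [9]]

with Timer('loop in') as t:
    for _ in range(1000):  # hypothetical iteration count, far below N above
        new_k = []
        for elem in k:
            if elem not in new_k:
                new_k.append(elem)
```

Absolute numbers from such a run are not comparable with the Python 2 figures listed above.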
1 Answer

  1. Editorial Team
    Added an answer on May 13, 2026 at 4:00 pm
    >>> k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]
    >>> import itertools
    >>> k.sort()
    >>> list(k for k,_ in itertools.groupby(k))
    [[1, 2], [3], [4], [5, 6, 2]]
    

    itertools often offers the fastest and most powerful solutions to this kind of problem, and is well worth getting intimately familiar with!-)
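
The session above can be wrapped in a small function. The following is a sketch for Python 3 with a hypothetical name, dedupe_sorted_groupby; unlike k.sort(), it uses sorted() so the caller's list is left in its original order.

```python
import itertools


def dedupe_sorted_groupby(list_of_lists):
    """Return the unique inner lists in sorted order, leaving the input untouched."""
    # sorted() returns a new list, so the argument is not reordered.
    ordered = sorted(list_of_lists)
    # groupby collapses each run of equal adjacent items into a single key.
    return [key for key, _group in itertools.groupby(ordered)]


k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]
print(dedupe_sorted_groupby(k))  # [[1, 2], [3], [4], [5, 6, 2]]
```

Sorting is what brings equal items next to each other; groupby alone only merges adjacent duplicates. In Python 3 this also requires the inner lists to be mutually comparable, so mixing, say, integers and strings would raise TypeError.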

    Edit: as I mention in a comment, normal optimization efforts are focused on large inputs (the big-O approach) because it’s so much easier that it offers good returns on efforts. But sometimes (essentially for “tragically crucial bottlenecks” in deep inner loops of code that’s pushing the boundaries of performance limits) one may need to go into much more detail, providing probability distributions, deciding which performance measures to optimize (maybe the upper bound or the 90th centile is more important than an average or median, depending on one’s apps), performing possibly-heuristic checks at the start to pick different algorithms depending on input data characteristics, and so forth.
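
To make the point about choosing a performance measure concrete, here is a rough Python 3 sketch that collects many timings of one call and reports the median, the 90th percentile, and the worst case. The helper name timing_profile and the run count are assumptions for illustration; statistics.quantiles needs Python 3.8 or later.

```python
import statistics
import time


def timing_profile(func, arg, runs=200):
    """Time func(arg) repeatedly and summarise the distribution in seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        func(arg)
        samples.append(time.perf_counter() - start)
    deciles = statistics.quantiles(samples, n=10)  # nine cut points
    return {
        'median': statistics.median(samples),
        'p90': deciles[8],  # the ninth cut point is the 90th percentile
        'worst': max(samples),
    }


def dedupe_via_set(items):
    # One of the candidate approaches, used here only as something to measure.
    return [list(t) for t in set(map(tuple, items))]


k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]
profile = timing_profile(dedupe_via_set, k)
for label, seconds in profile.items():
    print('%-6s %.2f usec' % (label, seconds * 1e6))
```

Which of the three figures matters depends on the application, as the paragraph above says.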

    Careful measurements of “point” performance (code A vs code B for a specific input) are a part of this extremely costly process, and standard library module timeit helps here. However, it’s easier to use it at a shell prompt. For example, here’s a short module to showcase the general approach for this problem; save it as nodup.py:

    import itertools
    
    k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]
    
    def doset(k, map=map, list=list, set=set, tuple=tuple):
      return map(list, set(map(tuple, k)))
    
    def dosort(k, sorted=sorted, xrange=xrange, len=len):
      ks = sorted(k)
      return [ks[i] for i in xrange(len(ks)) if i == 0 or ks[i] != ks[i-1]]
    
    def dogroupby(k, sorted=sorted, groupby=itertools.groupby):
      ks = sorted(k)
      # use the hoisted local name rather than the itertools attribute lookup
      return [i for i, _ in groupby(ks)]
    
    def donewk(k):
      newk = []
      for i in k:
        if i not in newk:
          newk.append(i)
      return newk
    
    # sanity check that all functions compute the same result and don't alter k
    if __name__ == '__main__':
      savek = list(k)
      for f in doset, dosort, dogroupby, donewk:
        resk = f(k)
        assert k == savek
        print '%10s %s' % (f.__name__, sorted(resk))
    

    Note the sanity check (performed when you just do python nodup.py) and the basic hoisting technique (make constant global names local to each function for speed) to put things on equal footing.
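
The hoisting technique relies on default arguments being evaluated once, when the function is defined, and then behaving as ordinary local variables. A rough way to observe this in Python 3 is sketched below; the function names are hypothetical, and the measured difference depends on the interpreter version and may be negligible on recent releases.

```python
import timeit


def count_with_global_lookup(items):
    total = 0
    for item in items:
        total += len(item)  # len is looked up in globals, then builtins, each pass
    return total


def count_with_hoisted_name(items, len=len):
    total = 0
    for item in items:
        total += len(item)  # len is a local here, bound at definition time
    return total


data = [[1, 2], [4], [5, 6, 2]] * 100

for func in (count_with_global_lookup, count_with_hoisted_name):
    seconds = timeit.timeit(lambda: func(data), number=1000)
    print('%-26s %.4f s' % (func.__name__, seconds))
```

Both functions return the same total; only the name lookup differs.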

    Now we can run checks on the tiny example list:

    $ python -mtimeit -s'import nodup' 'nodup.doset(nodup.k)'
    100000 loops, best of 3: 11.7 usec per loop
    $ python -mtimeit -s'import nodup' 'nodup.dosort(nodup.k)'
    100000 loops, best of 3: 9.68 usec per loop
    $ python -mtimeit -s'import nodup' 'nodup.dogroupby(nodup.k)'
    100000 loops, best of 3: 8.74 usec per loop
    $ python -mtimeit -s'import nodup' 'nodup.donewk(nodup.k)'
    100000 loops, best of 3: 4.44 usec per loop
    

    confirming that the quadratic approach has small-enough constants to make it attractive for tiny lists with few duplicated values. With a short list without duplicates:

    $ python -mtimeit -s'import nodup' 'nodup.donewk([[i] for i in range(12)])'
    10000 loops, best of 3: 25.4 usec per loop
    $ python -mtimeit -s'import nodup' 'nodup.dogroupby([[i] for i in range(12)])'
    10000 loops, best of 3: 23.7 usec per loop
    $ python -mtimeit -s'import nodup' 'nodup.doset([[i] for i in range(12)])'
    10000 loops, best of 3: 31.3 usec per loop
    $ python -mtimeit -s'import nodup' 'nodup.dosort([[i] for i in range(12)])'
    10000 loops, best of 3: 25 usec per loop
    

    the quadratic approach isn’t bad, but the sort and groupby ones are better. Etc, etc.

    If (as the obsession with performance suggests) this operation is at a core inner loop of your pushing-the-boundaries application, it’s worth trying the same set of tests on other representative input samples, possibly detecting some simple measure that could heuristically let you pick one or the other approach (but the measure must be fast, of course).
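
One possible shape for such a heuristic, sketched in Python 3: use the input length as the cheap measure and switch between the quadratic loop and sort plus groupby. The function names and the threshold of 16 are assumptions, not measured values; a real cutoff would have to come from timings on representative data.

```python
import itertools

# Hypothetical cutoff; pick the real value from measurements on your own inputs.
SMALL_INPUT_THRESHOLD = 16


def dedupe_quadratic(items):
    unique = []
    for item in items:
        if item not in unique:  # linear scan, so O(n^2) overall
            unique.append(item)
    return unique


def dedupe_groupby(items):
    # O(n log n) because of the sort; groupby then merges adjacent equal items.
    return [key for key, _group in itertools.groupby(sorted(items))]


def dedupe_adaptive(items, threshold=SMALL_INPUT_THRESHOLD):
    # len() is O(1) on a list, so the check itself costs almost nothing.
    if len(items) <= threshold:
        return dedupe_quadratic(items)
    return dedupe_groupby(items)


print(dedupe_adaptive([[1, 2], [4], [1, 2]]))  # [[1, 2], [4]]
```

The two branches return their results in different orders (first-seen versus sorted), which is acceptable here only because the question does not require a particular order.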

    It’s also well worth considering keeping a different representation for k: why does it have to be a list of lists rather than a set of tuples in the first place? If the duplicate removal task is frequent, and profiling shows it to be the program’s performance bottleneck, keeping a set of tuples all the time and getting a list of lists from it only if and where needed, might be faster overall, for example.
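
A minimal Python 3 sketch of that alternative representation, assuming the inner lists can be stored as tuples; the helper names add_row and as_list_of_lists are hypothetical.

```python
# Keep the collection as a set of tuples so duplicates never accumulate.
unique_rows = set()


def add_row(rows, values):
    # Adding an existing tuple is a no-op, so no separate dedup pass is needed.
    rows.add(tuple(values))


def as_list_of_lists(rows):
    # Convert only at the point where a list of lists is actually required.
    return [list(row) for row in rows]


for values in [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]:
    add_row(unique_rows, values)

print(len(unique_rows))  # 4
print(sorted(as_list_of_lists(unique_rows)))  # [[1, 2], [3], [4], [5, 6, 2]]
```

Whether this is faster overall depends on how often the list-of-lists form is needed, which is what profiling would have to show.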
