I would like to loop through a big two-dimensional list:
authors = [["Bob", "Lisa"], ["Alice", "Bob"], ["Molly", "Jim"], ... ]
and get a list that contains all the names that occur in authors.
When I loop through the list, I need a container to store the names I’ve already seen. I’m wondering whether I should use a list or a dict:
With a list:
seen = []
for author_list in authors:
    for author in author_list:
        if author not in seen:
            seen.append(author)
result = seen
With a dict:
seen = {}
for author_list in authors:
    for author in author_list:
        if author not in seen:
            seen[author] = True
result = seen.keys()
Which one is faster? Or are there better solutions?
You really want a set. Sets are faster than lists here because they are implemented as hash tables, which works because their elements are unique and hashable. Hash tables allow membership testing (if element in my_set) in O(1) time on average. This contrasts with lists, where the only way to check whether an element is in the list is to check every element in turn (in O(n) time).
A dict is similar to a set in that both allow unique keys only, and both are implemented as hash tables. They both allow O(1) membership testing. The difference is that a set only has keys, while a dict has both keys and values (which is extra overhead you don’t need in this application).
Using a set, you can replace the nested for loop with itertools.chain(), which flattens the 2D list into a single 1D iterable. An even shorter one-liner is also possible.
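The answer's code did not survive here, so the following is only a minimal sketch of what the two variants might look like. It assumes the sample data from the question (with the elided entries left out), and it assumes the shorter form is `set().union()`; the names `result` and `result_short` are illustrative.

```python
import itertools

# Sample data from the question; the elided entries are omitted.
authors = [["Bob", "Lisa"], ["Alice", "Bob"], ["Molly", "Jim"]]

# Flatten the nested lists with itertools.chain() and deduplicate with a set.
result = set(itertools.chain(*authors))

# One possible shorter form (an assumption): union an empty set
# with every inner list at once.
result_short = set().union(*authors)

print(result == result_short)  # True
```

Both forms build a set, so the order of the names in the result is not guaranteed.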
Edit (thanks, @jamylak): a more memory-efficient version for large lists:
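The code for this edit is missing as well. A plausible reading, offered here as an assumption rather than as the original snippet, is `itertools.chain.from_iterable()`, which walks the outer list lazily instead of unpacking it into call arguments with `*`.

```python
import itertools

# Sample data from the question; the elided entries are omitted.
authors = [["Bob", "Lisa"], ["Alice", "Bob"], ["Molly", "Jim"]]

# chain.from_iterable() receives the outer list itself and iterates it lazily,
# so no argument tuple is built from the whole outer list first.
unique_authors = set(itertools.chain.from_iterable(authors))

# It also accepts a generator, so the outer collection never has to be
# materialised as a list at all.
unique_from_gen = set(itertools.chain.from_iterable(row for row in authors))
```

The saving only matters when the outer list is very long; for small inputs the two spellings behave the same.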
Example on a list of lists:
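The original example is not preserved, so here is a small stand-in. The list `nested` and its values are hypothetical, chosen only to show duplicates within and across the inner lists being collapsed.

```python
import itertools

# Hypothetical input: duplicates appear both inside and across inner lists.
nested = [[1, 2, 3], [3, 5, 6], [7, 7]]

unique_values = set(itertools.chain.from_iterable(nested))

# Sets are unordered, so sort before printing to get a stable display.
print(sorted(unique_values))  # [1, 2, 3, 5, 6, 7]
```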
P.S.: If, instead of finding all the unique authors, you want to count the number of times you see each author, use a collections.Counter, a special kind of dictionary optimised for counting things. Here’s an example of counting characters in a string:
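The original snippet is missing, so this is a short sketch in its place. The string "abracadabra" is an arbitrary choice, and the second half applies the same idea to the question's sample data (elided entries omitted).

```python
from collections import Counter
import itertools

# Count the characters in a string.
char_counts = Counter("abracadabra")
print(char_counts["a"])            # 5
print(char_counts.most_common(1))  # [('a', 5)]

# The same approach counts how often each author appears.
authors = [["Bob", "Lisa"], ["Alice", "Bob"], ["Molly", "Jim"]]
author_counts = Counter(itertools.chain.from_iterable(authors))
print(author_counts["Bob"])        # 2
```

A Counter returns 0 for keys it has never seen, so looking up an unknown author does not raise a KeyError.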