I have a large set of data which I access via a generator/iterator. While processing the dataset, I need to determine whether any other record in that dataset has an attribute with the same value as an attribute of the record currently being processed. One way to do this would be with a nested for loop. For example, if we were processing a database of students, I could do something like:
    def fillStudentList():
        # TODO: Add some code here to fill
        # a student list
        return []  # placeholder so the loops below can run

    students1 = fillStudentList()
    sameLastNames = list()
    for student1 in students1:
        # Re-create the inner collection on every outer pass
        students2 = fillStudentList()
        for student2 in students2:
            if student1.lastName == student2.lastName:
                sameLastNames.append((student1, student2))
Granted, the code snippet above could be improved quite a bit. The goal of the snippet is to show the nested for loop pattern.
Now let's say that we have a class Student, a class Students (which is an iterator), and a class Source which provides access to the data in a memory-efficient way (say, another iterator of sorts)...
Below, I have sketched out what this code might look like. Does anyone have ideas on how to improve this implementation? The goal is to be able to find records in very large datasets that share the same attribute values, so that the filtered set can then be processed.
    #!/usr/bin/python
    from itertools import ifilter

    class Student(object):
        """
        A class that represents the first name, last name, and
        grade of a student.
        """
        def __init__(self, firstName, lastName, grade='K'):
            """
            Initializes a Student object
            """
            self.firstName = firstName
            self.lastName = lastName
            self.grade = grade

    class Students(object):
        """
        An iterator for a collection of students
        """
        def __init__(self, source):
            """
            Store the source and get an initial iterator from it
            """
            self._source = source
            self._source_iter = source.get_iter()
            self._reset = False

        def __iter__(self):
            return self

        def next(self):
            try:
                # After exhaustion, start over with a fresh iterator
                if self._reset:
                    self._source_iter = self._source.get_iter()
                    self._reset = False
                return self._source_iter.next()
            except StopIteration:
                self._reset = True
                raise StopIteration

        def select(self, attr, val):
            """
            Return all of the Students with a given
            attribute
            """
            #select_iter = self._source.get_iter()
            select_iter = self._source.filter(attr, val)
            for selection in select_iter:
                # if (getattr(selection, attr) == val):
                #     yield selection
                yield selection

    class Source(object):
        """
        A source of data that can provide an iterator to
        all of the data or provide an iterator to the
        data based on some attribute
        """
        def __init__(self, data):
            self._data = data

        def get_iter(self):
            """
            Return an iterator to the data
            """
            return iter(self._data)

        def filter(self, attr, val):
            """
            Return an iterator to the data filtered by some
            attribute
            """
            return ifilter(lambda rec: getattr(rec, attr) == val, self._data)

    def test_it():
        """
        Print each student, followed by any other students
        who share the same first name
        """
        studentList = [Student("James","Smith","6"),
                       Student("Jill","Jones","6"),
                       Student("Bill","Deep","5"),
                       Student("Bill","Sun","5")]
        source = Source(studentList)
        students = Students(source)
        for student in students:
            print student.firstName
            for same_names in students.select('firstName', student.firstName):
                # Skip the record being processed (same last name)
                if same_names.lastName == student.lastName:
                    continue
                else:
                    print " %s %s in grade %s has your same first name" % \
                        (same_names.firstName, same_names.lastName, same_names.grade)

    if __name__ == '__main__':
        test_it()
Nested loops are O(n**2). You can instead use a sort and itertools.groupby for O(n log n) performance.

In general, you appear to be trying to do what an ORM backed by a database does. Instead of doing it yourself, use one of the many ORMs already out there. See "What are some good Python ORM solutions?" for a list. They will be both more optimized and more powerful than something you would code yourself.
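
Going back to the sort-and-groupby suggestion, a minimal sketch might look like the following. The Student class here is just a stand-in mirroring the one in the question, and group_by_attr is a hypothetical helper name, not something from the original post. Note the assumption that the records fit in memory: sorted() has to read the whole iterator before groupby can do anything, so for data that is genuinely too large for that, a database is likely the better route.

```python
from itertools import groupby
from operator import attrgetter


class Student(object):
    """Minimal stand-in for the Student class from the question."""

    def __init__(self, firstName, lastName, grade='K'):
        self.firstName = firstName
        self.lastName = lastName
        self.grade = grade


def group_by_attr(records, attr):
    """Hypothetical helper: return lists of records that share a value
    for the attribute `attr`. Groups of size one are dropped."""
    key = attrgetter(attr)
    # sorted() consumes the whole iterable and holds it in memory;
    # groupby() only merges *adjacent* equal keys, hence the sort.
    ordered = sorted(records, key=key)
    groups = []
    for _, members in groupby(ordered, key=key):
        members = list(members)
        if len(members) > 1:
            groups.append(members)
    return groups


students = [Student("James", "Smith", "6"),
            Student("Jill", "Jones", "6"),
            Student("Bill", "Deep", "5"),
            Student("Bill", "Sun", "5")]

# Any iterable works as input, including a generator or iterator
same_first_names = group_by_attr(iter(students), 'firstName')
for group in same_first_names:
    print(", ".join("%s %s" % (s.firstName, s.lastName) for s in group))
```

With the four sample students, only the two Bills end up in a group. Each record is visited once after the sort, so the nested re-scan of the data for every student goes away.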