I have a large set of data which I access via a generator/iterator. While

Question

Asked: June 1, 2026 (2026-06-01T14:03:01+00:00)

I have a large set of data which I access via a generator/iterator. While processing the dataset, I need to determine whether any record in that dataset has an attribute with the same value as an attribute of the record currently being processed. One way to do this would be with a nested for loop. For example, if we were processing a database of students, I could do something like:

def fillStudentList():
    # TODO: Add some code here that builds and
    # returns an iterable of students
    return []

students = fillStudentList()
sameLastNames = list()
for student1 in students:
    # Re-create the inner iterable on every pass, since a
    # generator can only be consumed once
    students2 = fillStudentList()
    for student2 in students2:
        if student1.lastName == student2.lastName:
            sameLastNames.append((student1, student2))

Granted, the code snippet above could be improved quite a bit. Its only goal is to show the nested for loop pattern.

Now let’s say that we have a class Student, a class Students (which is an iterator), and a class Source that provides access to the data in a memory-efficient way (say, another iterator of sorts).

Below, I have sketched out what this code might look like. Does anyone have ideas on how to improve this implementation? The goal is to be able to find records in very large datasets that share the same attribute values, so that the filtered set can then be processed.

#!/usr/bin/python

from itertools import ifilter

class Student(object):
    """
    A class that represents the first name, last name, and
    grade of a student.
    """
    def __init__(self, firstName, lastName, grade='K'):
        """
        Initializes a Student object
        """
        self.firstName = firstName
        self.lastName = lastName
        self.grade = grade

class Students(object):
    """
    An iterator for a collection of students
    """
    def __init__(self, source):
        """
        Stores the source and obtains an initial iterator from it
        """
        self._source = source
        self._source_iter = source.get_iter()
        self._reset = False

    def __iter__(self):
        return self

    def next(self):
        try:
            if self._reset:
                self._source_iter = self._source.get_iter()
                self._reset = False
            return self._source_iter.next()
        except StopIteration:
            self._reset = True
            raise StopIteration

    def select(self, attr, val):
        """
        Return all of the Students with a given
        attribute
        """
        #select_iter = self._source.get_iter()
        select_iter = self._source.filter(attr, val)
        for selection in select_iter:
            # if (getattr(selection, attr) == val):
            #    yield selection
            yield selection

class Source(object):
    """
    A source of data that can provide an iterator to 
    all of the data or provide an iterator to the
    data based on some attribute
    """
    def __init__(self, data):
        self._data = data

    def get_iter(self):
        """
        Return an iterator to the data
        """
        return iter(self._data)

    def filter(self, attr, val):
        """
        Return an iterator to the data filtered by some
        attribute
        """
        return ifilter(lambda rec: getattr(rec, attr) == val, self._data)

def test_it():
    """
    Prints each student, followed by any other students
    who share the same first name
    """
    studentList = [Student("James","Smith","6"),
                   Student("Jill","Jones","6"),
                   Student("Bill","Deep","5"),
                   Student("Bill","Sun","5")]
    source = Source(studentList)
    students = Students(source)
    for student in students:
        print student.firstName

        for same_names in students.select('firstName', student.firstName):
            if same_names.lastName == student.lastName:
                continue
            else:
                print " %s %s in grade %s has your same first name" % \
                (same_names.firstName, same_names.lastName, same_names.grade)

if __name__ == '__main__':
    test_it()

1 Answer

Editorial Team · Answer 1 · 2026-06-01T14:03:02+00:00

Nested loops are O(n**2). You can instead sort the records and use itertools.groupby for O(n log n) performance:

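import operator
from itertools import groupby
key = operator.attrgetter('lastName')  # attribute name as defined on Student
students = fill_student_list()
same_last_names = [list(group) for lastname, group in
                   groupby(sorted(students, key=key), key=key)]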
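As a concrete illustration of the sort-and-group approach, here is a minimal sketch on a small in-memory list. The `Student` namedtuple, the `groups_sharing` helper, and the sample records are stand-ins invented for this example, not part of the question's data source. Groups containing a single student are dropped so that only shared values remain. Keep in mind that `sorted` materializes the whole dataset in memory, which may matter for very large inputs.

```python
from collections import namedtuple
from itertools import groupby
from operator import attrgetter

# Hypothetical stand-in for the question's Student class.
Student = namedtuple("Student", ["firstName", "lastName", "grade"])


def groups_sharing(records, attr):
    """Return lists of records that share the same value of `attr`."""
    key = attrgetter(attr)
    # groupby only merges adjacent items, so sort by the same key first.
    ordered = sorted(records, key=key)
    groups = (list(group) for _, group in groupby(ordered, key=key))
    # Keep only the values that occur more than once.
    return [group for group in groups if len(group) > 1]


# Sample data (assumed for illustration).
students = [
    Student("James", "Smith", "6"),
    Student("Jill", "Jones", "6"),
    Student("Bill", "Deep", "5"),
    Student("Bill", "Sun", "5"),
    Student("Anna", "Smith", "4"),
]

same_last_names = groups_sharing(students, "lastName")
same_first_names = groups_sharing(students, "firstName")
```

Sorting and grouping by the same key function is what makes this work: `groupby` does not collect equal keys that are not next to each other.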

In general, you appear to be trying to do what an ORM backed by a database does. Instead of doing it yourself, use one of the many ORMs already out there. See What are some good Python ORM solutions? for a list. They will be both more optimized and more powerful than something you would code yourself.
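To illustrate the idea of delegating the matching to a database, here is a minimal sketch that uses Python's built-in `sqlite3` module directly rather than an ORM (a deliberate simplification). The table name, column names, index name, and sample rows are assumptions chosen for illustration; an ORM would typically generate a comparable query for you.

```python
import sqlite3

# In-memory database; a real dataset would live in a file or on a server.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE students (first_name TEXT, last_name TEXT, grade TEXT)"
)
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?)",
    [
        ("James", "Smith", "6"),
        ("Jill", "Jones", "6"),
        ("Bill", "Deep", "5"),
        ("Bill", "Sun", "5"),
    ],
)
# An index on the join column lets SQLite look up matching rows
# instead of scanning the whole table for each student.
conn.execute("CREATE INDEX idx_first_name ON students (first_name)")

# Self-join: pair each student with every *other* student
# who has the same first name.
query = """
    SELECT a.first_name, a.last_name, b.last_name
    FROM students AS a
    JOIN students AS b
      ON a.first_name = b.first_name AND a.rowid <> b.rowid
    ORDER BY a.last_name, b.last_name
"""
pairs = conn.execute(query).fetchall()
conn.close()
```

Each row in `pairs` holds the shared first name followed by the two last names, so every matching pair appears once in each direction.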
