I need to loop over a dataset which is sorted, grouping all the results

Asked: June 9, 2026 (2026-06-09T10:31:23+00:00)

I need to loop over a dataset which is sorted, grouping all the results by that sorted attribute into chunks which all have the same value for that attribute. Then I run some operations on that chunk of results.

Sorry that’s a bit confusing, examples are probably a better way of describing what I’m doing:

I’ve got a dataset that’s structured like this except the “data” strings are actually objects and contain plenty of other data.

[ [1, "data1"], [1, "data2"], [2, "moredata"], [2, "stuff"], 
  [2, "things"], [2, "foo"], [3, "bar"], [4, "baz"] ]

What I want to happen is for that data to get grouped into 4 different function calls:

process_data(1, ["data1", "data2"])
process_data(2, ["moredata", "stuff", "things", "foo"])
process_data(3, ["bar"])
process_data(4, ["baz"])

What I end up with is a construct that looks something like this:

last_id = None
grouped_data = []

for row in dataset:
    id = row[0]
    data = row[1]

    if last_id != id and grouped_data:
        # we're starting a new group, process the last group
        # (grouped_data is empty on the first row, so nothing is processed then)
        process_data(last_id, grouped_data)
        grouped_data = []
    last_id = id
    grouped_data.append(data)

if grouped_data:
    # we're done with the loop and we still have a last group of data to process
    # if there was no data in the dataset, grouped_data will still be empty
    # so we won't accidentally process any empty data.
    process_data(last_id, grouped_data)

It works, but it seems clumsy, especially the need to track everything with the last_id variable and the second call to process_data after the loop. I’d like to know if anyone can suggest a more elegant/clever solution.

My language of choice is Python, but a general solution is fine.

1 Answer

Editorial Team · Answer 1 · 2026-06-09T10:31:25+00:00

itertools.groupby is just what you want:

>>> data = [ [1, "data1"], [1, "data2"], [2, "moredata"], [2, "stuff"],
...   [2, "things"], [2, "foo"], [3, "bar"], [4, "baz"] ]
>>>
>>> from itertools import groupby
>>> from operator import itemgetter
>>>
>>> def process_data(key, keydata):
...     print(key, ':', keydata)
...
>>> for key,keydata in groupby(data, key=itemgetter(0)):
...   process_data(key, [d[1] for d in keydata])
...
1 : ['data1', 'data2']
2 : ['moredata', 'stuff', 'things', 'foo']
3 : ['bar']
4 : ['baz']

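Pass groupby a list that is already sorted by the grouping key, plus a key function that extracts that key from each item in the list. You get back an iterator of (key, group) pairs, where each group is itself an iterator over the consecutive items sharing that key, as shown being passed to my made-up process_data function. Because groupby only merges adjacent items with equal keys, unsorted input would yield the same key more than once.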
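To make the sorted-input requirement concrete, here is a minimal sketch. It assumes the [id, data] row layout from the question; group_rows and its presorted flag are hypothetical names made up for this illustration, not part of itertools.

```python
from itertools import groupby
from operator import itemgetter


def group_rows(rows, presorted=True):
    """Return a list of (key, [data, ...]) pairs for [id, data] rows.

    group_rows is a hypothetical helper for illustration only.
    """
    if not presorted:
        # sorted() is stable, so rows keep their original order within each id
        rows = sorted(rows, key=itemgetter(0))
    # Materialise each group right away: a group iterator is only valid
    # until groupby advances to the next key
    return [(key, [row[1] for row in group])
            for key, group in groupby(rows, key=itemgetter(0))]


unsorted_rows = [[2, "stuff"], [1, "data1"], [2, "things"], [1, "data2"]]

# Without sorting, every run of equal ids becomes its own group
print(group_rows(unsorted_rows))
# Sorting first merges all rows that share an id
print(group_rows(unsorted_rows, presorted=False))
```

The first call returns four single-item groups, while the second returns one group per id. If the dataset already arrives sorted, as in the question, the extra sort is unnecessary.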

[Added 8 Aug 2023]
I have more details in a pair of blog posts on groupby, starting with this one.
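The question mentions that the real rows are objects rather than [id, data] pairs. The same pattern should carry over with attribute access, roughly as sketched here. The Record class, its field names, and this stand-in process_data are all assumptions made for the example.

```python
from collections import namedtuple
from itertools import groupby
from operator import attrgetter

# Record and its fields (group_id, payload) are hypothetical stand-ins
# for whatever objects the real dataset holds
Record = namedtuple("Record", ["group_id", "payload"])

records = [
    Record(1, "data1"), Record(1, "data2"),
    Record(2, "moredata"), Record(3, "bar"),
]

processed = []


def process_data(key, items):
    # Stand-in for the real processing: just record each call
    processed.append((key, items))


# records must already be sorted by group_id
for key, group in groupby(records, key=attrgetter("group_id")):
    process_data(key, [rec.payload for rec in group])

print(processed)
```

attrgetter("group_id") plays the same role here that itemgetter(0) plays for list rows. A lambda such as lambda rec: rec.group_id would work equally well.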
