Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 817247
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: May 15, 20262026-05-15T01:57:47+00:00 2026-05-15T01:57:47+00:00

I have a program that operates on a large set of experimental data. The

  • 0

I have a program that operates on a large set of experimental data. The data is stored as a list of objects that are instances of a class with the following attributes:

  • time_point – the time of the sample
  • cluster – the name of the cluster of nodes from which the sample was taken
  • node – the name of the node from which the sample was taken
  • qty1 = the value of the sample for the first quantity
  • qty2 = the value of the sample for the second quantity

I need to derive some values from the data set, grouped in three ways – once for the sample as a whole, once for each cluster of nodes, and once for each node. The values I need to derive depend on the (time sorted) cumulative sums of qty1 and qty2: the maximum value of the element-wise sum of the cumulative sums of qty1 and qty2, the time point at which that maximum value occurred, and the values of qty1 and qty2 at that time point.

I came up with the following solution:

dataset.sort(key=operator.attrgetter('time_point'))

# For the whole set
sys_qty1 = 0
sys_qty2 = 0
sys_combo = 0
sys_max = 0

# For the cluster grouping
cluster_qty1 = defaultdict(int)
cluster_qty2 = defaultdict(int)
cluster_combo = defaultdict(int)
cluster_max = defaultdict(int)
cluster_peak = defaultdict(int)

# For the node grouping
node_qty1 = defaultdict(int)
node_qty2 = defaultdict(int)
node_combo = defaultdict(int)
node_max = defaultdict(int)
node_peak = defaultdict(int)

for t in dataset:
  # For the whole system ######################################################
  sys_qty1 += t.qty1
  sys_qty2 += t.qty2
  sys_combo = sys_qty1 + sys_qty2
  if sys_combo > sys_max:
    sys_max = sys_combo
    # The Peak class is to record the time point and the cumulative quantities
    system_peak = Peak(time_point=t.time_point,
                       qty1=sys_qty1,
                       qty2=sys_qty2)
  # For the cluster grouping ##################################################
  cluster_qty1[t.cluster] += t.qty1
  cluster_qty2[t.cluster] += t.qty2
  cluster_combo[t.cluster] = cluster_qty1[t.cluster] + cluster_qty2[t.cluster]
  if cluster_combo[t.cluster] > cluster_max[t.cluster]:
    cluster_max[t.cluster] = cluster_combo[t.cluster]
    cluster_peak[t.cluster] = Peak(time_point=t.time_point,
                                   qty1=cluster_qty1[t.cluster],
                                   qty2=cluster_qty2[t.cluster])
  # For the node grouping #####################################################
  node_qty1[t.node] += t.qty1
  node_qty2[t.node] += t.qty2
  node_combo[t.node] = node_qty1[t.node] + node_qty2[t.node]
  if node_combo[t.node] > node_max[t.node]:
    node_max[t.node] = node_combo[t.node]
    node_peak[t.node] = Peak(time_point=t.time_point,
                             qty1=node_qty1[t.node],
                             qty2=node_qty2[t.node])

This produces the correct output, but I’m wondering if it can be made more readable/Pythonic, and/or faster/more scalable.

The above is attractive in that it only loops through the (large) dataset once, but unattractive in that I’ve essentially copied/pasted three copies of the same algorithm.

To avoid the copy/paste issues of the above, I tried this also:

def find_peaks(level, dataset):

  def grouping(object, attr_name):
    if attr_name == 'system':
      return attr_name
    else:
      return object.__dict__[attrname]

  cuml_qty1 = defaultdict(int)
  cuml_qty2 = defaultdict(int)
  cuml_combo = defaultdict(int)
  level_max = defaultdict(int)
  level_peak = defaultdict(int)

  for t in dataset:
    cuml_qty1[grouping(t, level)] += t.qty1
    cuml_qty2[grouping(t, level)] += t.qty2
    cuml_combo[grouping(t, level)] = (cuml_qty1[grouping(t, level)] +
                                      cuml_qty2[grouping(t, level)])
    if cuml_combo[grouping(t, level)] > level_max[grouping(t, level)]:
      level_max[grouping(t, level)] = cuml_combo[grouping(t, level)]
      level_peak[grouping(t, level)] = Peak(time_point=t.time_point,
                                            qty1=node_qty1[grouping(t, level)],
                                            qty2=node_qty2[grouping(t, level)])
  return level_peak

system_peak = find_peaks('system', dataset)
cluster_peak = find_peaks('cluster', dataset)
node_peak = find_peaks('node', dataset)

For the (non-grouped) system-level calculations, I also came up with this, which is pretty:

dataset.sort(key=operator.attrgetter('time_point'))

def cuml_sum(seq):
  rseq = []
  t = 0
  for i in seq:
    t += i
    rseq.append(t)
  return rseq

time_get = operator.attrgetter('time_point')
q1_get = operator.attrgetter('qty1')
q2_get = operator.attrgetter('qty2')

timeline = [time_get(t) for t in dataset]
cuml_qty1 = cuml_sum([q1_get(t) for t in dataset])
cuml_qty2 = cuml_sum([q2_get(t) for t in dataset])
cuml_combo = [q1 + q2 for q1, q2 in zip(cuml_qty1, cuml_qty2)]

combo_max = max(cuml_combo)
time_max = timeline.index(combo_max)
q1_at_max = cuml_qty1.index(time_max)
q2_at_max = cuml_qty2.index(time_max)

However, despite this version’s cool use of list comprehensions and zip(), it loops through the dataset three times just for the system-level calculations, and I can’t think of a good way to do the cluster-level and node-level calaculations without doing something slow like:

timeline = defaultdict(int)
cuml_qty1 = defaultdict(int)
#...etc.

for c in cluster_list:
  timeline[c] = [time_get(t) for t in dataset if t.cluster == c]
  cuml_qty1[c] = [q1_get(t) for t in dataset if t.cluster == c]
  #...etc.

Does anyone here at Stack Overflow have suggestions for improvements? The first snippet above runs well for my initial dataset (on the order of a million records), but later datasets will have more records and clusters/nodes, so scalability is a concern.

This is my first non-trivial use of Python, and I want to make sure I’m taking proper advantage of the language (this is replacing a very convoluted set of SQL queries, and earlier versions of the Python version were essentially very ineffecient straight transalations of what that did). I don’t normally do much programming, so I may be missing something elementary.

Many thanks!

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-05-15T01:57:48+00:00Added an answer on May 15, 2026 at 1:57 am

    This seems like a classic opportunity to apply a little object-orientation. I would suggest making the derived data a class and abstracting the cumulative sum calculation to something which works on that class.

    Something like:

    class DerivedData(object):
        def __init__(self):
            self.qty1 = 0.0
            self.qty2 = 0.0
            self.combo = 0.0
            self.max = 0.0
            self.peak = Peak(time_point=0.0, qty1=0.0, qty2=0.0)
    
        def accumulate(self, data):
            self.qty1 += data.qty1
            self.qty2 += data.qty2
            self.combo = self.qty1 + self.qty2
            if self.combo > self.max:
                self.max = self.combo
                self.peak = Peak(time_point=data.time_point,
                                 qty1=self.qty1,
                                 qty2=self.qty2)
    
    sys = DerivedData()
    clusters = defaultdict(DerivedData)
    nodes = defaultdict(DerivedData)
    
    dataset.sort(key=operator.attrgetter('time_point'))
    
    for t in dataset:
        sys.accumulate(t)
        clusters[t.cluster].accumulate(t)
        nodes[t.node].accumulate(t)
    

    This solution abstracts out the logic for finding the peaks but still only goes through the data set once.

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

No related questions found

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.