I am writing a function that computes conditional probability for all columns in a pd.DataFrame

Question

Asked: June 19, 2026

I am writing a function that computes the conditional probability for all pairs of columns in a pd.DataFrame that has ~800 columns. I wrote a few versions of the function and found a very large difference in compute time between two primary options:

col_sums = data.sum()   #Simple Column Sum over 800 x 800 DataFrame

Option #1:
('col_sums' and 'data' are a Series and a DataFrame, respectively.)

[This runs inside a loop over index1 and index2 to cover all column combinations.]

joint_occurrence = data[index1] * data[index2]
sum_joint_occurrence = joint_occurrence.sum()
max_single_occurrence = max(col_sums[index1], col_sums[index2])   # col_sums as defined above
cond_prob = sum_joint_occurrence / max_single_occurrence   # Symmetric Conditional Prob
results[index1][index2] = cond_prob

Vs.

Option #2: [While looping over index1 and index2 to get all combinations]
The only difference is that, instead of using the DataFrame inside the loop, I exported the data to an np.array before looping:

new_data = data.T.as_matrix()   # Type: np.array (as_matrix() was removed in pandas 1.0; data.T.to_numpy() is the current equivalent)

Option #1 Runtime is ~1700 sec
Option #2 Runtime is ~122 sec
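
To make the comparison concrete, here is a minimal sketch of both loop variants on a small random 0/1 DataFrame. The function names, the toy data, and the use of to_numpy() in place of as_matrix() are assumptions for illustration, not the original code; it also assumes no column sums to zero.

```python
import numpy as np
import pandas as pd


def cond_prob_dataframe(data: pd.DataFrame) -> pd.DataFrame:
    """Option #1 style: label-based column access inside the double loop."""
    col_sums = data.sum()
    results = pd.DataFrame(0.0, index=data.columns, columns=data.columns)
    for index1 in data.columns:
        for index2 in data.columns:
            # Each data[...] lookup builds a Series and goes through label handling.
            joint = (data[index1] * data[index2]).sum()
            denom = max(col_sums[index1], col_sums[index2])
            results.at[index1, index2] = joint / denom
    return results


def cond_prob_numpy(data: pd.DataFrame) -> np.ndarray:
    """Option #2 style: export to a NumPy array once, then loop over positions."""
    new_data = data.T.to_numpy()      # one row per original column
    col_sums = new_data.sum(axis=1)
    n = new_data.shape[0]
    results = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            joint = (new_data[i] * new_data[j]).sum()
            results[i, j] = joint / max(col_sums[i], col_sums[j])
    return results


# Hypothetical toy data: 50 rows, 6 binary columns.
rng = np.random.default_rng(0)
data = pd.DataFrame(
    rng.integers(0, 2, size=(50, 6)),
    columns=[f"c{k}" for k in range(6)],
)

# Both variants should produce the same matrix; only the access path differs.
same = np.allclose(cond_prob_dataframe(data).to_numpy(), cond_prob_numpy(data))
print(same)
```

Wrapping each call in time.perf_counter() on a larger frame is one way to reproduce the gap on your own machine; the exact ratio will depend on the pandas and NumPy versions.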

Questions:

Is converting the contents of DataFrames to np.arrays best for computational tasks?
Is the .sum() routine in pandas significantly different from the .sum() routine in NumPy, or is the difference in speed due to the label-based access to the data?
Why are these runtimes so different?
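
As a side note on the first question: the same quantity can be computed without any Python-level loop once the data is in a NumPy array. The sketch below is an alternative formulation, not either of the options timed above; the function name and the toy data are hypothetical, and it assumes every column has a nonzero sum.

```python
import numpy as np
import pandas as pd


def cond_prob_vectorized(data: pd.DataFrame) -> pd.DataFrame:
    x = data.to_numpy(dtype=float)
    # Entry (i, j) of x.T @ x is the sum over rows of column_i * column_j.
    joint = x.T @ x
    col_sums = x.sum(axis=0)
    # Pairwise max of the column sums, i.e. max(col_sums[i], col_sums[j]).
    denom = np.maximum.outer(col_sums, col_sums)
    return pd.DataFrame(joint / denom, index=data.columns, columns=data.columns)


# Hypothetical toy data with three binary columns.
data = pd.DataFrame({"a": [1, 1, 0, 1], "b": [1, 0, 0, 1], "c": [0, 1, 1, 0]})
result = cond_prob_vectorized(data)
print(result)
```

Here the labels are only reattached at the end, so all of the arithmetic runs on plain arrays.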

1 Answer

Editorial Team · Answer 1 · 2026-06-19T01:09:00+00:00

While reading the documentation I came across:

Section 7.1.1 Fast scalar value getting and setting

Since indexing with [] must handle a lot of cases (single-label access, slicing, boolean indexing, etc.), it has a bit of overhead in order to figure out what you're asking for. If you only want to access a scalar value, the fastest way is to use the get_value method, which is implemented on all of the data structures:

In [656]: s.get_value(dates[5])
Out[656]: -0.67368970808837059
In [657]: df.get_value(dates[5], 'A')
Out[657]: -0.67368970808837059

Best Guess:
Because I am accessing individual data elements from the DataFrame many times (on the order of ~640,000 per matrix), I think the slowdown came from how I referenced the data (i.e. "indexing with [] handles a lot of cases"). I should therefore be using the get_value() method to access scalars, similar to a matrix lookup.
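
A small sketch of the scalar-access paths discussed above. One caveat: get_value() was deprecated and then removed in pandas 1.0, so this sketch uses the .at accessor, which is the label-based scalar accessor in current pandas. The DataFrame contents and variable names here are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 8 dates x 4 columns, filled with 0.0 .. 31.0.
dates = pd.date_range("2000-01-01", periods=8)
df = pd.DataFrame(
    np.arange(32, dtype=float).reshape(8, 4),
    index=dates,
    columns=list("ABCD"),
)

# General-purpose [] indexing: column lookup, then a label lookup on the Series.
via_brackets = df["A"][dates[5]]

# Dedicated scalar accessor (label-based), standing in for get_value().
via_at = df.at[dates[5], "A"]

# Positional lookup on the exported array, as in a plain matrix.
values = df.to_numpy()
via_numpy = values[5, 0]

print(via_brackets, via_at, via_numpy)
```

All three return the same value; they differ in how much label resolution happens per access, which can be measured with timeit if needed.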
