I am writing a function that computes the conditional probability for all pairs of columns in a pd.DataFrame that has ~800 columns. I wrote a few versions of the function and found a very large difference in compute time between the two primary options:
col_sums = data.sum() #Simple Column Sum over 800 x 800 DataFrame
Option #1:
{‘col_sums’ and ‘data’ are a Series and DataFrame respectively}
[This is contained within a loop over index1 and index2 to get all combinations]
joint_occurance = data[index1] * data[index2]
sum_joint_occurance = joint_occurance.sum()
max_single_occurance = max(col_sums[index1], col_sums[index2])
cond_prob = sum_joint_occurance / max_single_occurance #Symmetric Conditional Prob
results[index1][index2] = cond_prob
Vs.
Option #2: [While looping over index1 and index2 to get all combinations]
The only difference is that, instead of using the DataFrame, I exported the data to a np.array prior to looping:
new_data = data.T.as_matrix() [Type: np.array]
Option #1 Runtime is ~1700 sec
Option #2 Runtime is ~122 sec
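To make the comparison reproducible, here is a minimal sketch of both options on a small random 0/1 DataFrame. The data, its size, and the function names option1 and option2 are assumptions for illustration only, and the sketch uses to_numpy() in place of as_matrix() on the assumption of a recent pandas version:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data: a small random 0/1 DataFrame.
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.integers(0, 2, size=(50, 20)))
data.iloc[0] = 1  # make sure no column sum is zero (avoids division by zero)
col_sums = data.sum()  # simple column sums, as in the question


def option1(data, col_sums):
    """Option #1: label-based access on the DataFrame inside the double loop."""
    cols = list(data.columns)
    results = pd.DataFrame(0.0, index=cols, columns=cols)
    for index1 in cols:
        for index2 in cols:
            joint_occurance = data[index1] * data[index2]
            sum_joint_occurance = joint_occurance.sum()
            max_single_occurance = max(col_sums[index1], col_sums[index2])
            results.loc[index2, index1] = sum_joint_occurance / max_single_occurance
    return results


def option2(data):
    """Option #2: export to a NumPy array once, then loop over array rows."""
    new_data = data.T.to_numpy()  # each row of new_data is one column of data
    sums = new_data.sum(axis=1)
    n = new_data.shape[0]
    results = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            joint = (new_data[i] * new_data[j]).sum()
            results[i, j] = joint / max(sums[i], sums[j])
    return results


# Both options should produce the same matrix; only the access path differs.
same = np.allclose(option1(data, col_sums).to_numpy(), option2(data))
print("results match:", same)
```

Wrapping each call in a timer on the full ~800-column data should show whether the gap reported above is reproduced.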
Questions:
- Is converting the contents of DataFrames to np.arrays best for computational tasks?
- Is the .sum() routine in pandas significantly different from the .sum() routine in NumPy, or is the difference in speed due to the label-based access to data?
- Why are these runtimes so different?
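One way I could try to separate the two effects in the second question is to time .sum() on a Series against .sum() on the underlying array, with no indexing involved. This is only a rough sketch: the column length of 800 and the repeat count are assumptions, and the timings will vary by machine and library version.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical 0/1 column of length 800, similar in size to one of my columns.
rng = np.random.default_rng(0)
series = pd.Series(rng.integers(0, 2, size=800))
array = series.to_numpy()

# Both routines should agree on the value; only the call overhead may differ.
assert series.sum() == array.sum()

# Time 1000 calls of each .sum() with no label lookups in the timed code.
t_pandas = timeit.timeit(series.sum, number=1000)
t_numpy = timeit.timeit(array.sum, number=1000)

print(f"pandas Series.sum:  {t_pandas:.4f} s per 1000 calls")
print(f"numpy  ndarray.sum: {t_numpy:.4f} s per 1000 calls")
```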
While reading the documentation I came across:
Best Guess:
Because I am accessing individual data elements from the DataFrame many times (on the order of ~640,000 per matrix), I think the slowdown came from how I referenced the data (i.e. “indexing with [] handles a lot of cases”), and therefore I should be using the get_value() method for accessing scalars, similar to a matrix lookup.
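To illustrate the access paths I am comparing, here is a small sketch on a toy DataFrame (the frame and variable names are made up for illustration). As far as I know, newer pandas versions no longer provide get_value(), so the sketch uses the .at scalar accessor in its place; treat that substitution as an assumption about the installed version.

```python
import numpy as np
import pandas as pd

# Toy 3 x 4 frame; the values are 0..11 laid out row by row.
df = pd.DataFrame(np.arange(12).reshape(3, 4), columns=list("abcd"))

# General-purpose [] indexing: select column "b", then the row labelled 1.
via_brackets = df["b"][1]

# Label-based scalar accessor, intended for single-value lookups.
via_at = df.at[1, "b"]

# Plain positional lookup on the exported ndarray (row 1, column 1).
via_array = df.to_numpy()[1, 1]

# All three paths should return the same scalar.
print(via_brackets, via_at, via_array)
```

All three lookups return the same element; what I want to understand is how much per-call overhead each path adds when repeated hundreds of thousands of times.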