I noticed some strange behavior when using .ix on large pandas DataFrames.
When I called .ix on the same DataFrame in 50 passes in a row, it ran several times faster (about 6 times in the timings shown in the output below) than when I called .ix on 50 different DataFrames.
Is there caching going on behind the scenes in .ix? I also noticed that the second loop doubles my memory usage. Why would the memory usage increase?
Is there any way to modify this behavior?
Note that with plain NumPy arrays it ran in 7.4 seconds in both cases with no memory increase, which is what led me to believe pandas was caching.
Obviously you never want to call .ix on each individual element…
import pandas as pd
import numpy as np
import datetime as dt

print 'pandas', pd.__version__

# Build 50 separate DataFrames, each 50 rows x 17000 columns
li_list = []
for i in range(50):
    li_list.append(pd.DataFrame(data=np.random.randn(50, 17000)))

print 'starting'

# Loop 1: always index into the same DataFrame
dt_start = dt.datetime.now()
a = 0
for i in range(50):
    b = li_list[0]  # Only access first element
    for j in b.columns:
        a += b.ix[i, j]
print (dt.datetime.now() - dt_start).total_seconds()

# Loop 2: index into a different DataFrame on each iteration
dt_start = dt.datetime.now()
a = 0
for i in range(50):
    b = li_list[i]  # Access all in list
    for j in b.columns:
        a += b.ix[i, j]
print (dt.datetime.now() - dt_start).total_seconds()
Output:
pandas 0.9.1
starting
3.651
22.009
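Since calling .ix on each element is only for demonstration, the same per-row sums can be computed with one vectorized call per row. The following is a minimal sketch, assuming the data is available as a plain 2-D NumPy array (for a DataFrame b, b.values provides one). The helper names and the much smaller array sizes are illustrative choices, not part of the original post, and the sketch is written for Python 3.

```python
import numpy as np


def sum_row_elementwise(arr, i):
    """Add up row i one element at a time, mirroring the loops in the question."""
    total = 0.0
    for j in range(arr.shape[1]):
        total += arr[i, j]
    return total


def sum_row_vectorized(arr, i):
    """Add up row i with a single NumPy reduction."""
    return arr[i, :].sum()


# Small stand-ins for the 50 frames of shape (50, 17000) in the question.
rng = np.random.RandomState(0)
frames = [rng.randn(5, 1000) for _ in range(5)]

slow = sum(sum_row_elementwise(frames[i], i) for i in range(5))
fast = sum(sum_row_vectorized(frames[i], i) for i in range(5))

# Both approaches should agree up to floating-point rounding.
print(np.allclose(slow, fast))
```

The vectorized version performs no per-element label lookup at all, so it sidesteps the indexing overhead being measured in the question.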
Note: the first time you look up a location in an axis index, there is a hash table population step for that index. That’s probably what you’re seeing here: the first loop pays this cost once, for a single DataFrame, while the second loop pays it again for each of the 50 DataFrames. The effect would be obscured by using timeit, because the hash table is computed once, stored, and reused. The stored hash tables also explain the increased memory usage.
In a future version of pandas I plan to improve the performance of this type of code on simple data with simple sequential axis indexes. I’ll record your use case on the GitHub issue tracker.
https://github.com/pydata/pandas/issues/2420
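To illustrate the mechanism described above, here is a minimal pure-Python sketch of an index that builds its lookup table lazily. It is not pandas' actual implementation: the class LazyIndex and the helper timed_lookup are hypothetical names introduced only for this example, and a plain dict stands in for the hash table. It is written for Python 3.

```python
import time


class LazyIndex(object):
    """Toy axis index that builds its label -> position table on first lookup."""

    def __init__(self, labels):
        self.labels = list(labels)
        self._table = None  # hash table is not built until it is needed

    def get_loc(self, label):
        if self._table is None:
            # One-time O(n) population step; the table is then kept in
            # memory and reused, which costs O(n) extra space per index.
            self._table = {lab: pos for pos, lab in enumerate(self.labels)}
        return self._table[label]


def timed_lookup(index, label):
    """Return (position, elapsed seconds) for a single lookup."""
    start = time.perf_counter()
    pos = index.get_loc(label)
    return pos, time.perf_counter() - start


index = LazyIndex(range(17000))
pos_first, t_first = timed_lookup(index, 16999)    # builds the table
pos_second, t_second = timed_lookup(index, 16999)  # reuses the table

print(pos_first, pos_second)
print("first lookup took %.6f s, second took %.6f s" % (t_first, t_second))
```

Under this model, looping over one object pays the population cost once, while looping over 50 distinct objects pays it 50 times and keeps 50 tables alive. Repeated timing of the same object mostly measures the cheap, already-populated case.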