I have a rather large DataFrame of shape (2678271, 52) with a 5-level hierarchical index; it consumes 6.5% of the machine’s memory.
When I call
df.sortlevel(k)
I receive the following error:
MemoryError Traceback (most recent call last)
in ()
----> 1 df = df.sortlevel(4)
/usr/local/lib/python2.7/dist-packages/pandas-0.9.1-py2.7-linux-x86_64.egg/pandas/core/frame.pyc in sortlevel(self, level, axis, ascending)
2978 raise Exception('can only sort by level with a hierarchical index')
2979
-> 2980 new_axis, indexer = the_axis.sortlevel(level, ascending=ascending)
2981
2982 if self._data.is_mixed_dtype():
/usr/local/lib/python2.7/dist-packages/pandas-0.9.1-py2.7-linux-x86_64.egg/pandas/core/index.pyc in sortlevel(self, level, ascending)
1856 indexer = _indexer_from_factorized((primary,) + tuple(labels),
1857 (primshp,) + tuple(shape),
-> 1858 compress=False)
1859 if not ascending:
1860 indexer = indexer[::-1]
/usr/local/lib/python2.7/dist-packages/pandas-0.9.1-py2.7-linux-x86_64.egg/pandas/core/groupby.pyc in _indexer_from_factorized(labels, shape, compress)
2124 max_group = np.prod(shape)
2125
-> 2126 indexer, _ = lib.groupsort_indexer(comp_ids.astype(np.int64), max_group)
2127
2128 return indexer
/usr/local/lib/python2.7/dist-packages/pandas-0.9.1-py2.7-linux-x86_64.egg/pandas/lib.so in pandas.lib.groupsort_indexer (pandas/src/tseries.c:55052)()
MemoryError:
Is there a hard-coded condition that throws this error? Or is it possible that, even though the data uses only 6.5% of the memory (according to htop), the operation eats up the remaining memory?
Can you move this to GitHub? I need to review the code, but there are a number of edge cases where I didn’t test really deeply-“leveled” hierarchical indexes, so this is probably a legitimate bug.
EDIT: This has been fixed in v0.10.1.
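On the question of whether the operation itself can exhaust memory: the traceback shows `max_group = np.prod(shape)` being passed to `lib.groupsort_indexer`, so one plausible reading is that the sort sizes a working array by the product of the level sizes rather than by the number of rows. The sketch below only illustrates that arithmetic. It assumes one int64 counter per possible label combination, which is an assumption about the 0.9.1 internals and not something the traceback proves. The function name `dense_group_count_bytes` and the level sizes are hypothetical, since the question does not give the real cardinalities.

```python
def dense_group_count_bytes(level_sizes, itemsize=8):
    """Estimate the bytes needed for one counter per possible label combination.

    Assumption: the sort allocates (max_group + 1) counters of `itemsize`
    bytes each, where max_group is the product of the level sizes.
    """
    max_group = 1
    for n in level_sizes:
        # Plain Python ints cannot overflow, so the product stays exact
        # even when it would not fit in a 64-bit integer.
        max_group *= int(n)
    return (max_group + 1) * itemsize


# Hypothetical cardinalities for a 5-level index (not taken from the question).
level_sizes = [50, 40, 30, 20, 100]
needed = dense_group_count_bytes(level_sizes)
print("possible combinations:", needed // 8 - 1)
print("estimated counter array: %.2f GB" % (needed / 1e9))
```

If this reading is right, the estimate depends only on how many distinct values each level has, not on the 2678271 rows, which would explain how a frame using 6.5% of memory could still trigger a `MemoryError` during the sort.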