This question is about filtering a NumPy ndarray according to some column values.
I have a fairly large NumPy ndarray (300000, 50) and I am filtering it according to the values in specific columns. The array has named dtype fields, so I can access each column by name.
The first column is named category_code and I need to filter the matrix to return only rows where category_code is in ("A", "B", "C").
The result would need to be another NumPy ndarray whose columns are still accessible by the dtype names.
Here is what I do now:
index = numpy.asarray([row['category_code'] in ('A', 'B', 'C') for row in data])
filtered_data = data[index]
A plain list comprehension like:
rows = [row for row in data if row['category_code'] in ('A', 'B', 'C')]
filtered_data = numpy.asarray(rows)
wouldn’t work because the dtypes I originally had are no longer accessible.
Is there a better, more Pythonic way of achieving the same result?
Something that could look like:
filtered_data = data.where({'category_code': ('A', 'B', 'C')})
Thanks!
You can use Pandas, a library built on NumPy, whose DataFrame gives you named columns and more convenient row filtering than a raw ndarray.
First, create some sample data as a Python dictionary whose keys are the column names and whose values are the column values as Python lists, one key/value pair per column:
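The original sample data is not shown here, so the following is only a minimal sketch of that step. The column name `value` and all of the values are made up for illustration; only `category_code` comes from the question.

```python
import pandas as pd

# Hypothetical sample data: one key/value pair per column.
# Keys are column names, values are the column contents as lists.
sample = {
    "category_code": ["A", "B", "C", "D", "B", "A"],
    "value": [10, 20, 30, 40, 50, 60],  # made-up second column
}

# Build the data frame; each dictionary key becomes a named column.
df = pd.DataFrame(sample)
```

Columns stay accessible by name afterwards, for example `df["category_code"]`.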
To return just the rows in which category_code is either B or C takes two steps conceptually (build a boolean index, then select with it), but it can easily be done in a single line:
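A sketch of how this could look, assuming a small data frame with a made-up `value` column next to `category_code`. It uses `Series.isin` to build the boolean index, which is one common way to do this and not necessarily the exact code the answer originally showed.

```python
import pandas as pd

# Hypothetical sample data; only the column name category_code is from the question.
df = pd.DataFrame({
    "category_code": ["A", "B", "C", "D", "B", "A"],
    "value": [10, 20, 30, 40, 50, 60],
})

# Step 1: build a boolean index, True where category_code is B or C.
mask = df["category_code"].isin(["B", "C"])

# Step 2: use the boolean index to select the matching rows.
filtered = df[mask]

# The same thing in a single line.
filtered_one_line = df[df["category_code"].isin(["B", "C"])]
```

For the filter in the question, the list passed to `isin` would be `["A", "B", "C"]`.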
Note the difference between indexing in Pandas and indexing in NumPy, the library upon which Pandas is built. In NumPy, you place the index directly inside the brackets, use a “,” to separate the dimensions, and use “:” to indicate that you want all of the values (columns) in the other dimension:
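A minimal sketch of the NumPy side, using made-up arrays. The first part shows the `[index, :]` form on an ordinary 2-D array. The second part is an assumption about how the same boolean-index idea applies to an array with named fields like the one in the question, using `numpy.isin` to build the index without a Python-level loop.

```python
import numpy as np

# Ordinary 2-D array: 4 rows, 3 columns (made-up values).
arr = np.arange(12).reshape(4, 3)

# Boolean index over the rows; ":" keeps every column.
rows = np.array([True, False, True, False])
selected = arr[rows, :]

# Array with named fields (hypothetical field "value" added for illustration).
data = np.array(
    [("A", 1.0), ("B", 2.0), ("D", 3.0), ("C", 4.0)],
    dtype=[("category_code", "U1"), ("value", "f8")],
)

# Vectorized membership test on one named field.
mask = np.isin(data["category_code"], ["A", "B", "C"])

# A structured array is 1-D, so the index goes in the brackets on its own.
filtered_data = data[mask]
```

The result keeps the original dtype, so `filtered_data["category_code"]` and `filtered_data["value"]` still work.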
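In Pandas, you use the data frame’s ix indexer and place only the index inside the brackets (ix was removed in pandas 1.0; in current versions, loc plays the same role for boolean and label-based indexing):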
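Because `ix` is no longer available in current pandas releases, this sketch uses `loc` instead, which accepts a boolean index in the same position. The data frame and its `value` column are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data; only category_code is from the question.
df = pd.DataFrame({
    "category_code": ["A", "B", "C", "D", "B", "A"],
    "value": [10, 20, 30, 40, 50, 60],
})

# Boolean index: True for the rows to keep.
mask = df["category_code"].isin(["B", "C"])

# Only the row index goes inside the brackets; all columns are returned.
filtered = df.loc[mask]

# A column label can optionally follow the row index.
values_only = df.loc[mask, "value"]
```

Unlike the NumPy form, no “:” is needed to ask for every column.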