I would like to partially “collapse” a DataFrame/matrix and keep the structure intact by just summing the condensed values. For example, I have this:
CHROM POS GENE DESC JOE FRED BILLY SUSAN TONY
10 1442 LOXL4 bad 1 0 0 1 0
10 335 LOXL4 bad 1 0 0 0 0
10 3438 LOXL4 good 0 0 1 0 0
10 4819 PYROXD2 bad 0 1 0 0 0
10 4829 PYROXD2 bad 0 1 0 1 0
10 9851 HPS1 good 1 0 0 0 0
The first 4 columns are descriptors, and the last 5 columns are people/observations. The end goal is to count the total number of “good” and “bad” observations per GENE per person. Thus, I want this:
GENE DESC JOE FRED BILLY SUSAN TONY
LOXL4 bad 2 0 0 1 0
LOXL4 good 0 0 1 0 0
PYROXD2 bad 0 2 0 1 0
HPS1 good 1 0 0 0 0
The following code collapses all the individual observations (Joe, Fred, etc.) into a single count. How can I keep them separate? I would also like the solution to be flexible enough to accommodate more individuals in the future (keeping the same 4 descriptor columns).
mytable.groupby(['GENE','DESC']).size()
Just use the aggregate method of the groupby object:
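A minimal sketch of that call, assuming the table from the question is loaded as mytable (rebuilt by hand here) and that every column other than the 4 descriptors is a person. The names descriptors and people are only for illustration; deriving people from the column list means new individuals are picked up automatically.

```python
import pandas as pd

# The example table from the question, rebuilt by hand.
mytable = pd.DataFrame({
    "CHROM": [10, 10, 10, 10, 10, 10],
    "POS": [1442, 335, 3438, 4819, 4829, 9851],
    "GENE": ["LOXL4", "LOXL4", "LOXL4", "PYROXD2", "PYROXD2", "HPS1"],
    "DESC": ["bad", "bad", "good", "bad", "bad", "good"],
    "JOE": [1, 1, 0, 0, 0, 1],
    "FRED": [0, 0, 0, 1, 1, 0],
    "BILLY": [0, 0, 1, 0, 0, 0],
    "SUSAN": [1, 0, 0, 0, 1, 0],
    "TONY": [0, 0, 0, 0, 0, 0],
})

# Everything that is not a descriptor is treated as a person column.
descriptors = ["CHROM", "POS", "GENE", "DESC"]
people = [c for c in mytable.columns if c not in descriptors]

# sort=False keeps the groups in order of first appearance.
grouped = mytable.groupby(["GENE", "DESC"], sort=False)[people]

# Sum each person column within every (GENE, DESC) group.
by_agg = grouped.aggregate("sum").reset_index()

# The built-in sum() method on the groupby object does the same thing.
by_sum = grouped.sum().reset_index()

print(by_agg)
```

Selecting only the people columns before aggregating keeps CHROM and POS from being summed along with the counts.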
As mentioned by Daniel Velkow in a comment, the groupby object has some built-in methods for simple aggregations like sum, mean, … (similar to the ufuncs in numpy, which are also available as methods on numpy arrays). So the last step can be simplified further by calling sum() directly on the groupby object.
If you want different operations on each column, according to the docs you can pass a dict to aggregate. However, I found no way to specify a function for a single column and use a default for the others. So one way would be to define a custom aggregation function
and then apply it by passing the function and the args to agg:
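A rough sketch of how this could look, not necessarily the function originally posted. The name agg_with_default and its parameters special and default are made up for illustration. It assumes that, when grouping by two keys, agg hands the function each column of each group as a Series whose name is the column label; that is worth checking on your pandas version. As an example, POS is reduced with min while every other column falls back to sum. The explicit dict at the end is the equivalent "pass a dict" form, built from the column list so it also adapts to new individuals.

```python
import pandas as pd

# The example table from the question, rebuilt by hand.
mytable = pd.DataFrame({
    "CHROM": [10, 10, 10, 10, 10, 10],
    "POS": [1442, 335, 3438, 4819, 4829, 9851],
    "GENE": ["LOXL4", "LOXL4", "LOXL4", "PYROXD2", "PYROXD2", "HPS1"],
    "DESC": ["bad", "bad", "good", "bad", "bad", "good"],
    "JOE": [1, 1, 0, 0, 0, 1],
    "FRED": [0, 0, 0, 1, 1, 0],
    "BILLY": [0, 0, 1, 0, 0, 0],
    "SUSAN": [1, 0, 0, 0, 1, 0],
    "TONY": [0, 0, 0, 0, 0, 0],
})


def agg_with_default(x, special, default):
    """Hypothetical helper: aggregate one column of one group.

    Uses special[x.name] if the column has its own function,
    otherwise falls back to default. Assumes x is a named Series.
    """
    func = special.get(x.name, default)
    return func(x)


# Aggregate POS plus all person columns; CHROM is left out.
value_cols = [c for c in mytable.columns if c not in ("CHROM", "GENE", "DESC")]
grouped = mytable.groupby(["GENE", "DESC"], sort=False)[value_cols]

# Extra positional arguments after the function are forwarded to it.
custom = grouped.agg(agg_with_default, {"POS": pd.Series.min}, pd.Series.sum)

# Equivalent explicit dict, generated from the column list.
explicit = grouped.agg({c: "min" if c == "POS" else "sum" for c in value_cols})

print(custom)
```

If the function turns out to receive something other than a named Series on your setup, the generated dict is the safer route.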