I have some R code I need to port to Python. However, R’s magic data.frame and ddply are keeping me from finding a good way to do this in Python.
Sample data (R):
x <- data.frame(d=c(1,1,1,2,2,2),c=c(rep(c('a','b','c'),2)),v=1:6)
Sample computation:
y <- ddply(x, 'd', transform, v2=(v-min(v))/(max(v)-min(v)))
Sample output:
d c v v2
1 1 a 1 0.0
2 1 b 2 0.5
3 1 c 3 1.0
4 2 a 4 0.0
5 2 b 5 0.5
6 2 c 6 1.0
So here’s my question for the pythonistas out there: how would you do the same? You have a data structure with a couple of important dimensions.
For each (c) and each (d), compute (v-min(v))/(max(v)-min(v)) and associate it with the corresponding (d,c) pair.
Feel free to use whatever data structures you want, so long as they’re quick on reasonably large datasets (those that fit in memory).
Indeed, pandas is the right (and only, I believe) tool for this in Python. It’s a bit less magical than plyr, but here’s how to do this using the groupby functionality:
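As a starting point, here is a minimal sketch of the question's sample data as a pandas DataFrame. The name x mirrors the R snippet; writing the columns as plain Python lists is just one convenient choice, not the only way to build it.

```python
import pandas as pd

# Same three columns as the R data.frame in the question.
x = pd.DataFrame({
    "d": [1, 1, 1, 2, 2, 2],        # grouping column
    "c": ["a", "b", "c"] * 2,       # label within each group
    "v": list(range(1, 7)),         # values to rescale: 1..6
})
print(x)
```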
Now write a small transform function:
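One way such a function might look, assuming it receives the v values of a single group as a pandas Series. The signature and the name f are illustrative choices here.

```python
import pandas as pd

def f(v: pd.Series) -> pd.Series:
    """Rescale one group's values to the range [0, 1]."""
    # Series.min() and Series.max() skip missing values by default,
    # so NaN entries stay NaN without affecting the other rows.
    return (v - v.min()) / (v.max() - v.min())
```

If every value in a group is identical, max and min coincide and the result is NaN for that group, so you may want to special-case that.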
Note that this also handles NAs since the v variable is a pandas Series object.
Now group by the d column and apply f:
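Putting it together, a sketch that runs on its own (the data and f are repeated for that reason). Note that this uses groupby(...)["v"].transform(f) rather than calling apply on whole groups. That is a substitution on my part: transform returns a result aligned to the original rows, and it avoids depending on how apply treats the grouping column, which has changed between pandas versions.

```python
import pandas as pd

# Sample data from the question, repeated so this snippet is self-contained.
x = pd.DataFrame({
    "d": [1, 1, 1, 2, 2, 2],
    "c": list("abc") * 2,
    "v": list(range(1, 7)),
})

def f(v: pd.Series) -> pd.Series:
    # Min-max scaling of one group's values.
    return (v - v.min()) / (v.max() - v.min())

# Group by d, rescale v within each group, and attach the result as v2.
# transform keeps the original row index, so it lines up with x directly.
y = x.assign(v2=x.groupby("d")["v"].transform(f))
print(y)
```

For the sample data, v2 comes out as 0.0, 0.5, 1.0 within each d group, matching the ddply output in the question.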