I want to perform winsorization in a dataframe like this:
event_date beta_before beta_after
2000-05-05 1.2911707054 1.3215648954
1999-03-30 0.5089734305 0.4269575657
2000-05-05 0.5414700258 0.5326762272
2000-02-09 1.5491034852 1.2839988507
1999-03-30 1.9380674599 1.6169735009
1999-03-30 1.3109909155 1.4468207148
2000-05-05 1.2576420753 1.3659492507
1999-03-30 1.4393018341 0.7417777965
2000-05-05 0.2624037804 0.3860641307
2000-05-05 0.5532216441 0.2618245169
2000-02-08 2.6642931822 2.3815576738
2000-02-09 2.3007578964 2.2626960407
2001-08-14 3.2681270302 2.1611010935
2000-02-08 2.2509121123 2.9481325199
2000-09-20 0.6624503316 0.947935581
2006-09-26 0.6431111805 0.8745333151
By winsorization I mean to find the max and min for beta_before for example. That value should be replaced by the second highest or second lowest value in the same column, without loosing the rest of the details in the observation. For example. In this case, in beta_before the max value is 3.2681270302 and should be replaced by 3.2681270302. The same process will be followed for the min and then for the beta_after variable. Therefore, only 2 values per column will be changes, the highest and the minimum, the rest will remain the same.
Any advice? I tried different approaches in plyr, but I ended up replacing the whole observation, which I don’t want to do. I would like to create 2 new variables, for example beta_before_winsorized and beta _after_winsorized
Here is a function that does the winsorzation you describe:
If you data are in a data frame
dat, then we can windsoroize the data using your procedure via:which results in:
I’m not sure where you got the value you suggest should replace the max in
beta_beforeas the second highest is2.6642932in the snippet of data provided and that is what my function has used to replace with the maximum value with.Note the function will only work if there is one minimum and maximum values respectively in each column owing to the way
which.min()andwhich.max()are documented to work. If you have multiple entries taking the same max or min value then we would need something different:should do it (latter is not tested).