I am currently facing the following problem, which I'm trying to solve in Stata. I have added the algorithm tag because it's mainly the steps that I'm interested in, rather than the Stata code.
I have some variables, say var1 through var20, each of which may contain a string. I am only interested in some of these strings, let us call them A,B,C,D,E,F, but other strings can also occur (all of these will be denoted X). I also have a unique identifier ID. A part of the data could look like this:
ID | var1 | var2 | var3 | .. | var20
1 | E | | | | X
1 | | A | | | C
2 | X | F | A | |
8 | | | | | E
Now I want to create an entry for every ID and for every occurrence of one of the strings A,B,C,D,E,F in any of the variables. The above data should then look like this:
ID | var1 | var2 | var3 | .. | var20
1 | E | | | |
1 | | A | | |
1 | | | | | C
2 | | F | | |
2 | | | A | |
8 | | | | | E
Here every string X that is not one of A,B,C,D,E,F is ignored. My attempt so far was to create a variable that counts, for each entry, the number N of occurrences of A,B,C,D,E,F. In the original data above that variable would be N=1,2,2,1. Then I expand each entry into N copies of itself. This results in the data:
ID | var1 | var2 | var3 | .. | var20
1 | E | | | | X
1 | | A | | | C
1 | | A | | | C
2 | X | F | A | |
2 | X | F | A | |
8 | | | | | E
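In Stata, the counting and duplicating step just described could be sketched roughly as follows. It is only a sketch under assumptions: the empty columns var4 through var19 are left out for brevity, and the helper variables `N` and `row` are names chosen purely for illustration.

```stata
clear
input byte id str1 var1 str1 var2 str1 var3 str1 var20
1 "E" ""  ""  "X"
1 ""  "A" ""  "C"
2 "X" "F" "A" ""
8 ""  ""  ""  "E"
end

* remember the original row order (illustrative helper variable)
generate row = _n

* count, per row, how many var* columns hold one of the wanted strings
generate N = 0
foreach v of varlist var* {
    replace N = N + inlist(`v', "A", "B", "C", "D", "E", "F")
}

* rows without any wanted string would not be needed at all
drop if N == 0

* expand N leaves N copies of each row in total (the original plus N-1 new ones)
expand N
sort row
list id var* N, sepby(row)
```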
My question is: how do I proceed from here? And sorry for the poor title, but I couldn't word it any more specifically.
Sorry, I thought the final block was your desired output (now I understand that it's what you've accomplished so far). You can get the middle block with two calls to `reshape` (`long`, then `wide`).
First I'll generate data to match yours.
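A minimal sketch of such test data could look like this. It assumes the row counter is called `n` (the name used in the `i(n id)` option discussed in this answer) and that every column from `var4` through `var19` is empty; both are assumptions made for illustration.

```stata
clear
input byte n byte id str1 var1 str1 var2 str1 var3 str1 var20
1 1 "E" ""  ""  "X"
2 1 ""  "A" ""  "C"
3 2 "X" "F" "A" ""
4 8 ""  ""  ""  "E"
end

* add the remaining (empty) string columns var4 through var19
forvalues k = 4/19 {
    generate str1 var`k' = ""
}

* put var20 back at the end so the columns run var1 ... var20
order var20, last
list n id var1 var2 var3 var20
```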
Now the two calls to `reshape`.
The first `reshape` converts your data from wide to long. The `var` specifies that the variables you want to reshape to long all start with `var`. The `i(n id)` specifies that each unique combination of `n` and `id` is a unique observation. The `reshape` call provides one observation for each `n`-`id` combination for each of your `var1` through `var20` variables, so now there are 4*20=80 observations. Then I keep only the strings that you'd like to keep with `inlist()`.
For the second `reshape` call, `var` specifies that the values you're reshaping are in the variable `var` and that you'll use this as the prefix. You wanted one row per remaining letter, so I made a new index (which has no real meaning in the end) that becomes the `i` index for the second `reshape` call. If I had used `n`-`id` as the unique observation, we'd end up back where we started, but with only the good strings. The `j` index remains from the first `reshape` call (variable `_j`), so `reshape` already knows what suffix to give to each `var`.
These two `reshape` calls yield the middle block from your question.
You can easily add back variables that don't survive the two `reshape`s.
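The two `reshape` steps described above can be sketched as follows. This is a hedged reconstruction rather than a definitive implementation: the variable names `n` and `index` are assumptions, the example data is rebuilt inline so the block runs on its own, and `j(_j)` is spelled out explicitly in the second call instead of relying on what `reshape` remembers from the first.

```stata
clear
input byte n byte id str1 var1 str1 var2 str1 var3 str1 var20
1 1 "E" ""  ""  "X"
2 1 ""  "A" ""  "C"
3 2 "X" "F" "A" ""
4 8 ""  ""  ""  "E"
end
forvalues k = 4/19 {
    generate str1 var`k' = ""
}

* wide -> long: one observation per (n, id) pair and per var# column;
* with no j() option, the suffix is stored in the default variable _j
reshape long var, i(n id)

* keep only the strings of interest (drops blanks and every other string)
keep if inlist(var, "A", "B", "C", "D", "E", "F")

* one row per surviving string: a fresh index with no meaning of its own
generate index = _n

* forget the settings remembered from the first call, then go long -> wide
reshape clear
reshape wide var, i(index) j(_j)

list id var*, sepby(id)
```

Only the suffixes that still occur in the long data come back as columns, so with this example data the result contains `var1`, `var2`, `var3` and `var20`.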