I am working on a research project with baseball data from retrosheet.org. I want to create variables for the score of each team in each inning (Vis1, Home1, Vis2, Home2, etc.). The problem is that the variable for the box score is coded strangely. Each team has a single variable for the whole game, and within it each inning gets one value. Because leading zeros are cut off, a value of “12(10)1X” would mean that a team did not score in the first 4 innings, scored once in the fifth, twice in the sixth, ten times in the seventh, once in the eighth, and did not have to bat in the ninth because they had already won by that point.
Any advice? I’m at a loss. The () confuse me the most.
I’m Irish and live in Wales and have no clue about baseball, but I think I remember hearing that there can only be a maximum of 9 innings? (Honestly, no clue!)
Of course, you’ll have to do some further manipulation to turn them into numbers and deal with the X’s. This will also break if there are any other unexpected characters, and it depends on the assumption of 9 innings.
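For what it’s worth, here is a minimal sketch in base R of that extra step: split one line score into per-inning tokens, turn them into numbers, and handle the X. The function name parse_line_score and its innings argument are made up for illustration. It assumes the value arrives as a character string, that the only characters are digits, parentheses and X, and that any missing leading innings were scoreless (the stripped leading zeros described in the question).

```r
# Hypothetical helper: parse one line score such as "12(10)1X".
parse_line_score <- function(x, innings = 9) {
  # A token is a parenthesised multi-digit score, a single digit, or an X.
  tokens <- regmatches(x, gregexpr("\\([0-9]+\\)|[0-9]|[Xx]", x))[[1]]
  # Drop the parentheses so "(10)" becomes "10".
  tokens <- gsub("[()]", "", tokens)
  # An X marks a half-inning that was not played; store it as NA.
  tokens[toupper(tokens) == "X"] <- NA
  runs <- as.integer(tokens)
  # Assumption: stripped leading zeros are restored up to `innings` innings.
  c(rep(0L, max(0, innings - length(runs))), runs)
}

parse_line_score("12(10)1X")
# 0 0 0 0 1 2 10 1 NA
```

Using NA for the X keeps the result numeric, so sums need na.rm = TRUE. The left-padding is only safe if every game really has 9 innings; with extra-inning games there is no way to tell how many zeros were lost.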
EDIT:
I removed the stipulation for 9 innings and show how to do this for the entire column (assuming that the scores you spoke of are indeed the 20th variable in the csv file). Extra processing is required when games have different numbers of innings.
do.call(rbind, ...) won’t work in that case. Find the longest game and append "X"s to the end of the shorter ones to make them all the same length? Maybe? I’m not sure, but I think this question has been answered at least.
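The padding idea could look roughly like the sketch below. Note that it pads with NA rather than a literal "X", so the result stays a numeric matrix. The three line scores are invented sample values, not Retrosheet data, and split_line_score is a hypothetical helper. It also assumes the column was read in as text so that leading zeros survive.

```r
# Made-up sample line scores; real values would come from the CSV column.
line_scores <- c("010000(10)0X", "000200001", "00000000011")

# Hypothetical helper: one line score -> integer vector, X stored as NA.
split_line_score <- function(x) {
  tokens <- regmatches(x, gregexpr("\\([0-9]+\\)|[0-9]|[Xx]", x))[[1]]
  tokens <- gsub("[()]", "", tokens)
  tokens[toupper(tokens) == "X"] <- NA
  as.integer(tokens)
}

per_game <- lapply(line_scores, split_line_score)
longest <- max(lengths(per_game))

# Extending a vector with length<- fills the new positions with NA.
padded <- lapply(per_game, function(runs) {
  length(runs) <- longest
  runs
})

# Every element now has the same length, so rbind gives one row per game.
scores <- do.call(rbind, padded)
colnames(scores) <- paste0("Inn", seq_len(longest))
```

Here the third game runs to 11 innings, so the two 9-inning games get NA in columns Inn10 and Inn11. If the leading zeros really are gone from the data, right-padding like this would put the runs in the wrong innings, so it is probably worth re-reading the file with that column as character first.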