I previously worked with SAS and then decided to switch to R for academic reasons.
My data (healthdemo) are health data containing diagnostic codes (ICD-10), and I want to separate these codes into different columns. This is part of str(healthdemo):
$ PATIENT_KEY : int 7391510 7404298 7390196 7381208 7401691 7381223 7383005 10188634 7384574 7398317 ...
$ ICDCODE : Factor w/ 1125 levels "","H00","H00.0",..: 654 56 654 654 665 48 90 679 654 654 ...
$ PATIENT_ID : int 39387 50244 38388 27346 49922 27901 27867 61527 33186 45309 ...
$ DATE_OF_BIRTH : Factor w/ 14801 levels "","01/01/1000",..: 7506 10250 52 73 94 6130 85 2710 95 100 ...
The ICDCODE column contains many diseases, from H00 to J99. First, I separated the letters from the numbers in ICDCODE:
healthdemo$icd_char = substr(healthdemo$ICDCODE,1,1)
healthdemo$icd_num = substr(healthdemo$ICDCODE,2,2)
Then I created the disease columns like this:
healthdemo$cvd = 0
healthdemo$ihd = 0
healthdemo$mi = 0
healthdemo$dys = 0
healthdemo$afib = 0
healthdemo$chf = 0
Now I want to apply logic similar to this SAS code (which I used to use):
if icd_char = 'I' and 01 <= icd_num < 52 then cvd = 1;
if icd_char = 'I' and 20 <= icd_num <= 25 then ihd = 1;
if icd_char = 'I' and 21 <= icd_num <= 22 then mi = 1;
if icd_char = 'I' and 46 <= icd_num <= 49 then dys = 1;
if icd_char = 'I' and icd_num = 48 then afib = 1;
This code flags each patient whose ICD letter and ICD number fall in the given range (e.g. cvd = 1), and so on.
I tried the following in R, but neither attempt worked for me:
healthdemo$cvd[healthdemo$icd_char == 'I' & 01 <= healthdemo$icd_num
& healthdemo$icd_num < 52 ] <- 1
and this
if (healthdemo$icd_char == "I" & 01 < = healthdemo$icd_num < 52 )
{healthdemo$cvd <- 1}
Could somebody help me, please?
I had a similar struggle when I transitioned from SAS to R for health-related research. My solution was to let go of the “if…then” approach as much as possible and take advantage of R’s native vectorized capabilities. Here are two approaches to your problem.
First, you can use indexing to find and replace elements. Here is some hospital discharge data of the kind you describe:
Say I want to identify every birth-related diagnosis in Manhattan. I first create a logical vector that holds TRUE or FALSE for each row according to my search criteria, then I index my data frame by that logical vector. In this case I am also restricting the columns (variables) I want returned:
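A minimal sketch of this indexing pattern, assuming a small made-up data frame. The name `discharges`, its columns (`id`, `county`, `dx`, `age`) and all values are hypothetical and only illustrate the idea:

```r
# Hypothetical discharge data; names and values are made up for illustration.
discharges <- data.frame(
  id     = 1:6,
  county = c("Manhattan", "Bronx", "Manhattan", "Queens", "Manhattan", "Bronx"),
  dx     = c("birth", "asthma", "fracture", "birth", "birth", "birth"),
  age    = c(0, 34, 71, 0, 0, 0),
  stringsAsFactors = FALSE
)

# Logical vector: TRUE where both search criteria hold, FALSE otherwise.
is_manhattan_birth <- discharges$county == "Manhattan" & discharges$dx == "birth"

# Index the rows by the logical vector and keep only the columns of interest.
manhattan_births <- discharges[is_manhattan_birth, c("id", "county", "dx")]

# The same logical vector can also be used to find and replace elements,
# here to build a 0/1 indicator column.
discharges$mn_birth <- 0
discharges$mn_birth[is_manhattan_birth] <- 1
```

The comparison operators work on whole columns at once, so no loop or row-by-row `if` is needed.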
The second, and perhaps more computationally elegant, approach is to use a function like “grep”. Say you’re interested in identifying all substance abuse diagnoses, e.g. alcohol abuse (291, 303, 305 and sub-codes); opioids, cannabis, amphetamines, hallucinogens, and cocaine (304 and related sub-codes); or non-specific substance abuse-related diagnoses (292). In SAS you would write out a long if-then statement (or a more efficient array) of some kind:
In R, you can instead write:
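A minimal sketch of the pattern-matching approach, assuming the diagnosis codes are stored as character strings. The vector `dx_codes` and its values are hypothetical, and the regular expression is one possible way to express the code groups listed above:

```r
# Hypothetical diagnosis codes stored as character strings.
dx_codes <- c("291.0", "303.90", "304.20", "305.1", "292.9", "250.00", "V30.00")

# "^" anchors the match at the start of the code, so a sub-code such as
# "304.20" matches its parent code "304".
substance_pattern <- "^(291|292|303|304|305)"

# grepl() returns one TRUE/FALSE per element; grep() returns the matching positions.
is_substance  <- grepl(substance_pattern, dx_codes)
substance_idx <- grep(substance_pattern, dx_codes)

# Convert the logical vector to a 0/1 indicator.
substance_flag <- as.integer(is_substance)
```

`grepl()` is usually the more convenient of the two for building indicator columns, because its result has the same length as the input.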
Tomas Aragon has written a wonderful introduction to R for epidemiologists that goes into these approaches in detail. (http://www.medepi.net/docs/ph251d_fall2012_epir-chap01-04.pdf)
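Applied to the ICD-10 data in the question, both approaches might look like the sketch below. The toy `healthdemo` here is a made-up stand-in with only an `ICDCODE` column. The sketch assumes the number of interest is the two digits after the letter, so it takes characters 2 to 3 (not 2 to 2) and converts them to numeric before comparing. R has no chained comparison such as `1 <= x < 52`, and `if` tests only a single value, so each range is written as a vectorized test instead:

```r
# Toy stand-in for the question's data; the codes are made up for illustration.
healthdemo <- data.frame(
  ICDCODE = factor(c("I21.0", "I48", "H00.0", "I50.9", "J45", "I25.1", ""))
)

# Work on the character values of the factor, not on its integer level codes.
code <- as.character(healthdemo$ICDCODE)
healthdemo$icd_char <- substr(code, 1, 1)
healthdemo$icd_num  <- as.numeric(substr(code, 2, 3))  # NA for empty codes

is_i <- healthdemo$icd_char == "I"
n    <- healthdemo$icd_num

# %in% returns FALSE (not NA) for missing values, so empty codes are flagged 0.
# as.integer() turns TRUE/FALSE into 1/0.
healthdemo$cvd  <- as.integer(is_i & n %in% 1:51)   # SAS: 01 <= icd_num < 52
healthdemo$ihd  <- as.integer(is_i & n %in% 20:25)
healthdemo$mi   <- as.integer(is_i & n %in% 21:22)
healthdemo$dys  <- as.integer(is_i & n %in% 46:49)
healthdemo$afib <- as.integer(is_i & n %in% 48)

# Pattern-matching alternative for the same ihd flag (I20 to I25).
ihd_by_pattern <- as.integer(grepl("^I2[0-5]", code))
```

Because each flag is computed for the whole column in one step, there is no need to initialise the columns to 0 first.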