I have an assignment in my C programming class to write a program that computes the correlation coefficient of two sets of real numbers. I’ve been given the equations, and the assignment referenced Wikipedia, so I double-checked the equations there. Here is a link to the equation, which seems to be pretty standard from my research:

I’ve written the program, but when I ran it I got results greater than 1, which I knew wasn’t correct. I looked over my code several times but couldn’t find anything out of place, so I tried dividing by n at the end instead of n-1. This gave me values within the -1 to 1 range that I expected, so I tested it against data values that I found online as well as a correlation coefficient calculator ( http://easycalculation.com/statistics/correlation.php ), and I’m now getting correct results for all of the numbers I input. I can’t figure out why this is, so I thought I might be able to get a little help with it here. Here is my code for the program. If anything else stands out that I have done wrong, I would love to hear some advice, but mostly I’m trying to figure out why I’m getting the right results with what appears to be the wrong equation.
/* It will then read in the values for both arrays (x and y), and then compute
   the correlation coefficient between the 2 sets of numbers. */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main(void) {
    int n; /* value to determine array length */
    /* declare variables to hold results for each equation for x and y
       initialize all to zero to prepare for summation */
    float r = 0.0, xbar = 0.0, ybar = 0.0, sx = 0.0, sy = 0.0;

    /*get number n input from user */
    printf("Please enter a number n: ");
    scanf("%d", &n);
    if( n < 1) {
        printf("n must be a positive number.\nPlease enter a new value: ");
        scanf("%d", &n);
        if( n < 1) {
            printf("Invalid input, exiting...\n");
            return 0;
        }
    }

    /*initialize arrays x and y with length of n */
    float x[n], y[n];

    /*use for loop to read in values of x*/
    int i;
    for(i = 0; i < n; ++i) {
        printf("Please enter a number for x: ");
        scanf("%f", &x[i]);
    }

    /*use for loop to read in values of y*/
    for(i = 0; i < n; ++i) {
        printf("Please enter a number for y: ");
        scanf("%f", &y[i]);
    }

    /*compute xbar */
    for(i = 0; i < n; ++i) {
        xbar += x[i];
    }
    xbar /= n;

    /*compute ybar*/
    for(i = 0; i < n; ++i) {
        ybar += y[i];
    }
    ybar /= n;

    /* compute standard deviation of x*/
    for(i = 0; i < n; ++i) {
        sx += (x[i] - xbar) * (x[i] - xbar);
    }
    sx = sqrt((sx / n));

    /* compute standard deviation of y */
    for(i = 0; i < n; ++i) {
        sy += (y[i] - ybar) * (y[i] - ybar);
    }
    sy = sqrt((sy / n));

    /*compute r, the correlation coefficient between the two arrays */
    for( i = 0; i < n; ++i ) {
        r += (((x[i] - xbar)/sx) * ((y[i] - ybar)/sy));
    }
    r /= (n); /* originally divided by n-1, but gave incorrect results
                 dividing by n instead produces the desired output */

    /* print results */
    printf("The correlation coefficient of the entered lists is: %6.4f\n", r);
    return 0;
}
You are calculating your standard deviation as:

    sx = sqrt((sx / n));

and similarly for sy. The equation you have used has n-1 in the denominator for this calculation (reason: there are n-1 degrees of freedom, so you should divide by n-1). So your sx and sy are actually sx' and sy', where sx' = sx*sqrt(n-1)/sqrt(n) and sy' = sy*sqrt(n-1)/sqrt(n). So sx' * sy' = sx * sy * (n-1)/n. Since sx*sy is in the denominator, your calculation is off by a factor of n/(n-1). Dividing this by n gives n/(n-1) * 1/n = 1/(n-1), which is the factor you need outside of the summation.

So if you changed your code to calculate the sample standard deviation (divide by n-1), you can finally divide by n-1 and your code will get the result you expect. For efficiency, since the division is going to cancel out anyway, you can save some computation and increase your accuracy by simply not dividing by n-1 in the calculations of sx and sy, and then omitting the final division as well:

    sx = sqrt((sx / n));
    sy = sqrt((sy / n));

become

    sx = sqrt(sx);
    sy = sqrt(sy);

and:

    r /= (n);

goes away altogether.
Edit: Since you asked…
- Don't use float unless you have to. double gives you much better precision.
- stdout is line buffered on most systems, so your prompt may not appear before your call to scanf(). To make sure your prompt shows, do fflush(stdout); after your printf() call.
- It is hard to use scanf() safely. For reading numbers, scanf() has undefined behavior when someone enters a number that’s not in the range of the data type. It also copes badly with cases such as someone entering a non-integer in response to your prompt. For your case, you can make n passable as a command-line parameter, and then use strtol() on argv[1] to parse the number. If you want to read from stdin anyway, use an fgets() + sscanf() combination, or fgets() + strtol().
- You can calculate xbar and ybar in the same loop. Even better, you can write a function double avg(double *data, int n) that calculates the average of n values, and then do: xbar = avg(x, n);, ybar = avg(y, n);.
- Similarly, you can write a function double std(double *data, int n), and then use that to calculate sx and sy.
- sqrt((sx / n)); is better written as sqrt(sx / n);. r /= (n); doesn’t need the parentheses either.