Are there any Linear Regression Function in SQL Server 2005/2008, similar to the the Linear Regression functions in Oracle ?
Are there any Linear Regression Function in SQL Server 2005/2008, similar to the the
Share
Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.
Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.
Lost your password? Please enter your email address. You will receive a link and will create a new password via email.
Please briefly explain why you feel this question should be reported.
Please briefly explain why you feel this answer should be reported.
Please briefly explain why you feel this user should be reported.
To the best of my knowledge, there is none. Writing one is pretty straightforward, though. The following gives you the constant alpha and slope beta for y = Alpha + Beta * x + epsilon:
Here
GroupIDis used to show how to group by some value in your source data table. If you just want the statistics across all data in the table (not specific sub-groups), you can drop it and the joins. I have used theWITHstatement for sake of clarity. As an alternative, you can use sub-queries instead. Please be mindful of the precision of the data type used in your tables as the numerical stability can deteriorate quickly if the precision is not high enough relative to your data.EDIT: (in answer to Peter’s question for additional statistics like R2 in the comments)
You can easily calculate additional statistics using the same technique. Here is a version with R2, correlation, and sample covariance:
EDIT 2 improves numerical stability by standardizing data (instead of only centering) and by replacing
STDEVbecause of numerical stability issues. To me, the current implementation seems to be the best trade-off between stability and complexity. I could improve stability by replacing my standard deviation with a numerically stable online algorithm, but this would complicate the implementation substantantially (and slow it down). Similarly, implementations using e.g. Kahan(-Babuška-Neumaier) compensations for theSUMandAVGseem to perform modestly better in limited tests, but make the query much more complex. And as long as I do not know how T-SQL implementsSUMandAVG(e.g. it might already be using pairwise summation), I cannot guarantee that such modifications always improve accuracy.