Why are “lm” and “biglm” producing different estimates? Consider the code below:
a <- data.frame(y = rnorm(1000000), x1 = rnorm(1000000), x2 = rnorm(1000000))
m1 <- lm(y ~ x1 + x2, data = a); summary(m1)
library(biglm)
m2 <- biglm(y ~ x1 + x2, data = a); summary(m2)
It makes no difference whether biglm processes the data in chunks or not: the final estimates differ from those produced by lm.
Posting as answer simply due to length:
So, yes, the coefficients are slightly different. For example, the intercepts differ by 0.2%. Whether a difference of that size has any effect on the quality of your fitted line depends rather a lot on what you intend to do with your fit. Integration? Guaranteed no problem. Extrapolation? Always risky, but not because the slopes differ by 0.5%.
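To put numbers on "slightly different", one option is to pull the coefficients out with coef() and compare them directly rather than reading them off the printed summaries. The snippet below is only a sketch: the seed, the smaller sample size n, and the names b_lm, b_biglm, abs_diff and rel_diff are my own choices, not anything from the question. Bear in mind that the true coefficients in this simulated data are all zero, so a relative difference is taken against an estimate that is itself close to zero and can look large even when the absolute gap is tiny.

```r
library(biglm)  # assumes the biglm package is installed

set.seed(1)     # arbitrary seed, only so the run is repeatable
n <- 100000     # assumption: smaller than the question's 1e6 to keep it quick
a <- data.frame(y = rnorm(n), x1 = rnorm(n), x2 = rnorm(n))

# Fit the same model with both routines and extract the coefficient vectors
b_lm    <- coef(lm(y ~ x1 + x2, data = a))
b_biglm <- coef(biglm(y ~ x1 + x2, data = a))

abs_diff <- abs(b_lm - b_biglm)
rel_diff <- abs_diff / abs(b_lm)   # relative to the lm estimate

# Print with more digits than the default so small gaps are visible
print(cbind(lm = b_lm, biglm = b_biglm, abs_diff, rel_diff), digits = 10)
```

Looking at abs_diff next to rel_diff should make it clearer whether a percentage figure reflects a real disagreement or just a denominator near zero.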
I would strongly recommend that, at the very least, you run some test cases which fit, say,
f(x) = g(x) + runif(N); h(x) = g(x) + runif(N)   # each runif call returns a different set of RVs
and see if lm and biglm return significantly different coefficients from the original g(x) values.
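A rough sketch of that kind of test case follows. Everything specific in it is an assumption for illustration: the form of g(x) and its coefficients (1, 2, -3), the sample size N, the seed, and the helper fit_both() are all made up here. Note that runif() has mean 0.5, so the fitted intercept should land near 1.5 rather than 1.

```r
library(biglm)  # assumes the biglm package is installed

set.seed(42)                 # arbitrary seed, for repeatability only
N  <- 10000
x1 <- rnorm(N)
x2 <- rnorm(N)
g  <- 1 + 2 * x1 - 3 * x2    # hypothetical "true" g(x), chosen for illustration

# Two noisy realisations of the same underlying function
d_f <- data.frame(y = g + runif(N), x1 = x1, x2 = x2)
d_h <- data.frame(y = g + runif(N), x1 = x1, x2 = x2)

# Hypothetical helper: fit one data set with both routines, side by side
fit_both <- function(d) {
  cbind(lm    = coef(lm(y ~ x1 + x2, data = d)),
        biglm = coef(biglm(y ~ x1 + x2, data = d)))
}

res_f <- fit_both(d_f)
res_h <- fit_both(d_h)

# Expected values: intercept shifted by the mean of runif(), slopes unchanged
truth <- c(1.5, 2, -3)

print(cbind(truth, res_f), digits = 10)
print(cbind(truth, res_h), digits = 10)

# Largest gap between the two routines on identical data
gap_f <- max(abs(res_f[, "lm"] - res_f[, "biglm"]))
gap_h <- max(abs(res_h[, "lm"] - res_h[, "biglm"]))
print(c(gap_f = gap_f, gap_h = gap_h))
```

The useful comparison is between two sizes of gap: how far lm and biglm are from each other on the same data (gap_f, gap_h), versus how far either one moves between the f and h realisations. If the first is much smaller than the second, the choice of routine matters less than the noise in the data.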