R-squared can be estimated in two main ways:

- The ratio of the variance of the predictions, $\hat{Y}$, to that of the response variable, $Y$:

$$R^2 = \frac{\mathrm{var}(\hat{Y})}{\mathrm{var}(Y)} \qquad (1)$$

- The difference between unity and the ratio of the variance of the residual error to that of the response variable:

$$R^2 = 1 - \frac{\mathrm{var}(\varepsilon)}{\mathrm{var}(Y)} \qquad (2)$$

where $\varepsilon = Y - \hat{Y}$.
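For concreteness, here is a minimal sketch of the two computations in Python (NumPy assumed; the function names are my own):

```python
import numpy as np

def r2_method1(y, y_hat):
    # Method 1: ratio of the variance of the predictions to that of the response.
    return np.var(y_hat) / np.var(y)

def r2_method2(y, y_hat):
    # Method 2: one minus the ratio of the residual variance to the response variance.
    return 1 - np.var(y - y_hat) / np.var(y)
```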
For ease of communication, I'll refer to (1) and (2) as methods 1 and 2, respectively.
These two methods yield identical results and are effective measures for validating linear models, but only under two conditions:
- In-sample validation: when they are computed with the same data on which the models were fitted; and
- When the model parameters directly estimated from data are used in the R-squared computations.
However, in the real world, at least one of the above conditions is almost always violated. It's recommended practice for scientists to validate models on unseen data (i.e., out-of-sample validation), and most model validation in the machine learning era involves computing goodness-of-fit metrics on unseen data using parameters from competing models. It's also common for scientists to select away from the model parameters estimated from the data. For instance, in insurance, an actuary can adjust any subset of the estimated model factors for reasons related to marketing, underwriting, regulation, or anything else he or she deems relevant.
When at least one of the above two conditions is violated, the two R-squared methods, contrary to what has been discussed in statistical textbooks, yield different results, some of which are too consequential to ignore. This paper discusses two of them.
Note 1
The choice of method has two critical consequences for the scientist's assessment of model fit.
The first is that method 1 can produce inflated $R^2$ values, including values greater than 1. To see why, consider the simple linear model

$$Y = \beta_0 + \beta_1 x + \varepsilon \qquad (3)$$

whose predictions are $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 x$, so that $\mathrm{var}(\hat{Y}) = \hat{\beta}_1^2\,\mathrm{var}(x)$. Assuming further that the modeler replaces the estimated slope $\hat{\beta}_1$ with an adjusted value $k\hat{\beta}_1$, the variance of the predictions becomes $k^2\hat{\beta}_1^2\,\mathrm{var}(x)$, which grows without bound in $k$. Since the denominator of method 1, $\mathrm{var}(Y)$, is constant for any given dependent variable, it's troubling to see that the numerator, $\mathrm{var}(\hat{Y})$, is under the modeler's control: if the adjustment is large enough that $k^2\hat{\beta}_1^2\,\mathrm{var}(x) > \mathrm{var}(Y)$, then

$$R^2 = \frac{\mathrm{var}(\hat{Y})}{\mathrm{var}(Y)} > 1,$$

a value with no meaning as a proportion of variance explained.
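Here is a quick numerical sketch of this failure mode (the data-generating values and the adjustment factor k = 3 are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3000
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + rng.normal(size=n)

# OLS fit of model (3); polyfit returns (slope, intercept).
b1, b0 = np.polyfit(x, y, 1)

k = 3.0                     # hypothetical upward adjustment of the slope
y_hat = b0 + k * b1 * x     # predictions from the adjusted model

print(np.var(y_hat) / np.var(y))           # method 1: exceeds 1
print(1 - np.var(y - y_hat) / np.var(y))   # method 2: drops sharply (here, below 0)
```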
Another way (1) can yield false conclusions is by the addition of extraneous covariates—explanatory variables that have no relationship with the dependent variable. Suppose one such extraneous variable, z, is added to the model specified in (3):

$$Y = \beta_0 + \beta_1 x + \beta_2 z + \varepsilon$$

One can see that, even though the added variable has no statistical relationship with $Y$, its estimated coefficient $\hat{\beta}_2$ will not be exactly zero in any finite sample, so the numerator of method 1 picks up an extra term $\hat{\beta}_2^2\,\mathrm{var}(z)$ and the computed $R^2$ rises.
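A sketch of this second failure mode out of sample, with many extraneous covariates so the inflation is visible (the sample sizes and covariate counts below are illustrative choices of mine):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p_noise = 3000, 300
x = rng.normal(size=2 * n)
y = 2.0 + 1.5 * x + rng.normal(size=2 * n)
Z = rng.normal(size=(2 * n, p_noise))   # extraneous covariates, unrelated to y

X_small = np.column_stack([np.ones(2 * n), x])
X_big = np.column_stack([X_small, Z])
tr, te = slice(0, n), slice(n, 2 * n)   # fit on the first half, validate on the second

for name, X in [("x only", X_small), ("x + noise", X_big)]:
    beta, *_ = np.linalg.lstsq(X[tr], y[tr], rcond=None)
    y_hat = X[te] @ beta
    print(name,
          np.var(y_hat) / np.var(y[te]),              # method 1: rises with the noise columns
          1 - np.var(y[te] - y_hat) / np.var(y[te]))  # method 2: falls
```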
These two problems imply that if the modeler compares different models using method 1, there is a concerning possibility that he or she may choose a model with a higher $R^2$ that actually fits the data worse.
Method 2 is free from the two problems discussed above. Because its numerator is the residual variance, $\mathrm{var}(\varepsilon)$, anything that pushes the predictions away from the observed response, whether an adjusted parameter or an extraneous covariate, increases $\mathrm{var}(\varepsilon)$ and lowers the score; and since $\mathrm{var}(\varepsilon) \geq 0$, method 2 can never exceed 1.
Note 2
The second note is that, even though methods 1 and 2 seek to measure the same thing, the two measures have different variances. Hence, using the variance of one method to make inference about the other will yield false conclusions. To show this, consider the following model:

$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$$

where $x_1, x_2 \overset{iid}{\sim} \mathrm{Normal}(0,1)$, and $\varepsilon$ is an independent normal error. If the above model is simulated 200 times (with each experiment having a sample size of 3000), and the two R-squared measures are computed for each replication, their sampling distributions can be compared directly.
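A sketch of that experiment (the coefficients, the error scale, and the use of the data-generating parameters themselves for prediction, which violates condition 2, are my own choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
n, n_sims = 3000, 200
r2_m1, r2_m2 = [], []

for _ in range(n_sims):
    x1, x2 = rng.normal(size=n), rng.normal(size=n)
    y_hat = 1.0 + x1 + x2                      # predictions from the known coefficients
    y = y_hat + rng.normal(scale=2.0, size=n)  # population R-squared = 2 / (2 + 4) = 1/3
    r2_m1.append(np.var(y_hat) / np.var(y))
    r2_m2.append(1 - np.var(y - y_hat) / np.var(y))

print(np.mean(r2_m1), np.var(r2_m1))   # the two means agree...
print(np.mean(r2_m2), np.var(r2_m2))   # ...but method 2's variance is about twice method 1's
```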
As can be evidenced from figures 1 and 2, though the two methods yield statistically equal means, the variance of method 2 is twice that of method 1. The reader can infer from the variance formulas below that the ratio of the two variances is $(1-\rho)/\rho$, where $\rho$ is the population R-squared: method 2 is the more variable measure whenever $\rho$ is below one half.
Conclusion
My two notes are thus these: use method 2 to compute R-squared whenever either of the two conditions above may be violated, and never use the variance of one method to make inferences about the other, since the two measures have different sampling distributions.
Appendix
Variance of Method 1
The variance of the sample $R^2$ from method 1, $R^2_1 = s^2_{\hat{Y}}/s^2_Y$, can be approximated with the delta method, assuming the predictions and the errors are independent and normally distributed and using the large-sample variance of a normal sample variance, $\mathrm{var}(s^2) \approx 2\sigma^4/n$ (Weisstein):

$$\mathrm{var}\left(R^2_1\right) \approx \frac{4\rho^2(1-\rho)}{n}$$

Where:

- $\rho = \sigma^2_{\hat{Y}}/\sigma^2_Y$ is the population R-squared, and
- $n$ is the sample size.

In the same vein, the variance of the sample $R^2$ from method 2, $R^2_2 = 1 - s^2_{\varepsilon}/s^2_Y$, is

$$\mathrm{var}\left(R^2_2\right) \approx \frac{4\rho(1-\rho)^2}{n}$$

so the ratio of the two variances is $(1-\rho)/\rho$.
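As a quick numeric check of these approximations at the settings of the Note 2 simulation (population R-squared of 1/3, n = 3000):

```python
# Delta-method approximations evaluated at rho = 1/3, n = 3000.
rho, n = 1 / 3, 3000
print(4 * rho**2 * (1 - rho) / n)    # approx. variance of method 1: ~9.9e-05
print(4 * rho * (1 - rho)**2 / n)    # approx. variance of method 2: ~2.0e-04, twice as large
```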
References
- Greene, W. H. (2002). Econometric Analysis, 5th ed. Prentice Hall, Upper Saddle River. 802 pp.
- Kvalseth, T. O. (1985). "Cautionary Note About R-Squared." The American Statistician, 39(4), pp. 279-285.
- Weisstein, Eric W. "Sample Variance Distribution." From MathWorld--A Wolfram Web Resource, http://mathworld.wolfram.com/SampleVarianceDistribution.html