Gyasi's Blog
Two Notes About the Two Faces of R-Squared

by Gyasi Dapaa
November 25, 2019
in Data Science
Reading Time: 9 min read

R-squared can be estimated in two main ways:

  1. The ratio of the variance of the predictions, \widehat{Y}, to that of the response variable, Y:
     R^{2}=\frac{var(\widehat{Y})}{var(Y)} --- (1)
  2. The difference between unity and the ratio of the variance of the residual error to that of the response variable:
     R^{2}=1-\frac{var(\varepsilon)}{var(Y)} --- (2)
     where \varepsilon = Y - \widehat{Y}.

For ease of communication, I'll refer to (1) and (2) as methods 1 and 2, respectively.

These two methods yield identical results, and both are effective linear-model validation measures, but only under two conditions:

  1. In-sample validation: when they are computed on the same data on which the model was fitted; and
  2. When the model parameters estimated directly from the data are used in the R-squared computations.
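As a concrete illustration of the in-sample agreement, here is a minimal sketch (the data-generating model and all numbers are hypothetical, chosen only for the demo): fitting ordinary least squares and computing R^{2} both ways on the training data yields the same value.

```python
# Sketch: the two R-squared estimators agree in-sample with fitted parameters.
# The data-generating process below is hypothetical, for illustration only.
import numpy as np

rng = np.random.default_rng(0)
n = 5000
x1, x2 = rng.normal(size=(2, n))
y = 0.5 + 0.6 * x1 - 0.8 * x2 + rng.normal(scale=np.sqrt(3), size=n)

# Fit OLS with an intercept.
X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta

r2_method1 = np.var(y_hat) / np.var(y)          # (1): var(predictions) / var(Y)
r2_method2 = 1 - np.var(y - y_hat) / np.var(y)  # (2): 1 - var(residuals) / var(Y)
```

With the intercept included and the fitted parameters used, the sample variance of Y decomposes exactly into that of \widehat{Y} plus that of the residuals, so the two values coincide to machine precision; the rest of this note is about what happens when that decomposition breaks.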

However, in the real world, at least one of these conditions is almost always violated. It's recommended practice for scientists to validate models on unseen data (i.e., out-of-sample validation), and most model validation in the machine learning era involves computing goodness-of-fit metrics on unseen data using parameters from competing models. It's also common for scientists to deviate from the model parameters estimated from the data. For instance, in insurance, an actuary may adjust any subset of the estimated model factors for reasons related to marketing, underwriting, regulation, or anything else he or she deems relevant.

When at least one of these two conditions is violated, the two R-squared methods, contrary to what has been discussed in statistical textbooks, yield different results, some of which are too consequential to ignore. This paper discusses two of them.

Note 1

The choice of method has two critical consequences for the scientist's assessment of model fit.

The first is that method 1 can produce inflated R^{2} values and hence overconfidence in the model's efficacy. To see this, assume:

\widehat{Y}=\beta_{0}+\beta_{1}x_{1}+\beta_{2}x_{2} --- (3)

Assuming further that (\beta_{1},\beta_{2}) are the true parameter values, and for simplicity sake, that x_{1} and x_{2} are independent covariates, then:

var(\widehat{Y}) = \beta_{1}^{2}var(x_{1}) + \beta_{2}^{2}var(x_{2}) --- (4)

Since the denominator of method 1, var(Y), is constant for any given dependent variable, it's troubling that the numerator, var(\widehat{Y}), and hence R^{2}, can be increased merely by selecting parameter estimates that differ from the true parameters (\beta_{1}, \beta_{2}) but are larger in absolute magnitude. In particular, if we predict Y with larger but inaccurate parameters such that (|\alpha_{1}|,|\alpha_{2}|)>(|\beta_{1}|,|\beta_{2}|):

\widehat{Y_{2}}=\alpha_{0}+\alpha_{1}x_{1}+\alpha_{2}x_{2} --- (5)

then var(\widehat{Y_{2}}) > var(\widehat{Y}), and consequently the R^{2} associated with the inaccurate model in (5) exceeds that of (3). Since this is the first time this problem has been introduced, I'll call it the Parameter-Inflated R^{2} Problem.
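To make the Parameter-Inflated R^{2} Problem concrete, this sketch (all parameter values hypothetical) scores a true model in the form of (3) and an inflated model in the form of (5) by both methods:

```python
# Sketch of the Parameter-Inflated R^2 Problem: inflating the coefficients
# raises method-1 R^2 even though the predictions get strictly worse.
# All parameter values below are hypothetical.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
x1, x2 = rng.normal(size=(2, n))
y = 0.5 + 0.6 * x1 - 0.8 * x2 + rng.normal(scale=np.sqrt(3), size=n)

def r2_both(y, y_hat):
    """Return (method-1, method-2) R^2 for fixed, externally chosen predictions."""
    return np.var(y_hat) / np.var(y), 1 - np.var(y - y_hat) / np.var(y)

m1_true, m2_true = r2_both(y, 0.5 + 0.6 * x1 - 0.8 * x2)   # true parameters
m1_infl, m2_infl = r2_both(y, 0.5 + 1.2 * x1 - 1.6 * x2)   # |alpha_k| > |beta_k|
# Method 1 rewards the inflated model; method 2 penalizes it.
```

Method 1 assigns the inaccurate model the higher score, while method 2 duly docks it for the extra residual variance.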

Another way (1) can yield false conclusions is through the addition of extraneous covariates: explanatory variables that have no relationship with the dependent variable. Suppose one such extraneous variable, z, is added to the model specified in (3):

\widehat{Y_{3}}=\beta_{0}+\beta_{1}x_{1}+\beta_{2}x_{2}+\beta_{3}z --- (6)

One can see that, even though the added variable has no statistical relationship with Y, R^{2} calculated via method 1 will nevertheless increase whenever \beta_{3} is non-zero. Additionally, if \beta_{3} and the variance of z are large enough, the complexity penalty baked into adjusted R^{2} may not be enough to offset the artificial boost that z provides; when this happens, both R^{2} and adjusted R^{2} increase. I call this the Extraneous Variable Boost Problem. It is broader than the classical over-fitting problem, as \beta_{3} need not have been fitted from data. And even though the Extraneous Variable Boost Problem poses a risk only to method 1 (as we'll see below), neither method is immune to over-fitting.
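The Extraneous Variable Boost Problem can be seen the same way. In this hypothetical sketch, z is pure noise, yet attaching a non-zero \beta_{3} to it inflates method-1 R^{2} while method-2 R^{2} falls:

```python
# Sketch of the Extraneous Variable Boost Problem: adding a noise covariate z
# with a non-zero coefficient boosts method-1 R^2 and hurts method-2 R^2.
# The choice beta_3 = 2 and all other numbers are hypothetical.
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
x1, x2, z = rng.normal(size=(3, n))          # z has no relationship with y
y = 0.5 + 0.6 * x1 - 0.8 * x2 + rng.normal(scale=np.sqrt(3), size=n)

y_hat_base = 0.5 + 0.6 * x1 - 0.8 * x2       # a model in the form of (3)
y_hat_extr = y_hat_base + 2.0 * z            # a model in the form of (6), beta_3 = 2

m1_base = np.var(y_hat_base) / np.var(y)
m1_extr = np.var(y_hat_extr) / np.var(y)             # artificially boosted
m2_base = 1 - np.var(y - y_hat_base) / np.var(y)
m2_extr = 1 - np.var(y - y_hat_extr) / np.var(y)     # reduced by ~beta_3^2 var(z)/var(y)
```

With these numbers, method-1 R^{2} for the extraneous model even exceeds 1, which no sensible goodness-of-fit measure should allow.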

These two problems imply that if the modeler compares different models using method 1, there is a concerning possibility that he or she may choose a model with a higher R^{2} but a poorer fit. Hence, using method 1 to choose the best model is a dangerous validation approach, especially in a modeling era when the analyst is free to deviate from the regression estimates. It also means that in contests where models are judged by method-1 R^{2}, contestants can cheat by reporting larger parameters and extraneous variables.

Method 2 is free from the two problems discussed above. Because R^{2} is calculated as the difference between 1 and the ratio of the variance of the residual error to that of Y, any inaccuracy or noise in the predictive model augments the subtrahend (i.e., the ratio) and accordingly reduces the difference (i.e., R^{2}). In the case of (5), the variance of the residual error will duly increase by \sum_{k=1}^{2}(\alpha_k-\beta_k)^{2}\,var(x_k); and in (6), it will increase by \beta_3^{2}\,var(z). Before this paper, the two methods had been regarded as equivalent, and the superiority of method 2 had been missed.

Note 2

The second note is that, even though methods 1 and 2 seek to measure the same thing, the two measures have different variances. Hence, using the variance of one method to make inferences about the other will yield false conclusions. To show this, consider the following model:

Y=0.5+0.6x_1-0.8x_2+\varepsilon --- (7)

where x_1, x_2 \sim iid Normal(0,1), and \varepsilon \sim Normal(0,3), independent of x_1 and x_2.

If the above model is simulated 200 times (each experiment with a sample size of 3,000), and the two R^{2} values are computed for each experiment, we observe the following respective distributions:

[Figures 1 and 2: distributions of R^{2} across the 200 experiments under methods 1 and 2]

As can be seen from figures 1 and 2, though the two methods yield statistically equal means, the variance of method 2 is twice that of method 1. The reader can infer from the variance formulas below that, when R^{2} is less (greater) than 0.5, method 1 will have the smaller (larger) variance. Despite the popularity of these two methods, this disparity in variance has heretofore not been highlighted. The ramification is that, in making inferences about R^{2}, the analyst must use the correct variance for the method he or she chooses. The interested reader should see the appendix for the computation of the variance of R^{2} under both methods.
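Since the original figures are not reproduced here, the following is a sketch of one way to run that simulation. Two assumptions are mine: Normal(0,3) is read as variance 3, and each sample is scored with the known true parameters of (7), one setting in which the two methods can disagree sample to sample.

```python
# Sketch of the simulation behind figures 1 and 2: 200 experiments of size 3000,
# each scored with the true parameters of model (7). Assumptions noted above.
import numpy as np

rng = np.random.default_rng(3)
m1, m2 = [], []
for _ in range(200):
    n = 3000
    x1, x2 = rng.normal(size=(2, n))
    eps = rng.normal(scale=np.sqrt(3), size=n)   # Normal(0,3) read as variance 3
    y_hat = 0.5 + 0.6 * x1 - 0.8 * x2            # known true parameters
    y = y_hat + eps
    m1.append(np.var(y_hat) / np.var(y))
    m2.append(1 - np.var(eps) / np.var(y))

mean_gap = abs(np.mean(m1) - np.mean(m2))   # means statistically indistinguishable
var_ratio = np.var(m2) / np.var(m1)         # method 2 is the noisier estimator here
```

The means agree while method 2 shows the visibly larger spread, consistent with R^{2} being below 0.5 in this model.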

Conclusion

My two notes are thus these: use method 2 to compute R^{2}, so as to avoid the Parameter-Inflated R^{2} and Extraneous Variable Boost Problems; and make sure to use the correct variance for your chosen method when making inferences about R^{2}.

Appendix

Variance of Method 1

The variance of the sample r^2 calculated by method 1 (i.e., r^2=\frac{s_{\widehat{Y}}^{2}}{s_{Y}^{2}}) can be derived using the delta method as follows:

Var\left ( \frac{s_{\widehat{Y}}^{2}}{s_{Y}^{2}} \right )\cong \frac{\sigma _{\widehat{Y}}^{4}}{\sigma _{Y}^{4}} \left ( \frac{\mu _{4,\widehat{Y}}}{N\sigma _{\widehat{Y}}^{4}} +\frac{\mu _{4,Y}}{N\sigma _{Y}^{4}}-\frac{2}{N}-\frac{2Cov(s_{\widehat{Y}}^{2},s_{Y}^{2})}{\sigma _{\widehat{Y}}^{2}\sigma_{Y}^{2}}\right )---(8)

Where: 

\sigma_{Y}=\sqrt{\frac{1}{N}\sum_{i=1}^{N}(Y_{i}-\overline{Y})^{2}},\sigma_{\widehat{Y}}=\sqrt{\frac{1}{N}\sum_{i=1}^{N}(\widehat{Y}_{i}-\overline{Y})^{2}}

\mu_{4,Y}=\frac{1}{N}\sum_{i=1}^{N}(Y_{i}-\overline{Y})^{4},\mu_{4,\widehat{Y}}=\frac{1}{N}\sum_{i=1}^{N}(\widehat{Y}_{i}-\overline{Y})^{4}

Cov(s_{\widehat{Y}}^{2},s_{Y}^{2})=\left ( \frac{1}{N^2} \sum_{i=1}^{N}E[(\widehat{Y_i}-\overline{Y})^2(Y_i-\overline{Y})^2]\right )-\frac{\sigma_{\widehat{Y}}^{2}\sigma_{Y}^{2}}{N}---(9)

In the same vein, the variance of the sample r^2 measured by method 2 (i.e., r^2=1-\frac{s_{\varepsilon}^{2}}{s_{Y}^{2}}) can be derived by substituting \varepsilon for \widehat{Y} in the above formulas. One can thus see that the variances of the two measures will differ except when R^2=0.5 (i.e., var(\widehat{Y})=var(\varepsilon)). For a full derivation of (8), see my other paper, G2R2: A True R-Squared Measure for Linear and Non-Linear Models.
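As a numeric companion to the appendix, here is a plug-in evaluation of (8) and (9), using sample central moments as estimates; the data-generating setup is illustrative, and the function's applicability to method 2 rests on Var(1 - s_{\varepsilon}^2/s_Y^2) = Var(s_{\varepsilon}^2/s_Y^2).

```python
# Plug-in evaluation of the delta-method variance (8), with covariance (9).
import numpy as np

def var_r2_ratio(u, y):
    """Approximate Var(s_u^2 / s_y^2) per (8)-(9), using sample moments.
    u = predictions gives method 1; u = residuals gives method 2."""
    N = len(y)
    du, dy = u - u.mean(), y - y.mean()          # sample central deviations
    s2_u, s2_y = np.mean(du**2), np.mean(dy**2)
    mu4_u, mu4_y = np.mean(du**4), np.mean(dy**4)
    cov = np.mean(du**2 * dy**2) / N - s2_u * s2_y / N   # plug-in version of (9)
    return (s2_u**2 / s2_y**2) * (mu4_u / (N * s2_u**2)
                                  + mu4_y / (N * s2_y**2)
                                  - 2 / N
                                  - 2 * cov / (s2_u * s2_y))

# Illustrative data in the spirit of model (7), Normal(0,3) read as variance 3.
rng = np.random.default_rng(4)
n = 3000
x1, x2 = rng.normal(size=(2, n))
y_hat = 0.5 + 0.6 * x1 - 0.8 * x2
eps = rng.normal(scale=np.sqrt(3), size=n)
y = y_hat + eps

v1 = var_r2_ratio(y_hat, y)   # variance of the method-1 r^2
v2 = var_r2_ratio(eps, y)     # variance of the method-2 r^2
# Here R^2 is about 0.25 < 0.5, so v1 < v2, matching Note 2.
```

Under these assumptions the computed v1 comes out smaller than v2, reproducing the ordering discussed in Note 2 for R^{2} below 0.5.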


 
