R squared in Machine Learning

The \(R^2\) metric, often called the coefficient of determination, is one of the most widely used measures for evaluating regression models. At its core, \(R^2\) tells us how well the model explains the variability of the dependent variable relative to a very simple baseline: the mean of that variable. To understand it deeply, let us start from the ground up.

Suppose we have a dataset with a target variable \(y\), and our regression model produces predictions \(\hat{y}\). If we did not have any model at all, the best we could do to “predict” \(y\) would be to use its mean, \(\bar{y}\), for every data point. This simple strategy captures no nuance of the data, but it provides a baseline for comparison. The total variability in the data, called the total sum of squares (\(SS_{tot}\)), measures how much the actual values deviate from the mean. Mathematically, it is written as

\[SS_{tot} = \sum_i (y_i - \bar{y})^2.\]

Now, when we bring in a model, it produces predictions \(\hat{y}_i\). Naturally, those predictions won’t always be perfect, and the deviations of the predictions from the true values are called residuals. The sum of squared residuals (\(SS_{res}\)) captures how much unexplained error remains after using the model:

\[SS_{res} = \sum_i (y_i - \hat{y}_i)^2.\]

The magic of \(R^2\) lies in how it compares these two quantities. Specifically,

\[R^2 = 1 - \frac{SS_{res}}{SS_{tot}}.\]

If the model’s predictions are perfect, the residual sum of squares vanishes and \(R^2 = 1\): the model explains all the variability of the data. On the other hand, if the model is no better than simply predicting the mean, then \(SS_{res} = SS_{tot}\) and \(R^2\) becomes zero. Intriguingly, \(R^2\) can even be negative. This happens when the model is worse than the mean, in the sense that the residual errors are larger than the variability of the data itself; it typically arises when scoring held-out data or a model fit without an intercept, since an ordinary least-squares fit with an intercept can never do worse than the mean on its own training data. In such cases, the model is actively harmful as an explanatory tool.
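
To make this concrete, here is a minimal sketch (assuming NumPy and scikit-learn are available) that computes \(R^2\) directly from \(SS_{res}\) and \(SS_{tot}\) and exhibits all three regimes, including the negative one:

```python
import numpy as np
from sklearn.metrics import r2_score

def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, computed straight from the definition."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained error
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # spread around the mean
    return 1.0 - ss_res / ss_tot

y = np.array([3.0, 5.0, 7.0, 9.0])

print(r_squared(y, np.array([2.9, 5.2, 6.8, 9.1])))  # close to 1: good fit
print(r_squared(y, np.full(4, y.mean())))            # exactly 0: the mean baseline
print(r_squared(y, np.array([9.0, 7.0, 5.0, 3.0])))  # negative: worse than the mean

# Sanity check against scikit-learn's implementation.
print(r2_score(y, np.array([2.9, 5.2, 6.8, 9.1])))
```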

The intuitive interpretation of \(R^2\) is that it represents the fraction of variance in the dependent variable that is explained by the independent variables. If you think of variance as the “spread” or unpredictability in the data, then a model’s job is to account for as much of that spread as possible by relating it to explanatory features. For instance, if \(R^2 = 0.7\), it suggests that 70% of the variance in the outcome can be explained by the model, while the remaining 30% is still noise or unexplained. This interpretation makes \(R^2\) appealing because it ties the effectiveness of the model directly to the concept of variance explanation.

However, one must also be cautious. \(R^2\) only measures variance explained relative to the mean model. It does not tell you whether the model will predict well on new data, nor does it penalize overfitting directly. For example, adding more features to a least-squares model will never decrease its \(R^2\) on the training data; it can only stay the same or increase, even if the additional features have no real explanatory power. This is why adjusted \(R^2\) is often introduced: it penalizes the inclusion of unnecessary predictors by taking into account the number of features relative to the number of data points.
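
To see that monotone climb in action, here is a small synthetic sketch (assuming scikit-learn): one informative feature plus an increasing number of pure-noise columns, scored on the training data itself.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 60
signal = rng.normal(size=(n, 1))                 # one genuinely informative feature
y = 2.0 * signal[:, 0] + rng.normal(size=n)
noise = rng.normal(size=(n, 30))                 # 30 columns unrelated to y

for k in [0, 5, 15, 30]:
    X = np.hstack([signal, noise[:, :k]])
    model = LinearRegression().fit(X, y)
    # .score() returns R^2 on the data it is given -- here the training data.
    print(f"{k:2d} noise features: training R^2 = {model.score(X, y):.3f}")
# The printed values never decrease, even though the extra columns are pure noise.
```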

Another subtle point is that \(R^2\) assumes that variance is the right quantity to explain. This makes sense in linear regression, where the goal is indeed to reduce squared error, but in contexts like nonlinear models or when distributions are highly skewed, the variance explanation picture may not fully align with predictive accuracy. For example, a model might achieve a high \(R^2\) but still perform poorly in predicting new data if it overfits. Similarly, in time series where temporal dependence is critical, variance explanation might be misleading without proper validation.
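
As a concrete illustration of that gap (a sketch on synthetic data, assuming scikit-learn), a linear model with many irrelevant features can post a high training \(R^2\) while scoring far worse, possibly even negatively, on held-out data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
n = 80
X = rng.normal(size=(n, 30))                 # 30 features, mostly uninformative
y = X[:, 0] + rng.normal(size=n)             # only the first column matters

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)
model = LinearRegression().fit(X_tr, y_tr)

print("training R^2:", model.score(X_tr, y_tr))  # inflated by fitting noise
print("held-out R^2:", model.score(X_te, y_te))  # much lower, possibly negative
```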

So, in essence, \(R^2\) is a measure of how much better your model is compared to a naive mean predictor in terms of explaining variance in the target variable. It provides a normalized sense of fit: \(R^2 = 1\) means perfect explanation, \(R^2 = 0\) means no better than the mean baseline, and negative values indicate doing worse than that baseline. Thinking of it through the lens of variance explanation grounds it in the idea of “how much of the spread in the data have we accounted for?” Still, it is always best interpreted alongside other metrics and validation strategies.

Let us now step into the geometric view, because it enriches our intuition for what \(R^2\) truly measures. Regression can be understood not only as a statistical minimization of squared error but also as a geometric projection in a high-dimensional vector space.

Imagine your dataset of responses \(y = (y_1, y_2, \dots, y_n)\) as a vector sitting in an \(n\)-dimensional space. Each coordinate represents one observation. When we perform regression with predictors \(X\), what we are really doing is trying to find another vector \(\hat{y}\) that lies in the subspace spanned by the columns of \(X\). In other words, we are projecting the outcome vector \(y\) onto the space formed by linear combinations of the predictors. The projection gives us the fitted values \(\hat{y}\), while the leftover piece—the residuals \(y - \hat{y}\)—is orthogonal to that subspace.

Now, why is this picture powerful for understanding \(R^2\)? Because variance explained corresponds to how much of the “length” (technically, squared norm) of \(y\) is captured in the projection. The total variability of \(y\) around its mean can be written as the squared length of the centered vector \(y - \bar{y}\mathbf{1}\), where \(\mathbf{1}\) is the all-ones vector. That is the total sum of squares, \(SS_{tot}\). The part explained by the regression is the squared length of the projection of this centered vector onto the column space of \(X\). That is called the regression sum of squares, \(SS_{reg}\). The residual sum of squares \(SS_{res}\) is simply the squared length of the orthogonal residual vector. For a least-squares fit whose column space contains the all-ones vector (that is, a model with an intercept), the residuals are orthogonal both to the fitted values and to the mean direction, so by Pythagoras these quantities satisfy the neat identity

\[SS_{tot} = SS_{reg} + SS_{res}.\]

And from this decomposition, you see that

\[R^2 = \frac{SS_{reg}}{SS_{tot}},\]

which is literally “how much of the squared length is explained by the projection.”
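
This decomposition is easy to check numerically. The sketch below (NumPy only, synthetic data) fits ordinary least squares with an intercept via `np.linalg.lstsq`, then verifies both the Pythagorean identity and the two equivalent expressions for \(R^2\):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=2.0, size=n)

# Least-squares fit with an intercept: project y onto span{1, x_1, ..., x_p}.
X1 = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
y_hat = X1 @ beta

ss_tot = np.sum((y - y.mean()) ** 2)      # squared length of the centered y
ss_reg = np.sum((y_hat - y.mean()) ** 2)  # squared length of the projected part
ss_res = np.sum((y - y_hat) ** 2)         # squared length of the residual vector

print(np.isclose(ss_tot, ss_reg + ss_res))     # Pythagoras: True
print(ss_reg / ss_tot, 1.0 - ss_res / ss_tot)  # two routes to the same R^2
```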

This geometry also reveals another interpretation: for a least-squares fit with an intercept, \(R^2\) is the square of the correlation coefficient between \(y\) and \(\hat{y}\). If you think of correlation as measuring alignment between two vectors, then \(R^2\) measures the degree to which the predicted vector lies in the same direction as the true vector. Perfect alignment gives correlation \(1\) and thus \(R^2 = 1\). A poor model, on the other hand, produces predictions that are only weakly aligned with the actual responses, giving a small \(R^2\). Negative \(R^2\) cannot come from a genuine orthogonal projection of the data being scored; it appears when the predictions are not such a projection, for instance when a fitted model is evaluated on new data or lacks an intercept, so that the prediction vector ends up farther from \(y\) than the mean baseline is.

So geometrically, variance explainability means this: you start with the cloud of data points in high-dimensional space, you draw the straightest line or hyperplane you can through them (given by the regression model), and you measure how much of the original data’s spread is captured along that line. The closer your data vector \(y\) lies to the subspace spanned by your predictors, the more variance you have explained, and the higher your \(R^2\).

This perspective unifies the algebraic and statistical definitions. From one side, \(R^2\) is “1 minus unexplained variance over total variance.” From the other, it is the squared cosine of the angle between the true outcomes and the predictions. Both tell the same story: it quantifies alignment, projection, and variance accounted for by the model.
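
The squared-correlation reading can be verified the same way. In the sketch below (NumPy only, assuming a least-squares fit with an intercept), \(R^2\) computed from the sums of squares coincides with the squared cosine of the angle between the centered \(y\) and \(\hat{y}\), which is exactly the squared correlation:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=100)
y = 3.0 * x + rng.normal(size=100)

# Least-squares fit of y on x with an intercept.
slope, intercept = np.polyfit(x, y, 1)
y_hat = slope * x + intercept

r2 = 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

# Squared cosine of the angle between the centered vectors
# = squared correlation between y and y_hat.
yc, yhc = y - y.mean(), y_hat - y_hat.mean()
cos_angle = np.dot(yc, yhc) / (np.linalg.norm(yc) * np.linalg.norm(yhc))

print(np.isclose(r2, cos_angle ** 2))                    # True
print(np.isclose(r2, np.corrcoef(y, y_hat)[0, 1] ** 2))  # True
```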


Let us now move from \(R^2\) to its refined cousin: adjusted \(R^2\). The motivation for this adjustment emerges from a subtle flaw in plain \(R^2\). Remember that \(R^2\) never decreases as you add more predictors to a model. Even if the new variable has no true relationship with the target, the mere act of giving the model more flexibility allows it to fit the data slightly better, thereby reducing the residual sum of squares. This means that \(R^2\) is biased toward models with more features, and left unchecked, it can reward overfitting.

Adjusted \(R^2\) was introduced to correct this. Its guiding idea is simple: yes, adding predictors can reduce error, but unless they genuinely improve explanatory power, they should be penalized for consuming degrees of freedom. The formula makes this precise. If you have \(n\) data points and \(p\) predictors, then adjusted \(R^2\) is defined as

\[R^2_{adj} = 1 - \frac{SS_{res}/(n-p-1)}{SS_{tot}/(n-1)}.\]

Notice the two degrees-of-freedom corrections: instead of comparing raw sums of squares, we now compare mean squares. The total sum of squares is divided by \(n-1\), the degrees of freedom that remain after estimating a single mean, while the residual sum of squares is divided by \(n-p-1\), the degrees of freedom that remain after fitting \(p\) predictors plus an intercept. In essence, this formula asks: how much better is the model than the mean predictor, once we take into account the “cost” of the parameters used?

The behavior of adjusted \(R^2\) is revealing. If a new predictor improves the model enough that the reduction in residual variance outweighs the penalty of losing a degree of freedom, adjusted \(R^2\) will rise. But if the predictor does not help much, the penalty dominates and adjusted \(R^2\) will actually fall. This makes it a more balanced tool for comparing models of different complexity.
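
Here is a brief sketch of that trade-off (synthetic data, assuming scikit-learn; the helper `adjusted_r2` is just for illustration): plain \(R^2\) ticks up when a pure-noise column is appended, while adjusted \(R^2\) usually moves the other way.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def adjusted_r2(r2, n, p):
    """Adjusted R^2 = 1 - (SS_res / (n - p - 1)) / (SS_tot / (n - 1))."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

rng = np.random.default_rng(3)
n = 40
X = rng.normal(size=(n, 2))
y = X @ np.array([1.0, -1.0]) + rng.normal(size=n)

for label, X_fit in [("2 real predictors ", X),
                     ("+1 noise predictor", np.hstack([X, rng.normal(size=(n, 1))]))]:
    p = X_fit.shape[1]
    r2 = LinearRegression().fit(X_fit, y).score(X_fit, y)
    print(f"{label}: R^2 = {r2:.4f}, adjusted R^2 = {adjusted_r2(r2, n, p):.4f}")
# Adjusted R^2 rises only if the gain from the new column outweighs the lost
# degree of freedom; for a pure-noise column it usually falls.
```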

Another way to view it is through the lens of variance explanation again. While \(R^2\) asks “what fraction of variance do we explain,” adjusted \(R^2\) sharpens the question to “what fraction of variance do we explain per unit of explanatory effort?” It acknowledges that variance can be explained trivially by throwing in more variables, but meaningful explanation comes only when the gain surpasses the cost.

It is important, however, to keep adjusted \(R^2\) in perspective. It is not a panacea. It still assumes that linear regression is the right modeling framework and that squared error is the right measure of fit. It is also influenced by sample size: with a small \(n\), the penalty for adding predictors is heavy, while with very large \(n\), adjusted \(R^2\) behaves more like plain \(R^2\). Nonetheless, within the family of linear regression comparisons, it is often the preferred metric because it guards against the seductive but misleading climb of \(R^2\) as more predictors are introduced.

So, if you think of \(R^2\) as a measure of variance explanation in absolute terms, adjusted \(R^2\) is the disciplined version that insists on efficiency—explaining variance, yes, but doing so responsibly, without inflating the sense of achievement by smuggling in unnecessary variables.




