When we first learned about the correlation coefficient, $r$, we focused on what it meant rather than how to calculate it, since the computations are lengthy and computers usually take care of them for us.
We'll do the same with $r^2$ and concentrate on how to interpret what it means.
In a way, $r^2$ measures how much prediction error is eliminated when we use least-squares regression.
Predicting without regression
We use linear regression to predict $y$ given some value of $x$. But suppose that we had to predict a $y$ value without a corresponding $x$ value.
Without using regression on the $x$ variable, our most reasonable estimate would be to simply predict the average of the $y$ values.
Here's an example, where the prediction line is simply a horizontal line at the mean of the $y$ values.
Notice that this line doesn't seem to fit the data very well. One way to measure the fit of the line is to calculate the sum of the squared residuals—this gives us an overall sense of how much prediction error a given model has.
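So without least-squares regression, our sum of squares is $\sum (y_i - \bar{y})^2$, the total of the squared distances between each $y$ value and the mean $\bar{y}$.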
Would using least-squares regression reduce the amount of prediction error? If so, by how much? Let's see!
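To make the mean-only baseline concrete, here is a minimal sketch in Python. The data are hypothetical, invented for illustration, and are not the dataset from the example above; the variable names are also assumptions.

```python
# Hypothetical y values, invented for illustration (not the article's data).
ys = [2, 4, 5, 4, 5]

# With no x information, predict the mean of y for every point.
y_mean = sum(ys) / len(ys)

# Residual = observed y minus predicted y (here, always the mean).
residuals = [y - y_mean for y in ys]

# Sum of squared residuals for the mean-only model.
ss_mean = sum(e ** 2 for e in residuals)

print(y_mean)   # 4.0
print(ss_mean)  # 6.0
```

The larger this sum, the more the points are spread around the horizontal mean line.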
Predicting with regression
Here's the same data with the corresponding least-squares regression line and summary statistics:
This line seems to fit the data pretty well, but to measure how much better it fits, we can look again at the sum of the squared residuals:
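Using least-squares regression reduced the sum of the squared residuals from $\sum (y_i - \bar{y})^2$ to $\sum (y_i - \hat{y}_i)^2$, where $\bar{y}$ is the mean of the $y$ values and $\hat{y}_i$ is the regression line's prediction for the $i$-th point.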
So using least-squares regression eliminated a considerable amount of prediction error. How much though?
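As a rough illustration, the sketch below fits a least-squares line by hand and computes its sum of squared residuals. The data are hypothetical (not the article's dataset), and the slope and intercept use the standard least-squares formulas rather than any particular library.

```python
# Hypothetical data, invented for illustration (not the article's data).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_mean = sum(xs) / n
y_mean = sum(ys) / n

# Standard least-squares formulas for slope and intercept.
sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
sxx = sum((x - x_mean) ** 2 for x in xs)
slope = sxy / sxx
intercept = y_mean - slope * x_mean

# Predicted y for each x, then the sum of squared residuals.
predictions = [intercept + slope * x for x in xs]
ss_regression = sum((y - p) ** 2 for y, p in zip(ys, predictions))

print(round(slope, 4), round(intercept, 4))  # 0.6 2.2
print(round(ss_regression, 4))               # 2.4
```

For these made-up points, predicting the mean of $y$ for every point gives a sum of squares of 6, so the regression line leaves noticeably less squared error.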
R-squared measures how much prediction error we eliminated
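Without using regression, our model had an overall sum of squares of $SS_{\text{mean}} = \sum (y_i - \bar{y})^2$, where $\bar{y}$ is the mean of the $y$ values. Using least-squares regression reduced that down to $SS_{\text{regression}} = \sum (y_i - \hat{y}_i)^2$, where $\hat{y}_i$ is the regression line's prediction for the $i$-th point.
So the total reduction there is $SS_{\text{mean}} - SS_{\text{regression}}$.
We can represent this reduction as a percentage of the original amount of prediction error:
$$\frac{SS_{\text{mean}} - SS_{\text{regression}}}{SS_{\text{mean}}}$$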
If you look back up above, you'll see that this percentage matches the value of $r^2$ reported in the summary statistics.
R-squared tells us what percent of the prediction error in the $y$ variable is eliminated when we use least-squares regression on the $x$ variable.
As a result, $r^2$ is also called the coefficient of determination.
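The percent-reduction idea can be sketched directly in code. This is a minimal illustration on hypothetical data (not the article's dataset); the helper name `sum_sq` is an assumption introduced here.

```python
def sum_sq(actual, predicted):
    """Sum of squared residuals between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))

# Hypothetical data, invented for illustration (not the article's data).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_mean = sum(xs) / n
y_mean = sum(ys) / n

# Least-squares slope and intercept from the standard formulas.
slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
         / sum((x - x_mean) ** 2 for x in xs))
intercept = y_mean - slope * x_mean

# Prediction error without regression (mean only) and with regression.
ss_mean = sum_sq(ys, [y_mean] * n)
ss_regression = sum_sq(ys, [intercept + slope * x for x in xs])

# Fraction of the original prediction error that regression eliminated.
fraction_eliminated = (ss_mean - ss_regression) / ss_mean

print(round(fraction_eliminated, 4))  # 0.6
```

Here the regression line removes about 60% of the squared prediction error of the mean-only model, so the coefficient of determination for this toy dataset is about 0.6.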
Many formal definitions say that $r^2$ tells us what percent of the variability in the $y$ variable is accounted for by the regression on the $x$ variable.
It seems pretty remarkable that simply squaring $r$ gives us this measurement. Proving this relationship between $r$ and $r^2$ is pretty complex, and is beyond the scope of an introductory statistics course.
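Although the proof is out of scope, the relationship is easy to check numerically. The sketch below, on hypothetical data (not the article's dataset), computes $r$ from its usual formula and compares $r^2$ with the fraction of prediction error eliminated. It is a single numerical check, not a proof.

```python
import math

# Hypothetical data, invented for illustration (not the article's data).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_mean = sum(xs) / n
y_mean = sum(ys) / n

sxx = sum((x - x_mean) ** 2 for x in xs)
syy = sum((y - y_mean) ** 2 for y in ys)  # also the mean-only sum of squares
sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))

# Correlation coefficient r.
r = sxy / math.sqrt(sxx * syy)

# Least-squares line and its sum of squared residuals.
slope = sxy / sxx
intercept = y_mean - slope * x_mean
ss_regression = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

# Fraction of prediction error eliminated by regression.
fraction_eliminated = 1 - ss_regression / syy

print(round(r, 4))                                          # 0.7746
print(round(r ** 2, 4), round(fraction_eliminated, 4))      # 0.6 0.6
print(math.isclose(r ** 2, fraction_eliminated))            # True
```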