R squared not the same in graph and spread sheet.

  • Thread starter Thread starter James Silverton
  • Start date Start date
J

James Silverton

I have an X-Y plot displayed with a trend line equation and R squared. I
also calculate R squared (RSQ) and the numbers are not the same; 0.8918
and 0.928. Can someone explain this (Excel 2010 and Windows 7)?
 
James Silverton said:
I have an X-Y plot displayed with a trend line equation
and R squared. I also calculate R squared (RSQ) and the
numbers are not the same; 0.8918 and 0.928. Can someone
explain this (Excel 2010 and Windows 7)?

You would need to post [1] more information for a dispositive explanation.
But generally, there might be two causes for such a large difference:

1. You used RSQ with the original X-Y data points in the worksheet instead
of with trendline X-Y data points.

2. If you used RSQ with trendline X-Y data points that you generated in the
worksheet, you might have used rounded coefficients as they appear in the
trendline equation instead of "exact" coefficients.

The remedy to #2 is to recover the "exact" coefficients in one of two ways.
The quick-and-dirty way is to format the trendline equation so that
coefficients are displayed with the Scientific format with 14 decimal
places. Then copy-and-pasted each coefficient into the worksheet. The
better method is to use LINEST to calculate the coefficients in the
worksheet; or you might be able to use FORECAST, TREND or another similar
function. We need more information in order to tell you exactly how to do
that.
 
James Silverton said:
I have an X-Y plot displayed with a trend line equation
and R squared. I also calculate R squared (RSQ) and the
numbers are not the same; 0.8918 and 0.928. Can someone
explain this (Excel 2010 and Windows 7)?

You would need to post [1] more information for a dispositive
explanation. But generally, there might be two causes for such a large
difference:

1. You used RSQ with the original X-Y data points in the worksheet
instead of with trendline X-Y data points.

2. If you used RSQ with trendline X-Y data points that you generated in
the worksheet, you might have used rounded coefficients as they appear
in the trendline equation instead of "exact" coefficients.

The remedy to #2 is to recover the "exact" coefficients in one of two
ways. The quick-and-dirty way is to format the trendline equation so
that coefficients are displayed with the Scientific format with 14
decimal places. Then copy-and-pasted each coefficient into the
worksheet. The better method is to use LINEST to calculate the
coefficients in the worksheet; or you might be able to use FORECAST,
TREND or another similar function. We need more information in order to
tell you exactly how to do that.

Here are the data
X Y
2 1.55
3 1.57
4 1.79
5 1.59
6 1.84
7 3.14
8 2.77
9 3.01
10 2.5
12 2.87
13 4.01
14 4.53
15 4.59
16 4.64
17 4.54
18 5.44
19 4.84
20 4.67
21 4.41
22 4.36
23 5.29
24 5.4
25 6.94
26 6.9
27 7.11
28 7.34
29 7.9
30 7.16
31 8.28
32 7.39
33 6.81

Is the fact that I am using straight lines to connect the points in the
graph of importance?
 
Ron Rosenfeld said:
On Thu, 16 Feb 2012 13:22:56 -0500, James Silverton
FYI, in Excel 2007, I get the same values for RSQ and
displayed as r^2 on the chart (0.921914019817796)
I plotted an x-y scatter chart, linear trendline

RSQ of the original data and R^2 for the linear trendline should always be
about the same, since RSQ is essentially used to construct the linear
trendline. See http://en.wikipedia.org/wiki/Simple_regression.

Some infinitesimal difference is possible due to floating-point anomalies.
The degree of difference is likely to vary depending on the number of data
of points and the range of the magnitude of the values.
 
James Silverton said:
Here are the data [....]
Is the fact that I am using straight lines to connect
the points in the graph of importance?

No. But the type of trendline is. You neglect to mention it.

We would expect RSQ of the original data and R^2 of the trendline to be
about the same only if you use a __linear__ trendline. Some relatively
small difference is possible due to floating-point anomalies; that is, due
to the way that Excel represents numbers internally.

But based on the difference you noted previously -- 0.8918 for one, 0.928
for the other, not clear which is which -- I suspect you used an
__exponential__ trendline.

For the data that you posted, RSQ returns 0.921914019817796, and R^2 for the
exponential trendline is 0.894858644182441.

Note that the two R-squares are relatively close to what you mentioned
previously. I presume the difference is because you posted rounded data as
they are displayed and the underlying actual values are slightly different.
Or because you posted different data altogether.

In any case, assuming you used an exponential trendline of the form y =
c*e^(b*x), you can compute the trendline coefficients b and c in Excel as
follows:

B1: =LINEST(LN(Y1:Y31),X1:X31)
C1: =EXP(INDEX(LINEST(LN(Y1:Y31),X1:X31),1,2))

Then you can compute the corresponding data points along the trendline by
putting the following formula into Z1 and copying down:

Z1: =$C$1*EXP($B$1*X1)

Then R^2 might be computed by the formula =RSQ(Y1:Y31,Z1:Z31). Note that
RSQ(knownY,knownX) should not be taken literally to mean Y-axis and X-axis
data.

However, RSQ returns 0.997576649432384, not 0.894858644182441.

I believe the explanation is: there are several definitions of R-squared.
The LINEST help page explains how R-squared is computed for LINEST.
Off-hand, it seems different from RSQ. (The two equations may or may not be
mathematically equivalent for linear trendlines. TBD.)

Frankly, I cannot duplicate the trendline R^2 even using the equations in
the LINEST help page. Perhaps I am making some dumb mistakes (TBD).

But you can recover the LINEST R^2 with the following formula:

=INDEX(LINEST(LN(B1:B31),A1:A31,,TRUE),3,1)

That returns 0.894858644182439. The infinitesimal difference (2E-15) is
probably due to floating-point anomalies in the apparently different
implementations of the LINEST and trendline R^2 algorithms.

Not exactly a dispositive explanation (yet). But at least I've given you
the LINEST formulas that you can use to duplicate the chart trendline.

For the coefficients of other types of trendlines, see
http://spreadsheetpage.com/index.php/tip/chart_trendline_formulas.
 
joeu2004 said:
Then R^2 might be computed by the formula =RSQ(Y1:Y31,Z1:Z31).
[....]
Note that
However, RSQ returns 0.997576649432384, not 0.894858644182441.

__That__ was a dumb mistake -- a typo in some formulas. That RSQ formula
returns 0.879710250121746.

Still different from R^2 returns by LINEST and displayed for the exponential
trendline.

But much closer.
 
Ron Rosenfeld said:
Oh, I only posted to indicate to the OP that I was not
seeing a difference, using his data and Excel 2007.

But Jim did not say he used a linear trendline, IMHO.

Jim wrote: "Is the fact that I am using straight lines to connect the
points in the
graph of importance?".

I suspect you interpreted that to refer a linear trendline.

I interpret that to be a description of the chart type that he used:
straight lines between data points instead of a smooth curve or no
connection at all.

I believe that prior to Excel 2007, MS was using a different,
and inferior, algorithm for LINEST and the linear trendline
generator in charting.

You may or may not be correct about changes in the LINEST and/or trendline
algorithms. I don't recall.

But that seems irrelevant. Jim said he is using XL2010.

And FYI, I get the "same" results for RSQ, LINEST and the trendline using
XL2003.

(Actually, the RSQ result differs by one in the 15th significant digit.)
 
But Jim did not say he used a linear trendline, IMHO.

Jim wrote: "Is the fact that I am using straight lines to connect the
points in the
graph of importance?".

I suspect you interpreted that to refer a linear trendline.

I interpret that to be a description of the chart type that he used:
straight lines between data points instead of a smooth curve or no
connection at all.



You may or may not be correct about changes in the LINEST and/or
trendline algorithms. I don't recall.

But that seems irrelevant. Jim said he is using XL2010.

And FYI, I get the "same" results for RSQ, LINEST and the trendline
using XL2003.

(Actually, the RSQ result differs by one in the 15th significant digit.)

I will just say that I am using a linear trend line; the only type for
which I want to use R^2.

I repeated the graph and spread sheet with the numbers I supplied (in
case non-displayed places were a problem in my original spread sheet)
and the two values of R squared are closer; 0.915 and 0.9219 but I don't
see why there should be any difference if the same formula is being used.
 
PS.... I said:
Then R^2 might be computed by the formula =RSQ(Y1:Y31,Z1:Z31). [....]
However, RSQ returns 0.997576649432384, not 0.894858644182441. [....]
you can recover the LINEST R^2 with the following formula:
=INDEX(LINEST(LN(B1:B31),A1:A31,,TRUE),3,1)

First, some errata.... In keeping with the other cell assignments that I
used, that should be written as:

=INDEX(LINEST(LN(Y1:Y31),X1:X31,,TRUE),3,1)

Second, according to response #8 in
http://www.ozgrid.com/forum/showthread.php?t=151617, R^2 should be computed
by:

=RSQ(LN(Y1:Y31),X1:X31)

Although that is indeed the same as the value returned by LINEST, I will
need to cogitate on it to understand why it should be.

Well, clearly LINEST is using the "correlation coefficient" definition of
R^2, not the "coefficient of determination" definition of R^2 described in
the LINEST help text.

I am not familiar with the mathematical and conceptual differences between
the two.
 
James Silverton said:
I will just say that I am using a linear trend line;
the only type for which I want to use R^2.

Okay, GIGO. For the future, learn to state all the relevant facts
correctly.

I repeated the graph and spread sheet with the numbers
I supplied (in case non-displayed places were a problem
in my original spread sheet) and the two values of R squared
are closer; 0.915 and 0.9219 but I don't see why there
should be any difference if the same formula is being used.

If you are saying that you copy-and-paste the numbers you posted into an
Excel worksheet, as Ron and I had to do, and you get those two different
numbers, I have no idea what mistake you are making.

Both Ron and I get the same results. That is, RSQ,
INDEX(LINEST(Y1:Y31,X1:X31),3,1) and the trenadline equation (formatted as
Scientific with 14 decimal places) all return the same result at least to
the 14th significant digit.

I am using XL2010 as you say you are. I am using an Intel-compatible
computer (in fact, an Intel computer), as I ass-u-me you are.

IMHO, the only way to unravel the problem is for you to upload an example
Excel file to a file-sharing website and post the URL of the shared uploaded
file here. Here is a list of some free file-sharing websites; or use your
own.

Box.Net: http://www.box.net/files
Windows Live Skydrive: http://skydrive.live.com
MediaFire: http://www.mediafire.com
FileFactory: http://www.filefactory.com
FileSavr: http://www.filesavr.com
FileDropper: http://www.filedropper.com
RapidShare: http://www.rapidshare.com
 
For posterity, albeit irrelevant now that Jim has clarified his
situation....

I wrote previously:
Then R^2 might be computed by the formula =RSQ(Y1:Y31,Z1:Z31).
Note that RSQ(knownY,knownX) should not be taken literally
to mean Y-axis and X-axis data.

Wrong! RSQ is intended to be a measure of __linear_dependence__ between two
sets of data, knownY and knownX. In other words, there should be a linear
relationship between the two. So knownY and knownX should indeed by the
Y-axis and X-axis coordinates of some function. That is also the sense of
R^2 displayed with a chart trendline.

In contrast, the "coeffiecient of determination", also called R^2 :-(, is a
measure of fit between actual and predicted data. In Excel, it might be
written as =1 - SUMXMY2(Y1:Y31,Z1:Z31)/DEVSQ(Y1:Y31). However, I am not
sure if the theory behind the "coefficient of determination" permits it to
be used a measure of fit between the actual data and an exponential
trendline.

Second, according to response #8 in
http://www.ozgrid.com/forum/showthread.php?t=151617, R^2 should be
computed by:
=RSQ(LN(Y1:Y31),X1:X31)

Note that LN(Y1:Y31) is the log of the exponential trendline given by y =
c*exp(m*x), where c = exp(b). Thus, LINEST(LN(Y1:Y31),X1:X31) is the
__linear__ function ln(y) = m*x + b. So RSQ(LN(Y1:Y31),X1:X31) is a measure
of the linear dependence between ln(y) and x. And by commutativity, it is a
measure of the exponential dependence between y and x.

(To be consistent with LOGEST documentation, that I use "m" where "b" was
used previously.)
 
For posterity, albeit irrelevant now that Jim has clarified his
situation....

I wrote previously:

Wrong! RSQ is intended to be a measure of __linear_dependence__ between
two sets of data, knownY and knownX. In other words, there should be a
linear relationship between the two. So knownY and knownX should indeed
by the Y-axis and X-axis coordinates of some function. That is also the
sense of R^2 displayed with a chart trendline.

In contrast, the "coeffiecient of determination", also called R^2 :-(,
is a measure of fit between actual and predicted data. In Excel, it
might be written as =1 - SUMXMY2(Y1:Y31,Z1:Z31)/DEVSQ(Y1:Y31). However,
I am not sure if the theory behind the "coefficient of determination"
permits it to be used a measure of fit between the actual data and an
exponential trendline.



Note that LN(Y1:Y31) is the log of the exponential trendline given by y
= c*exp(m*x), where c = exp(b). Thus, LINEST(LN(Y1:Y31),X1:X31) is the
__linear__ function ln(y) = m*x + b. So RSQ(LN(Y1:Y31),X1:X31) is a
measure of the linear dependence between ln(y) and x. And by
commutativity, it is a measure of the exponential dependence between y
and x.

(To be consistent with LOGEST documentation, that I use "m" where "b"
was used previously.)
Thanks to everyone and apologies for wasting time. I now know that the
points I had selected for the graph were not the same as those used for
calculating R^2 in the spreadsheet. I'm not quite sure how I did it but
I think I was still attuned to methods I had used in Excel 2003 when I
first set up the spread sheet. The numbers I posted to the ng do give
R^2 = 0.9219 both on the chart and on the spreadsheet.
 
Is the fact that I am using straight lines to connect the points in the
graph of importance?

Are you forcing the regression to go through the origin? That makes
R² into nonsense.
 
Are you forcing the regression to go through the origin? That makes
R² into nonsense.
I think I have been pointed to the reason for my troubles, see earlier,
but no, I don't force the linear regression to go thro the origin. I
don't actually use R^2 in any good statistical sense except as an
indicator of variation.
 
Back
Top