Z score on existing data

  • Thread starter Thread starter jimbo
  • Start date Start date
J

jimbo

I have 12 data points representing 12 months of data for each of 100
people. I'm looking for outliers. So I set up a high z score, max-mean/
SD, and a low z, mean-min/SD. When I look at my table, my z scores
look low; I see clear outliers that have a reletively low z.

I think that this is due to the outlier being compared to a set of
data that it is already part of, contributing both to increasing the
denominator as well as lowering the numerator by being part of the
mean. If I were to compare a new value to the set of twelve data
points, the z would be accurate, but I somehow have to take the
outlier out of the group and then recalculate the z and I have to
check for both low and high.

With a lot of rows of data, doing this manually is impossible. I can't
figure out how to do it in excel either. Am I using the wrong approach
or is there a way, or a mathematical compensation that makes up for
the high and low outlier being part of the mean and SD?
 
jimbo said:
I somehow have to take the outlier out of the group
and then recalculate the z and I have to check for
both low and high. [....]
I can't figure out how to do it in excel either.
Am I using the wrong approach or is there a way, or a
mathematical compensation that makes up for the high
and low outlier being part of the mean and SD?

I have used a similar iterative approach when eliminating outliers. AFAIK,
there is no "mathematical compensation" factor. You simply need to exclude
outliers from the (re)computation of the mean and SD.

But there are tricks that you might be able to use to make it easier to do
in Excel.

One approach to eliminating outliers is to convert them to text or to clear
their cell. Note that Excel functions like AVERAGE and STDEV ignore text
and empty cells. Therefore, you do not need to adjust their parameters.

For example, if D90 contains the number 123 and you determine that is an
outlier, change the cell to '123 (note the apostrophe) and change the
horizontal format to right-alignment. Or simply clear D90.

You can still use AVERAGE(A1:L100) and STDEV(A1:L100). The mean and SD are
adjusted automatically, and the identification of subsequent outliers will
be based on the new mean and SD.

Note-1: If you change 123 to text, it would be prudent to change to the
font or cell fill color or otherwise identify the data as an excluded
outlier.

Note-2: Simply changing the cell format to Text alone does not cause
numeric constants to be converted to text, although it might change the
horizontal alignment to be similar to text.

Alternatively, it might be tempting to simply change the font or cell fill
color; that is easy to do automatically with a Condition Format. Then you
might use array-entered AVERAGE and STDEV formulas to ignore the colored
outliers.

However, AFAIK, there is no Excel function that could be used in a
conditional expression to recognized and ignore colored outliers. In
particular, CELLS("color") can__not__ be used for that purpose.

But you could write your own VBA function to recognize colored outliers.
See http://www.cpearson.com/excel/colors.aspx for details and ideas. Then
the mean and SD might be computed by the following array-entered formulas
(press ctrl+shift+Enter instead of just Enter):

=AVERAGE(IF(MYCOLOR(A1:L100)=0,A1:L100))
=STDEV(IF(MYCOLOR(A1:L100)=0,A1:L100))

Although I believe the __elimination__ of outliers (in contrast to their
__identification__) should be a manual procedure, you could implement the
iterative process in a macro.


Some unsolicited comments....

jimbo said:
I have 12 data points representing 12 months of data for
each of 100 people. I'm looking for outliers. So I set
up a high z score, max-mean/SD, and a low z, mean-min/SD.
When I look at my table, my z scores look low; I see clear
outliers that have a reletively low z.

Using the z-score to identify outliers is appropriate if the data are
normally distributed, or if you have good reason to believe a large
population of data would be.

Otherwise, you might consider alternative criteria, such as the
interquartile range (IQR).

In any case, "12 data points" is not much data for computing statistics and
identifying outliers. I hope you mean that the mean and SD are based on
1200 data points, which you might group by month for presentation.
 
Back
Top