How to Trap Rogue Data?

bill..

I have an application that generates hourly system performance
logfiles, which I graph to look for long-term trends.
The metric I use varies gradually from 1% to about 15%, depending on
various external factors such as time of day and day of week.

My problem is that the logfiles sometimes hiccup and generate bad data,
resulting in huge spikes in my curve. I have trapped the big ones
(anything over 20%) in my source data, but I need something smarter
that can catch large deviations from the curve.

Unfortunately I do not have the option to fix the application that
generated the bad data.


Is there any statistical function I can use to look for such
deviations?

Something that would allow me to toss any data over 5% from the trend
line would be perfect.

eg

value
1.2
1.9
2.4
3.1
2.6
11.3 toss this one using NA() since it is way off the curve
3.4
4.6
6.3
7.5
9.3
11.3 keep this one since it is not too far off the curve
8.8

Any suggestions?

Thanks

Bill
 
Bill -

Not too sophisticated, but what I've sometimes done is compute a sort of
moving average in the next column, using the two or three points above and
below each row, and then in the column after that include only the data
that differs from that average by less than X. It's easier than a
trendline, and if your spikes are one point wide, it works fine.
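
A minimal worksheet sketch of that idea, assuming the values sit in
column A and you start a few rows down so the window exists; the column
letters, starting row and tolerance of 3 are just placeholders to adjust:

B5:  =AVERAGE(A2:A4,A6:A8)        average of the three points above and the three below row 5, skipping A5 itself
C5:  =IF(ABS(A5-B5)>3, NA(), A5)  keep the value only if it stays within the tolerance, otherwise return #N/A

Fill B and C down and chart column C; the points that come back #N/A
are skipped by the chart instead of being plotted as spikes.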

- Jon
-------
Jon Peltier, Microsoft Excel MVP
Peltier Technical Services
Tutorials and Custom Solutions
http://PeltierTech.com/
_______
 
bill.. said:
I have an application that generates hourly system performance
logfiles, which I graph to look for long-term trends.
The metric I use varies gradually from 1% to about 15%, depending on
various external factors such as time of day and day of week.

My problem is that the logfiles sometimes hiccup and generate bad data,
resulting in huge spikes in my curve. I have trapped the big ones
(anything over 20%) in my source data, but I need something smarter
that can catch large deviations from the curve.

Unfortunately I do not have the option to fix the application that
generated the bad data.


Is there any statistical function I can use to look for such
deviations?

If you are sure they are definitely limited to single spikes on a
nominal but noisy smooth trend line then the local second derivative
estimator is a reasonable test. Set a threshold on that to decide on
rogue points.

ABS(x[i-1] + x[i+1] - 2*x[i])

You really need to be sure that they *are* rogue though. The test has no
way of knowing what you really intend. An alternative is to use 1-Norm
fitting which will safely ignore modest numbers of rogue points.
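
As a worksheet sketch of that second-difference test (values again
assumed in column A; the threshold of 10 is only a placeholder to tune
against your own data):

B5:  =ABS(A4+A6-2*A5)       local second difference at row 5
C5:  =IF(B5>10, NA(), A5)   flag the point as rogue when the second difference exceeds the threshold

One caveat: a single spike also inflates the second difference at its
two neighbours (to roughly half the value it takes at the spike itself),
so either set the threshold well above the neighbour level or only
reject the row where column B is a local maximum.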
Something that would allow me to toss any data over 5% from the trend
line would be perfect.

eg

value
1.2
1.9
2.4
3.1
2.6
11.3 toss this one using NA() since it is way off the curve
3.4
4.6
6.3
7.5
9.3
11.3 keep this one since it is not too far off the curve
8.8

I prefer my plots with all noise displayed and to fix the problem at
source. You never know when sensor or equipment failure might produce
real spikes in the signal that a filter will helpfully throw away.

Regards,
 
Thanks for the suggestions

I understand averaging the previous and next points, but what is
1-Norm fitting?


Bill

 
bill.. said:
Thanks for the suggestions

I understand averaging the previous and next points, but what is
1-Norm fitting?

Classical least squares minimises the sum of the squares of the residuals,
i.e. the 2-Norm:

SUM of ( y[i] - F(x[i]) )^2

1-Norm fitting instead minimises the sum of the absolute magnitudes of
the residuals:

SUM of | y[i] - F(x[i]) |

For the simplest case, fitting a single constant to a small dataset:

2-Norm estimator: the Mean = average of all N data values

1-Norm estimator: the Median = the middle value of the sorted data

The same trick can be applied to polynomial fitting with robust
statistics. It is much better at ignoring modest numbers of erratic
outliers in noisy data.
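
In a worksheet, a cheap way to borrow that robustness without doing a
true 1-Norm polynomial fit is to compare each point against a rolling
median rather than a rolling average, since the median is the 1-Norm
estimator; the column letters, window width and threshold below are
just placeholders:

B6:  =MEDIAN(A4:A8)                 median of a five-point window centred on row 6; one spike barely moves it
C6:  =IF(ABS(A6-B6)>3, NA(), A6)    toss the point when it sits more than 3 away from the local median

Unlike the average in the earlier sketch, the window here can include
the row itself, because a single outlier has almost no effect on the
median.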

Regards,
 