Paul
John said: Says who?
Was the door you came through to get to USENET labeled "CLUELESS
NEWBIES ENTER HERE"?
That is wrenchingly boring IMO.
Says another clueless newbie trolling for answers?
The MTBF specification might be wrong, but your justification for
the difference between the specification and the allegedly low
performance is nonsense.
Provide some authoritative citations.
A Wikipedia article had this as its first citation. I've copied
the whole thing, because I know how you like details.
http://www.faqs.org/faqs/arch-storage/part2/section-151.html
In particular, look at the first sentence in the third paragraph.
In terms of the "bathtub curve", MTBF is computed based on the
flat section in the middle of the "bathtub", and is not based
on the wearout phase (end of the bathtub). So "Ohaya" is right.
The MTBF doesn't take the wearout mechanism into account. It is
basically taking all the non-wearout failure reasons into account.
We know if you do enough writes (like leave a write benchmark
running by accident), it will wear out and die on you, and
in an interval much shorter than the MTBF quoted.
*******
M T B F
In order to understand MTBF (Mean Time Between Failures) it is best to
start with something else -- something for which it is easier to
develop an intuitive feel. This other concept is failure rate which
is, not surprisingly, the average (mean) rate at which things fail. A
"thing" could be a component, an assembly, or a whole system. Some
things -- rocks, for example -- are accepted to have very low failure
rates while others -- British sports cars, for example -- are (or
should be) expected to have relatively high failure rates.
It is generally accepted among reliability specialists (and you,
therefore, must not question it) that a thing's failure rate isn't
constant, but generally goes through three phases over a thing's
lifetime. In the first phase the failure rate is relatively high, but
decreases over time -- this is called the "infant mortality" phase
(sensitive guys these reliability specialists). In the second phase
the failure rate is low and essentially constant -- this is
(imaginatively) called the "constant failure rate" phase. In the
third phase the failure rate begins increasing again, often quite
rapidly -- this is called the "wearout" phase. The reliability
specialists noticed that when plotted as a function of time the
failure rate resembled a familiar bathroom appliance -- but they
called it a "bathtub" curve anyway. The units of failure rate are
failures per unit of "thing-time"; e.g. failures per machine-hour or
failures per system-year.
What, you may ask, does all this have to do with MTBF? MTBF is the <------- Note!
inverse of the failure rate in the constant failure rate phase.
Nothing more and nothing less. The units of MTBF are (or, should be)
units of "thing-time" per failure; e.g. machine-hours per failure or
system-years per failure but the "thing" part and the "per failure"
part are almost always omitted to enhance the mystique and confusion
and to make MTBF appear to have the units of "time" which it doesn't.
We will bow to the convention of speaking of MTBF in hours or years --
but we all know what we really mean.
What does MTBF have to do with lifetime? Nothing at all! It is not
at all unusual for things to have MTBF's which significantly exceed
their lifetime as defined by wearout -- in fact, you know many such
things. A "thirty-something" American (well within his constant
failure rate phase) has a failure (death) rate of about 1.1 deaths per
1000 person-years and, therefore, has an MTBF of about 900 years (of
course it's really about 900 person-years per death). Even the best
ones, however,
wear out long before that.
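
The arithmetic in that paragraph can be checked with a few lines of
Python. This is only a sketch: the function name is made up for
illustration, and it assumes the constant-failure-rate phase, which is
the only place the "MTBF = 1 / failure rate" relation is meant to hold.

```python
def mtbf_from_rate(failures, thing_time):
    """MTBF is thing-time per failure, i.e. the inverse of the failure rate.

    Only meaningful in the flat (constant failure rate) part of the bathtub.
    """
    return thing_time / failures


# 1.1 deaths per 1000 person-years, the figure quoted in the FAQ
mtbf_person_years = mtbf_from_rate(1.1, 1000.0)
print(round(mtbf_person_years))  # 909, i.e. "about 900" person-years per death
```

Note that nothing in this calculation knows about wearout, which is
exactly the point being made.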
This example points out one other important characteristic of MTBF --
it is an ensemble characteristic which applies to populations (i.e.
"lots") of things; not a sample characteristic which applies to one
specific thing. In the good old days when failure rates were
relatively high (and, therefore, MTBF relatively low) this
characteristic of MTBF was a curiosity which created lively (?) debate
at conventions of reliability specialists (them) but otherwise didn't
unduly bother right-thinking people (us). Things, however, have
changed. For many systems of interest today the required failure
rates are so low that the MTBF substantially exceeds the lifetime
(obviously nature had this right a long time ago). In these cases
MTBF's are not only "not necessarily" sample characteristics, but are
"necessarily not" sample characteristics. In the terms of the
reliability cognoscenti, failure processes are not ergodic (i.e. you
can't blithely trade population statistics for time statistics). The
key implication of this essential characteristic of MTBF is that it
can only be determined from populations and it should only be applied
to populations.
MTBF is, therefore, an excellent characteristic for determining how
many spare hard drives are needed to support 1000 PC's, but a poor
characteristic for guiding you on when you should change your hard
drive to avoid a crash.
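
As a rough illustration of the spares calculation, here is a sketch in
Python. The fleet size matches the 1000 PCs above, but the 500,000-hour
MTBF, the 8760 power-on hours per year, and the 95% coverage target are
made-up inputs, and the function names are hypothetical. It assumes
failures arrive at a constant rate (Poisson), which is the same
flat-bathtub assumption behind MTBF itself.

```python
import math


def expected_failures(units, hours_each, mtbf_hours):
    """Mean number of failures across the population in the period."""
    return units * hours_each / mtbf_hours


def poisson_cdf(k, mean):
    """Probability of seeing k or fewer failures when `mean` are expected."""
    term = math.exp(-mean)
    total = term
    for i in range(1, k + 1):
        term *= mean / i
        total += term
    return total


def spares_needed(mean, coverage):
    """Smallest spare count that covers the period with probability >= coverage."""
    spares = 0
    while poisson_cdf(spares, mean) < coverage:
        spares += 1
    return spares


# Assumed inputs, for illustration only
mean = expected_failures(units=1000, hours_each=8760, mtbf_hours=500_000)
print(mean)                       # expected drive failures per year
print(spares_needed(mean, 0.95))  # spares to stock for 95% coverage
```

This says how many drives the population will probably eat in a year.
It says nothing about which drive, or when.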
MTBF's are best determined from large populations. How large? From
every point of view (theoretical, practical, statistical) but cost,
the answer is "the larger, the better". There are, however, well
established techniques for planning and conducting test programs to
develop specified levels of confidence in a thing's MTBF.
Establishing an MTBF at the 80% confidence level, for example, is
clearly better, but much more difficult and expensive, than doing it
at a 60% confidence level. As an example, a test designed to
demonstrate a thing's MTBF at the 80% confidence level, requires a
total thing-time of 160% of the MTBF if it can be conducted with no
failures. You don't want to know how much thing-time is required to
achieve reasonable confidence levels if any failures occur during the
test.
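
The 160% figure can be reproduced with a short Python sketch. It
assumes the usual constant-failure-rate (Poisson) model for a
time-terminated test: the required thing-time is the value at which a
thing with exactly the claimed MTBF would show no more than the allowed
number of failures with probability 1 - confidence. The function names
are hypothetical, and this is one common way to set up the calculation,
not necessarily the exact procedure the author had in mind.

```python
import math


def prob_at_most(failures, mean):
    """Poisson probability of `failures` or fewer when `mean` are expected."""
    term = math.exp(-mean)
    total = term
    for i in range(1, failures + 1):
        term *= mean / i
        total += term
    return total


def required_time_multiple(confidence, failures_allowed=0):
    """Total test thing-time, as a multiple of the MTBF being demonstrated."""
    target = 1.0 - confidence
    lo, hi = 0.0, 1.0
    while prob_at_most(failures_allowed, hi) > target:
        hi *= 2.0
    for _ in range(100):  # plain bisection on the expected failure count
        mid = (lo + hi) / 2.0
        if prob_at_most(failures_allowed, mid) > target:
            lo = mid
        else:
            hi = mid
    return hi


print(round(required_time_multiple(0.80), 2))  # 1.61, the "160%" above
print(required_time_multiple(0.80, failures_allowed=1))
```

With zero failures this reduces to -ln(1 - confidence). Allowing even
one failure pushes the required thing-time up considerably, which is
the "you don't want to know" part.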
What, by the way, is "thing-time"? An important subtlety is that
"thing-time" isn't "clock time" (unless, of course, your thing is a
clock). The question of how to compute "thing-time" is a critical one
in reliability engineering. For some things (e.g. living things) time
always counts but for others the passage of "thing-time" may be highly
dependent upon the state of the thing. Various ad hoc time
corrections (such as "power on hours" (POH)) have been used, primarily
in the electronics area. There is significant evidence that, in the
mechanical area, "thing-time" is much more related to activity rate
than it is to clock time. Measures such as "Mean Cycles Between
Failures (MCBF)" are becoming accepted as more accurate ways to assess
the "duty cycle effect". Well-founded, if heuristic, techniques have
been developed for combining MCBF and MTBF effects for systems in
which the average activity rate is known.
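
The FAQ doesn't say which techniques it means. One simple possibility,
shown here purely as an assumption, is to treat the time-driven and
cycle-driven failures as independent and add their rates once the
average activity rate is known. The function name and all the numbers
below are invented for illustration.

```python
def combined_mtbf_hours(mtbf_hours, mcbf_cycles, cycles_per_hour):
    """Effective MTBF when time-based and cycle-based failure rates add.

    Assumes the two mechanisms are independent and both are in their
    constant-failure-rate phase (an assumption, not the FAQ's stated method).
    """
    time_rate = 1.0 / mtbf_hours              # failures per hour from clock time
    cycle_rate = cycles_per_hour / mcbf_cycles  # failures per hour from activity
    return 1.0 / (time_rate + cycle_rate)


# Hypothetical mechanism: 200,000 h MTBF, 1,000,000 cycle MCBF, 10 cycles/hour
print(round(combined_mtbf_hours(200_000, 1_000_000, 10)))  # 66667
```

In this made-up case the duty cycle, not the clock, dominates the
result, which is the "duty cycle effect" the paragraph describes.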
MTBF need not, then, be "Mysterious Time Between Failures" or
"Misleading Time Between Failures", but an important system
characteristic which can help to quantify the suitability of a system
for a potential application. While rising demands on system integrity
may make this characteristic seem "unnatural", remember you live in a
country of 250 million 9-million-hour MTBF people!
Kevin C. Daly
President
ATL Products
*******
If you need a nice graph to illustrate the above FAQ, look at
the picture here:
http://www.quanterion.com/FAQ/Bathtub_Curve.htm
And as mentioned in the above article, a typical use of MTBF
is computing how many spares to buy. We used to work out the
MTBF for our products, and if a customer asked, we could tell
them "you should stock 5% more of this PCB, due to its
computed MTBF". Those 5% of units would sit in the repair room
storage cabinet, ready to be inserted into a system to fix it.
Having them sitting in the storage cabinet means a system
can be run non-stop (the items in the storage cabinet can be
inserted into the system while the power is on - hot insertable).
The prediction was important when some of the circuits involved
cost $100,000 apiece. A customer would be rightly pissed if
the estimate was way off, in either direction. So that is what
you're supposed to use an MTBF for. A customer's decision to buy
a system could well be influenced by the added cost of the spares
the customer is expected to stock.
HTH,
Paul