MTBF: A bunch of bs

John

MTBF of 1.5 million hours is a bit unrealistic, isn't it?

Are there any surveys of data cabinets that give realistic failure rates?
For example, I can say that three drives out of four hundred in our data
racks failed in the past five years. Five power supplies failed.
 
John said:
MTBF of 1.5 million hours is a bit unrealistic, isn't it?

Are there any surveys of data cabinets that give realistic failure rates?
For example, I can say that three drives out of four hundred in our data
racks failed in the past five years. Five power supplies failed.

Your example is 2.2 million hours. What is the issue?
 
John said:
MTBF of 1.5 million hours is a bit unrealistic, isn't it?

And why should that be the case?

John said:
Are there any surveys of data cabinets that give realistic failure
rates? For example, I can say that three drives out of four hundred in
our data racks failed in the past five years. Five power supplies
failed.

Assuming that none of the drives was past its "component life"
(after which the MTBF does not hold anymore) and that they were
all operated under the standard conditions used to measure the
MTBF, that gives you 5.8 million hours MTBF for the drives. I
assume they are actually operated under better conditions than
the MTBF requires, e.g. better cooling.
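
A minimal sketch of that arithmetic in Python (assuming all 400 drives ran
continuously, at roughly 8,766 hours per year):

    # Observed fleet MTBF: total powered-on hours divided by failures.
    # Assumes continuous operation within the component life.
    HOURS_PER_YEAR = 8766                 # 365.25 days * 24 h
    drives, years, failures = 400, 5, 3   # figures from the post above

    total_hours = drives * years * HOURS_PER_YEAR      # ~17.5 million hours
    mtbf_hours = total_hours / failures
    print(f"Observed drive MTBF: {mtbf_hours / 1e6:.1f} million hours")  # ~5.8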

For the PSUs I cannot give you a number, since you do not say how many
there are. But you should know that a) PSUs typically have a 2-year
component life, not 5 years, and b) their MTBF is often specified at
<70% load.

Arno
 
craigm said:
Your example is 2.2 million hours. What is the issue?

MTBF is the batting average of computers. See: Tony Gwynn.

Anyway, are there better statistics compiled? A non-brochure format
possibly?
 
John said:
MTBF is the batting average of computers. See: Tony Gwynn.

Anyway, are there better statistics compiled?

Planning to buy a five-year-old drive, are you?
 
Arno said:
And why should that be the case?

Because he didn't consider the "M" part of MTBF :)

"Mean" here means "average of large numbers". So for the MTBF to have any
real meaning, you need to have large numbers of hours (within the life
time, as Arno says).

Imagine that 1.5M h is an average. To be of any significance, you need to
have much more than that, say 15M h (within the life time, say 5 y). 5 y
are about 44k h. This means that the MTBF starts to become helpful as
reliability data when you have some 350 units.

You have 400 drives, which gives you some 17.5M h in 5 y. With the 3
failures you had, this gives you an MTBF for your drives of 5.8M h.
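
A rough sketch of those numbers; reading "much more than that, say 15M h"
as a 10x-the-MTBF target is my assumption:

    HOURS_PER_YEAR = 8766
    claimed_mtbf = 1.5e6                  # the 1.5M h figure
    hours_per_unit = 5 * HOURS_PER_YEAR   # ~44k h within a 5-year life

    # Units needed for the fleet to accumulate ~10x the MTBF in hours
    target_hours = 10 * claimed_mtbf      # "say 15M h"
    print(round(target_hours / hours_per_unit))   # ~342, i.e. "some 350 units"

    # 400 drives running for 5 years
    print(round(400 * hours_per_unit / 1e6, 1))   # ~17.5 (million hours)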

When talking about cabinets, you need to consider the number (and of course
brand and model) of drives and the power supply. But the 1.5M h for a data
cabinet is not exactly incompatible with your 5.8M h for a single drive.

For me, with my 8 or so drives here under vastly varying conditions,
knowing that the MTBF is 5.8M h or 1.5M h or whatever is not that helpful --
it just tells me that all of the drives may fail sooner or later :)

Gerhard
 
John said:
MTBF is the batting average of computers. See: Tony Gwynn.

Anyway, are there better statistics compiled? A non-brochure format
possibly?

It is not easy to determine good numbers in the first place.
After all, you cannot run, say, 100 of these cabinets for 5 years
to get a good measurement. HDD manufacturers manage relatively
exact numbers because they have a lot of experience with their
designs.

For PSUs, it depends on what you want. If you want to minimise
downtime, go for redundant PSUs that are hot-swappable. If you want
to get the best value for your money, you need luck or intuition.

Arno
 
Gerhard said:
Because he didn't consider the "M" part of MTBF :)

"Mean" here means "average of large numbers". So for the MTBF to have any
real meaning, you need to have large numbers of hours (within the life
time, as Arno says).

Imagine that 1.5M h is an average. To be of any significance, you need to
have much more than that, say 15M h (within the life time, say 5 y). 5 y
are about 44k h. This means that the MTBF starts to become helpful as
reliability data when you have some 350 units.

Not exactly. MTBF is as important for a single drive as for a thousand
drives. It does not change the fact that every single one of them might fail
with the same probability.

Number of drives matters only when you want to measure MTBF. Not when you
want to use it.

For a small number of drives or a short operation period you can calculate a
probability of failure (if you have the MTBF) instead of a number of failures.
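
For example, under the usual assumption behind MTBF figures (a constant
failure rate, i.e. an exponential lifetime model), a minimal sketch:

    import math

    def failure_probability(mtbf_hours, period_hours):
        # P(failure within the period) for a constant failure rate of 1/MTBF
        return 1.0 - math.exp(-period_hours / mtbf_hours)

    HOURS_PER_YEAR = 8766
    p = failure_probability(1.5e6, 5 * HOURS_PER_YEAR)
    print(f"P(failure within 5 years) = {p:.1%}")   # roughly 3%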
 
Peter said:
Not exactly. MTBF is as important for a single drive as for a thousand
drives. It does not change the fact that every single one of them might
fail with the same probability.

Number of drives matters only when you want to measure MTBF. Not when
you want to use it.

For a small number of drives or a short operation period you can
calculate a probability of failure (if you have the MTBF) instead of a
number of failures.

This is true, but it still doesn't help me. The only thing it tells me is
that the drives may fail sooner or later -- for which I don't need the MTBF
:)

There's no way around it: probability only makes sense with big numbers.
Let's say you have a probability of 1% for a drive failure over 5 years
(equivalent to 4.4M h MTBF). With 8 drives, you know that it's more likely
that none will fail during that period, but still one or more may fail. Not
very helpful... With 1000 drives, you know that about 10 failures is a
likely result. Much more helpful, relatively speaking.
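
A small sketch of that comparison, assuming independent failures and the 1%
per-drive figure from the example:

    p = 0.01   # per-drive probability of failure over 5 years

    # 8 drives: more likely that none fails than that one or more do
    p_none = (1 - p) ** 8
    print(f"8 drives: P(no failure)  = {p_none:.1%}")       # ~92%
    print(f"8 drives: P(>=1 failure) = {1 - p_none:.1%}")   # ~8%

    # 1000 drives: about 10 failures is the likely result
    print(f"1000 drives: expected failures = {1000 * p:.0f}")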

Of course, in the /long run/ I can make use of the numbers with fewer
drives. But that only replaces the high number of hours achieved
concurrently with many drives by a high number of hours achieved by using
many drives consecutively.

Gerhard
 
Peter said:
Not exactly. MTBF is as important for a single drive as for a thousand
drives. It does not change the fact that every single one of them might
fail with the same probability.

Number of drives matters only when you want to measure MTBF. Not when you
want to use it.

For a small number of drives or a short operation period you can calculate
a probability of failure (if you have the MTBF) instead of a number of
failures.

Fine, having calculated that probability, tell me how you would actually
apply that information, in a site with a single drive.
 
Gerhard said:
This is true, but it still doesn't help me. The only thing it tells me is
that the drives may fail sooner or later -- for which I don't need the MTBF
:)

Maybe it doesn't help YOU.
If I had to choose between otherwise identical drives, one with a 300,000 h
MTBF, the other with a 1,500,000 h MTBF, I would pick the second one. Even if
that were for a single hard disk in my PC.

Gerhard said:
There's no way around it: probability only makes sense with big numbers.
Let's say you have a probability of 1% for a drive failure over 5 years
(equivalent to 4.4M h MTBF). With 8 drives, you know that it's more likely
that none will fail during that period, but still one or more may fail. Not
very helpful... With 1000 drives, you know that about 10 failures is a
likely result. Much more helpful, relatively speaking.

Of course, in the /long run/ I can make use of the numbers with fewer
drives. But that only replaces the high number of hours achieved
concurrently with many drives by a high number of hours achieved by using
many drives consecutively.

It seems that you have a different understanding of probability.
 
Peter said:

Which just tells how to calculate the probability, and tells you nothing
whatsoever about the practical utility of that information.

So, again, how would you actually apply it? What actions would you take as
a result of having that information that you would not take if all that you
knew was that there was a small chance that your one drive would fail
without knowing the number to assign to that chance?
 
Peter said:
Maybe it doesn't help YOU.
If I had to choose between otherwise identical drives, one with a 300,000 h
MTBF, the other with a 1,500,000 h MTBF, I would pick the second one. Even if
that were for a single hard disk in my PC.

Right. I expressed myself incorrectly. Of course it may help with the
purchase decision, but how does it help you while you are actually using the
drive? (It definitely helps the person in charge of 1000 drives, for
example for the decision about how many drives to have in stock for
replacements.)

Peter said:
It seems that you have a different understanding of probability.

Different from what other understanding?

Gerhard
 
Which just tells how to calculate the probability, and tells you nothing
whatsoever about the practical utility of that information.

So, again, how would you actually apply it? What actions would you take as
a result of having that information that you would not take if all that you
knew was that there was a small chance that your one drive would fail
without knowing the number to assign to that chance?

So you think they publish reliability information just to satisfy academic
scientists?

Every component failure potentially creates a need to repair the failed
service (which relies on that component). There are various costs associated
with failure events, like the cost of the repair itself and lost revenue as a
result of the failure or downtime. How those costs are calculated depends on
the particular case. They can vary from a very small to a huge sum. Now you
can apply the calculated probability to those costs and come up with an
average (expected) cost related to failures. Compare that with the component
price, and make a purchase decision.
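
A minimal sketch of that kind of comparison; every number below (prices,
failure probabilities, repair and downtime costs) is a made-up illustration,
not data from the thread:

    # Expected 5-year cost = purchase price + P(failure) * cost per failure event
    def expected_cost(price, p_fail, repair_cost, downtime_cost):
        return price + p_fail * (repair_cost + downtime_cost)

    # Hypothetical drives: cheaper/lower-MTBF vs. pricier/higher-MTBF
    cheap = expected_cost(price=80,  p_fail=0.13, repair_cost=200, downtime_cost=400)
    good  = expected_cost(price=110, p_fail=0.03, repair_cost=200, downtime_cost=400)
    print(f"cheap drive: expected 5-year cost = {cheap:.0f}")   # 158
    print(f"good drive:  expected 5-year cost = {good:.0f}")    # 128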
 
Gerhard said:
Right. I expressed myself incorrectly. Of course it may help with the
purchase decision, but how does it help you while you are actually using the
drive? (It definitely helps the person in charge of 1000 drives, for
example for the decision about how many drives to have in stock for
replacements.)

Once you have it, it helps you figure out how much maintenance for it will
cost you.

Gerhard said:
Different from what other understanding?

That the probability of a failure applies just as much to a single component
as to a huge number of components.
 
Peter said:
So you think they publish reliability information just to satisfy academic
scientists?

Want me to hold your coat while you thrash that straw man?

Peter said:
Every component failure potentially creates a need to repair the failed
service (which relies on that component). There are various costs associated
with failure events, like the cost of the repair itself and lost revenue as a
result of the failure or downtime. How those costs are calculated depends on
the particular case. They can vary from a very small to a huge sum. Now you
can apply the calculated probability to those costs and come up with an
average (expected) cost related to failures. Compare that with the component
price, and make a purchase decision.

So give us a demonstration of this calculation for the case of a _single_
drive in a _single_ machine.
 
Which just tells how to calculate the probability, and tells you nothing
whatsoever about the practical utility of that information.

Want me to hold your coat while you thrash that straw man?

So give us a demonstration of this calculation for the case of a _single_
drive in a _single_ machine.

In that case DIY.
 