Curious George
On Tue, 30 Nov 2004 17:42:03 +0100, "Joris Dobbelsteen" wrote:
> What management? Just install the array and you are done. It works just
> like a normal disk (except for setting up the array once).
One of the main points of RAID is to be proactive about failures &
uptime. So I'm talking about things like being able to see SMART/PFA
status and have the array move to a hot spare when SMART predicts a
failure. Providing notification about such an event (SNMP traps, email,
pages, pop-ups). Also initiating a process to validate the data against
the ECC data & check the media surface. Being able to reconfigure,
upgrade, or pull info about the array without taking the whole machine
down. Being able to automate & schedule these upgrades or validation
checks for a more convenient time (so there is no discernible
performance hit). All this may seem like overkill, but it really isn't
if you want the full benefits of non-zero RAID. I don't see RAID 0 as a
viable choice for most scenarios. I also assume that if you are looking
at firmware RAID (even ROMB, etc.) you have higher expectations than a
quick-and-dirty OS software striped set.
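Just to sketch the kind of thing I mean - a minimal example, assuming
smartmontools is installed; the device names and mail settings are made-up
placeholders, not tied to any particular controller:

#!/usr/bin/env python3
"""Minimal sketch of the proactive monitoring described above: poll SMART
health and send a notification when a drive reports a problem. Assumes
smartmontools is installed; the device names and mail settings are made-up
placeholders, not tied to any particular controller."""
import smtplib
import subprocess
from email.message import EmailMessage

DEVICES = ["/dev/sda", "/dev/sdb"]   # hypothetical array members
ADMIN = "admin@example.com"          # hypothetical notification address

def smart_ok(device: str) -> bool:
    """True if 'smartctl -H' reports the overall health check passed
    (ATA drives print 'PASSED'; SCSI drives report 'OK')."""
    out = subprocess.run(["smartctl", "-H", device],
                         capture_output=True, text=True).stdout
    return "PASSED" in out or "OK" in out

def notify(device: str) -> None:
    """E-mail a warning; SNMP traps, pages or pop-ups could be wired in
    the same place."""
    msg = EmailMessage()
    msg["Subject"] = f"SMART warning on {device}"
    msg["From"] = msg["To"] = ADMIN
    msg.set_content(f"{device} failed its SMART health check; "
                    "consider failing over to the hot spare now.")
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(msg)

if __name__ == "__main__":
    for dev in DEVICES:
        if not smart_ok(dev):
            notify(dev)

Run from cron, that's the sort of unattended monitoring a decent controller
or management suite gives you out of the box.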
> With some controllers you might get in trouble when you use different
> disks, so use the same brand AND model.
AND the same model revision.
That's all more of an issue with ATA RAID for a number of reasons, esp.
not being as configurable as SCSI drives, as well as firmware
limitations.
> Recovery: RAID1: turn off the system, remove the defective drive and
> replace it. Turn on, repair the array, wait a few seconds for the disk
> copy and done.
Well, that's fine for the ideal and simplest of scenarios - it's not the
only one.
> Today's disks are capable of relocating damaged sectors,
Yes, but AFAIK not always automatically, or at least not always at a
strictly hardware/low level. Also it seems to me the ideal is to have a
RAID controller initiate & manage a continuous background scan of the
media and restore data from bad sectors using the redundant data
residing on good sectors, rather than trying to recover/read from weak
sectors. (Better ATA & SCSI controllers do this.)
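For a concrete (software-RAID) stand-in for that kind of controller-managed
patrol read, Linux md exposes exactly this through sysfs; the sketch below
uses a hypothetical array name and could be kicked off from cron at a quiet
hour:

#!/usr/bin/env python3
"""A software-RAID stand-in for a controller-managed background media scan:
Linux md starts a 'patrol read' when you write to sysfs. 'check' reads every
sector and compares it against the redundant copy; 'repair' additionally
rewrites blocks that disagree using the good data. The array name is a
placeholder and root privileges are required."""
from pathlib import Path

ARRAY = "md0"                          # hypothetical md array name
MD = Path(f"/sys/block/{ARRAY}/md")

def start_scrub(repair: bool = False) -> None:
    # Kick off the background scan; md throttles it, so the performance
    # hit stays small.
    (MD / "sync_action").write_text("repair" if repair else "check")

def mismatch_count() -> int:
    # Blocks the last check/repair pass found inconsistent.
    return int((MD / "mismatch_cnt").read_text())

if __name__ == "__main__":
    print("mismatches found by the previous scrub:", mismatch_count())
    start_scrub()                      # e.g. run weekly at a quiet hour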
> they do it all (same reason your 128MB USB drive/memory stick only has
> 120 MB storage capacity).
I don't know much about memory stick architecture. It seems to me this
is more a result of file system overhead and perhaps conflicting
measures of raw capacity (decimal vs. binary megabytes, like it usually
is with storage) rather than, say, "reserved space" for some low-level
hardware-implemented automatic recovery process.
Please educate me if wrong.
> Simply call it resource contention. For a single-user system the Raptor
> will handle the resource contention better than the SCSI system.
> Of course this is subject to the opinion expressed by a third party, who
> may reasonably be expected to have sufficient knowledge of the system to
> provide such an 'opinion'.
Could you direct us to this resource? AFAIK SATA has the SCSI protocol
on top of ATA - so it has greater overhead. If there is one disk there
isn't really any contention for that resource in either case. Because
SATA is point-to-point there is never drive arbitration per bus, but in
multi-disk SATA the overall efficiency depends more on the details of
the controller(s) than on the point-to-point design per se.
Educate me if I'm wrong.
> Usually response times, throughput and storage capacity require a
> trade-off.
> My trade-off would favor storage capacity over throughput over response
> times.
Then I would expect you to favor the larger 7200 rpm SATA drives
marketed as "Personal Storage" devices (some are quite good). The
"enterprise" Raptors are similar to current 10k SCSI in multiple
regards including price, capacity, & performance (& theoretical MTBF).
AFAIK the price "advantage" of Raptors has more to do with a comparison
against the "need" for a "deluxe" retail boxed Adaptec controller - not
really capacity per disk or $/MB (from what I've seen).
Throwing ATA RAID 0 into the mix (as you first suggested), the trade-off
variables have to be expanded to include complexity, cost &, to some
extent, reliability.
> Indeed, but if you want luxury, you are (or someone else is, if you are
> lucky) going to pay for it anyway. It's just a question of how much you
> are willing to spend on your luxury.
> However, for the same luxury (or even the same essential product that you
> simply need) there is a large variation in the prices you can pay.
True. But often things seem to be a luxury simply because of sticker
shock. In some cases, when time is very valuable, products that bring
even small increases in productivity or assurance of quality can bring
real value despite the initial sticker shock. This is all really
relative though & must be taken case by case.
> Let's assume most chip manufacturers (NOT designers; there are only a few
> manufacturers) are equally capable of making the same quality product.
> Besides, the mechanical parts are more likely to fail than the electrical
> ones.
> A very hot CPU would last for 10 years (it's designed for it anyway). I
> expect the same for chipsets.
> I have only seen electronics fail because of ESD, lightning storms and
> some chemicals (e.g. from batteries).
> I wouldn't consider the controller to be a major problem with disk
> subsystems.
Outright and complete failure is one thing. Erratic behavior due to
poor design, overheating, damage & imminent failure is another. Yes,
mechanical devices don't last as long as ICs, but I think there is a
bit more to reliability than time to total failure.
Frankly I'm not that worried about an ATA RAID controller dying
prematurely, or before a SCSI HBA. I'm more concerned about a low-end
controller having limitations which interfere with the ability of the
RAID to reliably deliver on its core features/promises, or conflicts or
poorly written code which waste the user's/administrator's time & eat
away at the assumed cost savings.
> When using 2-disk RAID 1 (NOT RAID 0): when one disk fails the system
> continues to operate correctly, leaving you time to replace the defective
> material with no loss of continuity.
> One-disk systems will stop working when the disk fails.
Yes, but the array MTBF calculation characterizes arrays as more
complex than a single drive, with more potential points of failure.
Yes, when one disk drops off the other keeps going, but there is more
to it than that.
1. if both disks are the same age with the same wear they might die at
similar times, so you might not have as much time as you think to get
the replacement. Failure rates follow a "U" (bathtub) pattern; they are
not linear across time.
2. not all failures are neat and tidy
Either of these not-so-uncommon cases is a real potential
time/productivity waster which can invalidate the expected benefits of
non-zero RAID.
> Besides, recovery times for RAID1 are probably lower than for one-disk
> systems.
I don't understand. You mean rebuilding the data to a new disk? Then
not "probably" - "definitely," because the process is supposed to be
seamless, as opposed to backup file recovery & bare metal restore &
redoing the work since the last backup. The minor performance hit and
short time it takes to rebuild in the background is hardly worth even
trying to compare or consider (unless you are really concerned about
power failure and UPS runtime).
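To put rough numbers on that - purely illustrative, the capacity and
transfer rates below are assumptions, not measurements:

#!/usr/bin/env python3
"""Rough comparison behind the point above: an in-place RAID1 rebuild is one
streaming disk copy while the system stays online, whereas a restore also
means reinstalling and redoing work since the last backup. The capacity and
transfer rates are illustrative assumptions for a 2004-era drive."""

CAPACITY_GB = 74      # hypothetical Raptor-class mirror member
REBUILD_MBPS = 50     # assumed sustained copy rate during background rebuild
RESTORE_MBPS = 12     # assumed rate pulling a backup over 100 Mbit Ethernet

rebuild_min = CAPACITY_GB * 1024 / REBUILD_MBPS / 60
restore_min = CAPACITY_GB * 1024 / RESTORE_MBPS / 60

print(f"background rebuild:  ~{rebuild_min:.0f} min, array degraded but online")
print(f"restore from backup: ~{restore_min:.0f} min, plus reinstall and redoing "
      "work since the last backup")

Even with generous assumptions for the restore path, the rebuild wins, and
the machine stays usable while it runs.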
> Basically, under normal operations the system will continue to work
> without failing once.
But that's a pretty crude comparison of reliability. That assertion
also depends on a lot of things.
Also you can't really say:
1. both are equally reliable because both are reliable enough to
usually work during a normal service life.
and at the same time say:
2. "two RAID1 Raptors have equal costs to a single Cheetah 15K.3 and
a much better MTBF (theoretically)"
without the two statements either being contradictory or virtually
valueless. Certainly I don't yet see the second statement as being
proven, explained, or correct.
The MTBF calculation I cited highlights the added complexity and
potential points of failure that RAID brings, and that is normally
interpreted as an array being "theoretically less reliable" than a
single disk.
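The two views can be put side by side with a back-of-the-envelope
calculation (the per-disk MTBF and rebuild window below are illustrative
assumptions, not any vendor's specs):

#!/usr/bin/env python3
"""Back-of-the-envelope numbers for the two ways of reading 'array MTBF'.
Treating a 2-disk mirror as a series system (any component failure counts)
gives a lower MTBF than one disk; counting only data loss (second disk dying
inside the rebuild window, the classic MTTDL approximation) gives a much
higher figure. The per-disk MTBF and rebuild window are illustrative
assumptions, not any vendor's specs."""

DISK_MTBF_H = 1_200_000   # assumed per-disk MTBF, hours
REBUILD_H = 8             # assumed window to replace the disk and rebuild

# Any-failure view: two disks in series -> half the MTBF of a single disk.
mirror_any_failure = 1 / (2 / DISK_MTBF_H)

# Data-loss view for a 2-disk mirror: MTTDL ~ MTBF^2 / (2 * rebuild time).
mirror_data_loss = DISK_MTBF_H ** 2 / (2 * REBUILD_H)

print(f"single disk MTBF:           {DISK_MTBF_H:>15,.0f} h")
print(f"mirror, any disk failing:   {mirror_any_failure:>15,.0f} h")
print(f"mirror, data actually lost: {mirror_data_loss:>15,.0f} h")

Which figure is the relevant one depends on whether "a disk needed
replacing" or only "data was actually lost" counts as a failure - which is
exactly why the blanket "much better MTBF" claim needs spelling out.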
That being said, a properly implemented non-zero RAID "should" yield
"more reliable storage," but that has more to do with doing the work
and making the investment to "dot your i's" and "cross your t's" than
any theoretical calculation or characterization of _all_ or _any_
non-zero RAID. It's very easy to botch a RAID implementation and end
up with storage that is more expensive, more work, and less reliable
than normal disks. Storage & technical discussion groups regularly
have "help, I didn't do a backup and I can't get my ultra-cheap RAID
back online" posts. Once in a while you also see "Boy! I just found
out the hard way that most RAID 5s are susceptible to transient write
errors" posts. These users were not 100% protected just because they
got disks to work together and generate ECC data.
> You will probably have more problems with software than you will have
> with hardware. Most down-time is either human or software related, not
> hardware.
Yes. However, "minor" hardware errors with "working" devices are not
so uncommon. When you don't invest enough time to really scientifically
dissect & troubleshoot these issues, they appear as software problems
when they are not, or are simply left unsolved & forgotten about, or
ignored because you were lucky they didn't affect anything that
important.
> The issue is that when it's hardware related, recovery costs much more
> time and you have a bigger risk of losing valuable data.
> When the disk starts to fail, it will probably be obsolete anyway,
Often, yes. My preoccupation with robust data integrity features in
RAID (here and elsewhere) has to do with transient errors and
failing-but-still-spinning media, or power failure, which just
shouldn't ever get the chance to crap on anything. If you have
non-zero RAID, something is very wrong if you _ever_ have to go to a
backup or "reinstall" or "rollback" or troubleshoot due to any kind of
storage HW issue. Without significant gains it's hard to justify the
additional expense, effort, or system complexity.
> unless your system lasts for more than 6 years. Of course, when you want
> this, you should rather prepare for the worst and have a 4-computer
> cluster installed with fail-over capability.
Well, if you are going to exceed the service life so dramatically, you
are not exactly "wearing a belt & suspenders" no matter how big the $$$
investment. All that redundancy is good for ensuring uptime for a
"normal period," but it is not necessarily a great tool for trying to
drain the last drop of blood out of antiquated and worn-out HW, because
of the overhead and expense.
> Assuming it's for luxury: I have here a system that has been in operation
> for 5 years already and is subject to frequent transports and some very
> disk-intensive work at times, and it has never let me down due to a
> hardware failure (the normal minor stuff because I forgot some cables or
> didn't attach them well enough aside). All the products I used were the
> cheapest compared to competitors; however, some trades were made between
> brands when I thought that for only a very small difference I could get
> something I expected to be more reliable or better.
<acerbic comment>
A 5-year-old machine is hardly a "luxury."
</acerbic comment>
I have a few machines like that, and some nearly twice as old, still
running and in service with mostly original parts (now doing very
limited tasks, of course). It's exactly those machines that impressed
upon me some time ago that just because it's "up" and "seems OK"
doesn't necessarily mean you can really depend on it 100%. Also, time
savings and confidence in HW really go a long way and greatly offset
some "sticker shock" expenses, or at least regular
upgrades/decommissions. I've been finding it MUCH cheaper to replace
these "working," "capable" machines than to try to continue to plug
along with "old" or "cheap" HW, for a variety of reasons including
reliability.
Of course I may just be too picky and fortunate.
