RAID Controller Failover

Steve Holly · Nov 7, 2003

I'm an IT admin starting to look at building a SAN for my company and I'm
curious if anyone out there can explain how some of the popular RAID vendors
(e.g. EMC, Chaparral, Infortrend) handle failover (resuming I/O with another
RAID controller after one has failed)? I'm mostly interested in failover on
the storage side (as opposed to the host side).

Specifically, I'd like to know whether failover is generally accomplished
by a surviving controller taking over the failed controller's (or failed
port's) AL_PA(s), or whether surviving controllers actually alias the failed
controller's WWNs.

Or is this something that's generally handled at the switch level?

I'm trying to better understand how failover is accomplished transparently
to the host. Many thanks for any input regarding this.

canotto · Nov 8, 2003

I'm not an IT manager, but a RAID 5 solution is IMHO the best!

Jake Roersma · Nov 8, 2003

Steve Holly said:
I'm an IT admin starting to look at building a SAN for my company and I'm
curious if anyone out there can explain how some of the popular RAID vendors
(e.g. EMC, Chaparral, Infortrend) handle failover (resuming I/O with another
RAID controller after one has failed)? I'm mostly interested in failover on
the storage side (as opposed to the host side).

Specifically, I'd like to know whether failover is generally accomplished
by a surviving controller taking over the failed controller's (or failed
port's) AL_PA(s), or whether surviving controllers actually alias the failed
controller's WWNs.

I'm assuming that you are talking about failover within the same
storage unit, and not between two physical units. I'm not too familiar with
how EMC does it, but most vendors (I'm sure not all; everyone does things
differently) will present the controllers with one WWN, and the
failover is completely transparent to the host.

Some vendors will also ship the controllers with separate WWNs, which
relies on the host to fail over. This is handled much like a lost
disk/path failover, where each controller is its own path to the same
disk. When that path is lost (or the controller dies), the software
(LVM or vendor software) fails the I/O over after a certain amount of
time.
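
To make the host-side case concrete, here is a minimal sketch of the idea, not any vendor's actual multipath driver. The class names, the WWN strings, and the retry count standing in for the "certain amount of time" are all assumptions made up for illustration.

```python
class Path:
    """One route to the disk, named by a controller port WWN (hypothetical value)."""

    def __init__(self, wwn):
        self.wwn = wwn
        self.alive = True  # real state of the controller/link behind this path

    def submit(self, block):
        if not self.alive:
            raise IOError(f"path {self.wwn} is down")
        return f"block {block} via {self.wwn}"


class MultipathDevice:
    """Host-side view of one disk reachable through several controllers."""

    def __init__(self, paths, max_retries=3):
        self.paths = list(paths)
        self.max_retries = max_retries  # assumption: stands in for a timeout
        self.active = 0                 # index of the path currently in use

    def write(self, block):
        while self.active < len(self.paths):
            path = self.paths[self.active]
            for _ in range(self.max_retries):
                try:
                    return path.submit(block)
                except IOError:
                    pass  # retry the same path until the budget is spent
            self.active += 1  # give up on this path and fail over to the next
        raise IOError("all paths to the disk are lost")


a = Path("wwn-controller-a")  # hypothetical WWNs
b = Path("wwn-controller-b")
dev = MultipathDevice([a, b])
print(dev.write(1))
a.alive = False               # controller A dies
print(dev.write(2))           # the host retries, then switches to controller B
```

The point of the sketch is that the array does nothing special here: the host notices the dead path and redirects I/O itself.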

If you are talking about a failover between two physical arrays (this
would only happen under very strange circumstances), then it will have to
be handled by another piece of software: possibly a high-availability
package, or LVM, for the case where the disk along with all of its paths
is lost. The software in this case is responsible for detecting a failure
and switching to the secondary disk.

I hope this helps.

- Jake

Maxim S. Shatskih · Nov 8, 2003

If you want to save money on disks - yes.
If the disk drive cost is negligible for you - then RAID1+0 is better by
far.

RAID4 and RAID5 are very slow on writes.
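
The cost side of this trade-off comes down to simple arithmetic: RAID 5 gives up one disk's worth of capacity to parity, while RAID 1+0 gives up half the disks to mirroring. A minimal sketch (the function name and level strings are made up for illustration):

```python
def usable_disks(level, n_disks):
    """Return how many disks' worth of capacity is usable for a given layout."""
    if level == "raid5":
        if n_disks < 3:
            raise ValueError("RAID 5 needs at least 3 disks")
        return n_disks - 1   # one disk's worth of space holds parity
    if level == "raid10":
        if n_disks < 4 or n_disks % 2:
            raise ValueError("RAID 1+0 needs an even number of disks, at least 4")
        return n_disks // 2  # every block is stored twice
    raise ValueError(f"unknown level: {level}")


for n in (4, 8, 16):
    print(n, usable_disks("raid5", n), usable_disks("raid10", n))
```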

Jake Roersma · Nov 8, 2003

Maxim said:
If you want to save money on disks - yes.
If the disk drive cost is negligible for you - then RAID1+0 is better by
far.

RAID4 and RAID5 are very slow on writes.

I've noticed that as the controllers get more advanced, the caching and
other algorithms they use minimize the actual write times that the host
system sees. My tests on recent HP equipment show RAID-1 and RAID-5 write
results within 1MB of each other. I have a hard time believing that anyone
would be driven away from RAID-5 due to performance factors on high-end
equipment. I have the bonnie++ stats if you'd like to see them.

- Jake

Zak · Nov 8, 2003

Jake said:
I've noticed that as the controllers get more advanced, the caching and
other algorithms they use minimize the actual write times that the host
system sees. My tests on recent HP equipment show RAID-1 and RAID-5 write
results within 1MB of each other. I have a hard time believing that anyone
would be driven away from RAID-5 due to performance factors on high-end
equipment. I have the bonnie++ stats if you'd like to see them.

Random small writes can still kill you. One write turns into
read-read-write-write. Latency doubles, throughput is a quarter.

Thomas
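
The read-read-write-write pattern is the RAID 5 read-modify-write cycle: to change one data block, the controller reads the old data and old parity, then writes the new data and new parity. A minimal sketch with integers standing in for blocks (the class and its counters are hypothetical, and a real controller with a write cache may hide much of this from the host):

```python
from functools import reduce


class Raid5Stripe:
    """One stripe: N data blocks plus one XOR parity block, with I/O counters."""

    def __init__(self, data_blocks):
        self.data = list(data_blocks)
        self.parity = reduce(lambda x, y: x ^ y, self.data)
        self.reads = 0
        self.writes = 0

    def small_write(self, index, new_value):
        # Phase 1: two reads (old data, old parity).
        old_data = self.data[index]
        self.reads += 1
        old_parity = self.parity
        self.reads += 1
        # XOR the old data out of the parity and the new data in.
        new_parity = old_parity ^ old_data ^ new_value
        # Phase 2: two writes (new data, new parity).
        self.data[index] = new_value
        self.writes += 1
        self.parity = new_parity
        self.writes += 1


stripe = Raid5Stripe([0x11, 0x22, 0x33])
stripe.small_write(1, 0x7F)
print(stripe.reads, stripe.writes)
```

Four disk I/Os for one logical write is where the "quarter" comes from, and the two dependent phases (reads, then writes) are where the doubled latency comes from.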

Mr. Grinch · Nov 18, 2003

Steve Holly said:
I'm an IT admin starting to look at building a SAN for my company and
I'm curious if anyone out there can explain how some of the popular
RAID vendors (e.g. EMC, Chaparral, Infortrend) handle failover
(resuming I/O with another RAID controller after one has failed)? I'm
mostly interested in failover on the storage side (as opposed to the
host side).

Specifically, I'd like to know whether failover is generally
accomplished by a surviving controller taking over the failed
controller's (or failed port's) AL_PA(s), or whether surviving
controllers actually alias the failed controller's WWNs.

Or is this something that's generally handled at the switch level?

I'm trying to better understand how failover is accomplished
transparently to the host. Many thanks for any input regarding this.

The EMC Symmetrix storage units have duplicates of everything: duplicate
incoming fibre cards, duplicate bus-to-SCSI cards, duplicate SCSI cards,
and duplicate SCSI buses to the disk arrays.

I believe the state of any redundant SCSI cards is kept the same at all
times, even if only one is actually writing. The software is smart enough
to know what is committed to disk and what is not, and so can recover from
write errors. Of course everything is battery backed up.
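
As a generic illustration of "knowing what is committed and what is not", here is a sketch of a mirrored write cache. It is not a description of how the Symmetrix is actually built; every name in it is an assumption. Each write is recorded on both controllers before it is acknowledged, and an entry stays "dirty" until it reaches disk, so a survivor can finish whatever its partner left undone.

```python
class Controller:
    def __init__(self, name):
        self.name = name
        self.dirty = {}  # block -> data accepted but not yet on disk


class MirroredPair:
    """Two controllers with identical cache state in front of one disk."""

    def __init__(self):
        self.a = Controller("A")
        self.b = Controller("B")
        self.disk = {}

    def write(self, block, data):
        # Mirror into both caches before acknowledging the host.
        self.a.dirty[block] = data
        self.b.dirty[block] = data
        return "ack"

    def destage(self, block):
        # Commit one block to disk, then clear it from both caches.
        self.disk[block] = self.a.dirty[block]
        del self.a.dirty[block]
        del self.b.dirty[block]

    def fail_over(self, survivor):
        # The survivor replays everything not yet known to be on disk.
        for block, data in list(survivor.dirty.items()):
            self.disk[block] = data
        survivor.dirty.clear()


pair = MirroredPair()
pair.write(1, "x")
pair.write(2, "y")
pair.destage(1)         # block 1 is committed; block 2 is still only in cache
pair.fail_over(pair.b)  # controller A dies; B flushes what was outstanding
print(pair.disk)
```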
