Zak said:
Linux does it in software, calling it RAID 6.
NetApp calls it RAID DP - the D being Dual or Diagonal. NetApp's is
devilishly clever and simple - it involves making a normal RAID 4, with a
single parity disk, and then computing plain diagonal parity across those
disks: block 0 from disk 0, block 1 from disk 1, block 2 from disk 2, and
block 3 from the parity disk would end up on the Dparity disk - except that
every DP diagonal skips one disk.
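Here is roughly how I picture that layout, sketched in Python. The geometry
(a prime p, p-1 data disks, one row-parity disk, one diagonal-parity disk,
p-1 rows per stripe) and all the names here are my own reading of the
paper, not NetApp's actual on-disk format:

    def xor_blocks(blocks, size):
        """XOR a list of equal-sized blocks together."""
        out = bytearray(size)
        for b in blocks:
            for i in range(size):
                out[i] ^= b[i]
        return bytes(out)

    def build_stripe(data, p, block_size):
        """data[disk][row] -> (row_parity[row], diag_parity[diag])."""
        rows = p - 1
        # Row parity across the p-1 data disks: plain RAID 4.
        row_parity = [xor_blocks([data[d][r] for d in range(p - 1)], block_size)
                      for r in range(rows)]

        # Block (row r, disk d) belongs to diagonal (r + d) % p, where the
        # row-parity disk counts as disk p-1.  Each diagonal misses exactly
        # one disk, and diagonal p-1 is never stored.
        def block(r, d):
            return data[d][r] if d < p - 1 else row_parity[r]

        diag_parity = []
        for q in range(p - 1):                       # stored diagonals only
            members = [block((q - d) % p, d) for d in range(p)
                       if (q - d) % p < rows]
            diag_parity.append(xor_blocks(members, block_size))
        return row_parity, diag_parity

With p = 5 this is the 4-data-disk example in the text: each stored
diagonal XORs together four blocks drawn from the data and row-parity
columns, and skips one disk entirely.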
The interesting thing is that reconstruction takes only XOR operations -
they just have to be done in the right order.
http://www.netapp.com/tech_library/ftp/3298.pdf
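Here is a similarly rough sketch of the recovery order when two data disks
die at once, reusing the layout above (again just my interpretation of the
paper, assuming both failed disks are data disks):

    def xor_blocks(blocks, size):            # same XOR helper as above
        out = bytearray(size)
        for b in blocks:
            for i in range(size):
                out[i] ^= b[i]
        return bytes(out)

    def recover_two_data_disks(data, row_parity, diag_parity, failed, p,
                               block_size):
        """Rebuild two failed data disks in place.

        data[d][r] holds the surviving blocks; the entries for the two
        disks listed in `failed` are placeholders that get overwritten.
        """
        rows = p - 1

        def block(r, d):
            return data[d][r] if d < p - 1 else row_parity[r]

        def rebuild_from_diagonal(q, r, d):
            # Diagonal parity q XORed with every other member of diagonal
            # q gives back the single missing block (r, d).
            others = [block((q - k) % p, k) for k in range(p)
                      if (q - k) % p < rows and k != d]
            data[d][r] = xor_blocks([diag_parity[q]] + others, block_size)

        def rebuild_from_row(r, d):
            # Ordinary RAID 4 style row reconstruction.
            data[d][r] = xor_blocks([block(r, k) for k in range(p) if k != d],
                                    block_size)

        # One chain per failed disk: start at the diagonal that skips disk
        # x, which is therefore missing only its block on disk y.  Recover
        # that block from diagonal parity, recover its row partner on disk
        # x from row parity, and repeat until the chain reaches the
        # diagonal that is never stored.
        for x, y in ((failed[0], failed[1]), (failed[1], failed[0])):
            q = (x - 1) % p
            while q != p - 1:
                r = (q - y) % p
                rebuild_from_diagonal(q, r, y)
                rebuild_from_row(r, x)
                q = (r + x) % p

Every step is just an XOR of blocks that are either on surviving disks or
already rebuilt, which is the "right order" part.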
I read that document, and parts of it seem a little misleading to me -
specifically, some of the statements justifying the benefits of being able
to repair a double failure.
But those don't affect the details of the implementation. In that area, it
is a straightforward "horizontal and vertical" parity scheme as used in old
tape drives (except that they make the "vertical" parity diagonal, which
helps even out the workload in the same way that RAID 5 does over RAID 4).
This works, of course, but it does have some drawbacks.
One is the requirement to double the number of parity disks, which makes it
more expensive (this is true of most such schemes, of course), but with the
small RAID group sizes typically used in such servers it can become
significant. For example, a 4+1 RAID group carries a 25% parity overhead,
but 4+2 jumps to 50%. For this reason, NetApp suggests in the paper going
to larger RAID group sizes.
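Just to spell out the arithmetic (the 14+2 group below is only a size I
picked for illustration, not a NetApp recommendation):

    # Parity overhead = parity disks as a fraction of data disks.
    for data_disks, parity_disks in ((4, 1), (4, 2), (14, 2)):
        print(f"{data_disks}+{parity_disks}: "
              f"{parity_disks / data_disks:.0%} overhead")
    # 4+1: 25% overhead
    # 4+2: 50% overhead
    # 14+2: 14% overhead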
The other issue is performance. The paper claims a 2-3% performance cost
due to the extra writes to the second parity disk. NetApp uses a
proprietary file system to eliminate the small-write penalty traditionally
associated with RAID 4 or 5, which allows them to do "full stripe" writes
all the time. I don't understand how that extends to the second parity
disk. That disk holds parity over a different set of blocks, so it cannot
be covered by the parity calculation already done for the first parity
set. So unless they have extended the system to write the whole set of
participating drives, i.e. all the horizontal parity groups at the same
time, I don't see how they avoid extra reads for the second parity drive.
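To make my confusion concrete: in the layout as I understand it, each
stored diagonal touches one block in every row of the stripe, so the
diagonal parity can only be computed purely from data in memory if all of
the rows are written together. A quick enumeration with an illustrative
p = 5 geometry (disks 0-3 data, disk 4 row parity):

    p = 5
    rows = p - 1
    for q in range(p - 1):   # the stored diagonals
        members = [((q - d) % p, d) for d in range(p) if (q - d) % p < rows]
        print(f"diagonal parity block {q} covers (row, disk): {members}")
    # Each stored diagonal hits every one of the p-1 rows exactly once,
    # which is why writing a single full row doesn't cover it.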
Can anyone shed some light on this?