RAID 5 failure... ARGH!!!

  • Thread starter: Calab

Calab

I have a major problem here...

Machine is a Windows 2003 Server with an Adaptec 21610SA RAID controller.
This is a PCI controller card with 16 SATA ports. Connected to the
controller are eight 500 GB SATA drives. There are two RAID 5 arrays, with
four drives each.

Recently, we had a drive fail. As expected, the array still functioned. We
shut the system down, swapped out the defective drive and started the system
up. We verified that the correct drive was swapped out and then let the
controller start the rebuilding process on the array.

Well, at about 30%, the system reported that the rebuild failed. The
replacement drive was surface scanned and it didn't show any errors. We do
not understand why the rebuild failed.

To make matters worse, even though the array should still be functional from
the three remaining drives, the adapter is reporting that the array has
failed. We cannot access the files on this array. We have tried removing the
failed drive, leaving the port on the controller empty, and tried
reinstalling the original drive. No change either way.

So, can anyone tell me what's happening here? Any ideas why the rebuild
might have failed? Why can't I access my degraded array?
 
Machine is a Windows 2003 Server with an Adaptec 21610SA RAID controller.
This is a PCI controller card with 16 SATA ports. Connected to the
controller are eight 500 GB SATA drives. There are two RAID 5 arrays, with
four drives each.

Am I to understand that a single controller is responsible for 2
raid arrays ? Aaaarghhhh !
Recently, we had a drive fail. As expected, the array still functioned. We
shut the system down, swapped out the defective drive and started the system
up. We verified that the correct drive was swapped out and then let the
controller start the rebuilding process on the array.

Well, at about 30%, the system reported that the rebuild failed.

It takes hours for a 500 GB drive to be reconstructed.

If one of the other drives in the array fails during this
reconstruction you are in deep trouble.
(That's where RAID 6 comes in.)
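
To illustrate why a second failure during the rebuild is fatal, here is a minimal Python sketch of the XOR parity idea behind RAID 5. The block contents and the helper name xor_blocks are made up for illustration; a real controller works on much larger stripe units and rotates parity across the drives.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe of a four-drive RAID 5: three data blocks plus one parity block.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# One member lost: XOR of the three survivors gives the missing block back.
rebuilt = xor_blocks([data[0], data[2], parity])
assert rebuilt == data[1]

# Two members lost: the survivors only yield the XOR of the two missing
# blocks, which is not enough information to separate them again.
mixed = xor_blocks([data[0], parity])
assert mixed == xor_blocks([data[1], data[2]])
assert mixed != data[1] and mixed != data[2]
```

The rebuild has to read every sector of every surviving drive, so a single unreadable area on any of them leaves the controller in the "two members lost" case for that stripe.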
 
Am I to understand that a single controller is responsible for 2 raid
arrays ? Aaaarghhhh !


It takes hours for a 500 GB drive to be reconstructed.

If one of the other drives in the array fails during this reconstruction
you are in deep trouble. (That's where RAID 6 comes in.)

Agreed, this is typical of a second drive failure.

At this point there are a few things I would recommend. If the data is
important, don't keep trying to rebuild or do anything else with any of
the drives from the broken set (including the new one). Locate backups if
available and prepare to restore onto a new set.

If you want to try and recover the data from the broken set, make a byte
level image of each drive - ghost or dd. Once the first copy is made,
make a second copy of the set you can play with. There are a few RAID
rebuild tools available that might help - Google is your friend.

If you have trouble imaging any of the drives, or you are not comfortable
with the imaging/recovery process, a data recovery (DR) company will be
required.
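
One way to make sure the working copy really is identical to the first image before experimenting on it is to compare checksums. A minimal Python sketch, assuming the images are ordinary files; the function names and the choice of SHA-256 are illustrative and not part of any particular imaging tool:

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large images do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(master_path, working_path):
    """Return True if the working copy is byte-for-byte identical to the master."""
    return sha256_of_file(master_path) == sha256_of_file(working_path)
```

If copies_match returns False, the working copy should be remade before any recovery tool is pointed at it.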

Best Regards,
 
Gerard Bok said:
Am I to understand that a single controller is responsible for 2
raid arrays ? Aaaarghhhh !

It's only two arrays because the Adaptec controller can't handle arrays
larger than two terabytes. I've been looking for an affordable PCI-e
solution, but no luck so far.
It takes hours for a 500 GB drive to be reconstructed.

It was left overnight, so no idea how long it took to fail.
If one of the other drives in the array fails during this
reconstruction you are in deep trouble.
(That's where RAID 6 comes in.)

I hear you, but there are no errors/reasons given for the failure on any
other drives. I'm including my RAID event log for the failure below.


January 24, 2009 7:36:46 PM MST ERR win2k3 Drive in a
RAID-5 set failed: controller 1, logical device 2
January 24, 2009 7:36:46 PM MST ERR win2k3 Disk failed:
controller 1, channel 0, SCSI device ID 1
January 24, 2009 7:36:46 PM MST WRN win2k3 RAID-5 failover
operation failed because there are no failover devices assigned to this
RAID-5 set: controller 1, logical device 2
January 24, 2009 7:36:46 PM MST INF win2k3 Container
changed: controller 1, logical device 2
January 24, 2009 7:36:48 PM MST WRN 301:A01C-S--L02 win2k3 Logical
device is degraded: controller 1, logical device 2 ("Raid5 #1").
January 24, 2009 7:36:48 PM MST ERR 401:A01C1S01L-- win2k3 Failed
drive: controller 1, port 1 (Vendor: ST350064 Model: 1AS).
January 24, 2009 11:58:31 PM MST WRN 338:A01C-S--L-- win2k3 Periodic
scan found one or more degraded logical devices: controller 1. Repair as
soon as possible to avoid data loss.
January 25, 2009 1:35:05 AM MST WRN win2k3 Expanded event,
SCSI group, command timeout. Controller 1, channel 0, SCSI device ID 3, LUN
0, cdb [28 00 16 a6 8a e0 00 01 20 00 00 00]
January 25, 2009 1:35:05 AM MST WRN win2k3 Expanded event,
SCSI group, bus reset. Controller 1, bus 0, isInBound 0
January 25, 2009 1:35:05 AM MST WRN win2k3 Expanded event,
SCSI group, bus reset. Controller 1, bus 0, isInBound 0
January 25, 2009 8:01:40 AM MST WRN 338:A01C-S--L-- win2k3 Periodic
scan found one or more degraded logical devices: controller 1. Repair as
soon as possible to avoid data loss.
January 25, 2009 4:04:45 PM MST WRN 338:A01C-S--L-- win2k3 Periodic
scan found one or more degraded logical devices: controller 1. Repair as
soon as possible to avoid data loss.
January 26, 2009 12:07:50 AM MST WRN 338:A01C-S--L-- win2k3 Periodic
scan found one or more degraded logical devices: controller 1. Repair as
soon as possible to avoid data loss.
January 26, 2009 1:03:22 AM MST WRN 301:A01C-S--L02 win2k3 Logical
device is degraded: controller 1, logical device 2 ("Raid5 #1").
January 26, 2009 1:04:22 AM MST INF 304:A01C-S--L02 win2k3 Rebuilding:
controller 1, logical device 2 ("Raid5 #1").
January 26, 2009 2:05:07 AM MST INF win2k3 Running: RAID 5
rebuild - 5%. Controller 1, logical device 2
January 26, 2009 3:08:00 AM MST INF win2k3 Running: RAID 5
rebuild - 10%. Controller 1, logical device 2
January 26, 2009 4:12:17 AM MST INF win2k3 Running: RAID 5
rebuild - 15%. Controller 1, logical device 2
January 26, 2009 5:16:30 AM MST INF win2k3 Running: RAID 5
rebuild - 20%. Controller 1, logical device 2
January 26, 2009 6:21:03 AM MST INF win2k3 Running: RAID 5
rebuild - 25%. Controller 1, logical device 2
January 26, 2009 7:25:26 AM MST INF win2k3 Running: RAID 5
rebuild - 30%. Controller 1, logical device 2
January 26, 2009 7:36:09 AM MST INF win2k3 Expanded event,
container group, PPI update. Age 2,392
January 26, 2009 7:36:09 AM MST INF win2k3 Container
changed: controller 1, logical device 2
January 26, 2009 7:36:09 AM MST INF win2k3 Logical device
deleted: controller 1, logical device 2
January 26, 2009 7:36:09 AM MST INF win2k3 Adapter text
event: Container 1 failed REBUILD task: I/O error - drive 0:3:0 failed
controller 1
January 26, 2009 7:36:09 AM MST ERR win2k3 Failed: RAID 5
rebuild - 30%. Controller 1, logical device 2
January 26, 2009 7:36:09 AM MST INF win2k3 Configuration
has changed.
January 26, 2009 7:36:09 AM MST ERR 303:A01C-S--L02 win2k3 Logical
device failed: controller 1, logical device 2.
January 26, 2009 7:36:09 AM MST ERR 306:A01C-S--L02 win2k3 Rebuild
failed: controller 1, logical device 2 [0x00].
January 27, 2009 1:02:14 PM MST ERR 303:A01C-S--L02 win2k3 Logical
device failed: controller 1, logical device 2.
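
For what it's worth, the progress entries in that log give a rough idea of how long a full rebuild would have taken. A small Python sketch, assuming the rebuild rate stays constant (which it may not); the timestamps are copied from the log with the time zone dropped, and the function names are just for illustration:

```python
from datetime import datetime

# Progress entries copied from the event log: (timestamp, percent done).
progress = [
    ("January 26, 2009 2:05:07 AM", 5),
    ("January 26, 2009 3:08:00 AM", 10),
    ("January 26, 2009 4:12:17 AM", 15),
    ("January 26, 2009 5:16:30 AM", 20),
    ("January 26, 2009 6:21:03 AM", 25),
    ("January 26, 2009 7:25:26 AM", 30),
]


def parse(timestamp):
    """Parse the log's timestamp format (English month names, 12-hour clock)."""
    return datetime.strptime(timestamp, "%B %d, %Y %I:%M:%S %p")


def estimate_full_rebuild_hours(entries):
    """Extrapolate a 0-100% rebuild time from the first and last entries."""
    (t0, p0), (t1, p1) = entries[0], entries[-1]
    seconds_per_percent = (parse(t1) - parse(t0)).total_seconds() / (p1 - p0)
    return seconds_per_percent * 100 / 3600


hours = estimate_full_rebuild_hours(progress)
print(f"Estimated full rebuild: about {hours:.1f} hours")
```

Going from 5% to 30% took 5 hours 20 minutes, so at that pace the whole rebuild works out to roughly 21 hours during which every surviving drive has to read cleanly.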
 
Rob McCrea said:
Agreed, this is typical of a second drive failure.

My other reply shows the RAID event log and it doesn't look like another
drive failed.
At this point there are a few things I would recommend. If the data is
important, don't keep trying to rebuild or do anything else with any of
the drives from the broken set (including the new one). Locate backups if
available and prepare to restore onto a new set.

It looks like I'm going to have to start investing in some of these 1.5TB
drives... The point of going RAID was to lessen (not eliminate) the need for
constant backups.
If you want to try and recover the data from the broken set, make a byte
level image of each drive - ghost or dd. Once the first copy is made,
make a second copy of the set you can play with. There are a few RAID
rebuild tools available that might help - Google is your friend.

Thanks! I hadn't thought about looking for utilities online.
If you have trouble imaging any of the drives or you are not comfortable
with the imaging/recovery process a DR company will be required.

I know... and this is going to be expensive. : (

Thanks!
 
January 26, 2009 7:36:09 AM MST INF win2k3 Adapter
text event: Container 1 failed REBUILD task: I/O error - drive 0:3:0
failed controller 1

Note the I/O error part.
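
The same device (channel 0, SCSI device ID 3) also logged a command timeout on January 25, and that entry includes the command bytes. Opcode 0x28 is a SCSI READ(10), so the bytes can be decoded to see which block the controller was trying to read. A minimal Python sketch; the function name is made up, and turning the block number into a byte offset assumes 512-byte sectors:

```python
def decode_read10(cdb_hex):
    """Decode a SCSI READ(10) CDB given as space-separated hex bytes.

    Returns (logical block address, number of blocks requested).
    """
    cdb = bytes.fromhex(cdb_hex)
    if cdb[0] != 0x28:
        raise ValueError("not a READ(10) command")
    lba = int.from_bytes(cdb[2:6], "big")      # bytes 2-5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")   # bytes 7-8: transfer length
    return lba, blocks


# CDB copied from the January 25 timeout entry in the event log.
lba, blocks = decode_read10("28 00 16 a6 8a e0 00 01 20 00 00 00")
print(f"READ(10) of {blocks} blocks starting at LBA {lba}")

# Assumption: 512-byte sectors.
print(f"Byte offset on the drive: {lba * 512}")
```

That gives a read of 288 blocks starting at LBA 380013280. It does not prove the drive is bad, but it is a second sign of trouble on drive 0:3:0 before the rebuild even started.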

Best Regards,
 
It looks like I'm going to have to start investing in some of these
1.5TB drives... The point of going RAID was to lessen (not eliminate)
the need for constant backups.

=) RAID in no way even lessens the need for backups - I did a blog post
about this a little while back - http://www.zebralogic.ca/node/4 I run
into a lot of IT people that have misconceptions about where RAID falls
into the storage management structure.
I know... and this is going to be expensive. : (

It can seem like it but when compared to the cost of rebuilding data or
customer relations it is not as expensive as it may seem.

Best Regards,
 
Rob said:
=) RAID in no way even lessens the need for backups - I did a blog post
about this a little while back - http://www.zebralogic.ca/node/4 I run
into a lot of IT people that have misconceptions about where RAID falls
into the storage management structure.

<<snip>>

Consider what happens to a server if the power supply fails and,
due to its elevated output voltage, burns out all the hard drives
inside the computer at the same instant. RAID
won't protect against that.

That also means that the backup drive shouldn't be drawing
power from the same supply as all the RAID disks, if you expect the
backup to be available when you need it. The backup should
preferably be in another box.

Offsite backup is part of your disaster recovery plan: put as much
distance between the server and the backup as the largest disaster
you can imagine would cover.
Could a fire or flood in the building destroy your server and
the backup machine? Should you use another building in town
to hold the backup? Should you be storing the backup tapes
outside the building?

At my old employer, backups were done... in another country :-)
How's that for disaster planning ?

Paul
 
If you want to try and recover the data from the broken set, make a byte
level image of each drive - ghost or dd. Once the first copy is made,
make a second copy of the set you can play with. There are a few RAID
rebuild tools available that might help - Google is your friend.

I have just one question about the retrieval software that I've found
online.

Since the array is damaged, it is not available for software to find. Do
these programs usually require the drives from the array to be removed from
the RAID controller and connected directly to standard ports so that Windows
will see each drive as an individual, unready, drive?
 
No, they can't be used separately from the array they
belong to, even if connected to the same controller. If the
software can't see them, it simply won't be able to retrieve
the data. What you might be able to do instead is, as
previously suggested, connect them to that or another
controller, make an exact image of the drives onto
new drives, then try to recover from the new drives, leaving the
originals intact.

Or, given the fact that his log also showed adapter I/O errors:
get a replacement for the controller.
It would be a waste of time and effort to 'fix faulty drives'
when in fact one or more ports on the controller have passed away.

Re-reading some of the thread, I even dare raising another
question for OP:
Are you absolutely sure that you correctly identified the failing
drive and pulled that one from the array ?
As in: if you misidentified the faulty drive and pulled its
neighbour, you would end up with almost the same error :-)
 
Calab said:
I have a major problem here...

Machine is a Windows 2003 Server with an Adaptec 21610SA RAID controller.
This is a PCI controller card with 16 SATA ports. Connected to the
controller are eight 500 GB SATA drives. There are two RAID 5 arrays, with
four drives each.
To make matters worse, even though the array should still be functional
from the three remaining drives, the adapter is reporting that the array
has failed. We cannot access the files on this array. We have tried
removing the failed drive, leaving the port on the controller empty, and
tried reinstalling the original drive. No change either way.

At this point I...

- have had the problem array of drives disconnected from the PC to avoid it
being altered.
- have installed a new 1.5TB drive into the system
- ensured that my current array was working well by:
  - backing up my working array to the 1.5TB drive
  - then erasing and recreating a RAID5 array
  - copying the data back to the array.

At this point, I have the 1.5TB drive OR the RAID5 array that I can format
and use to try and recover my failed array.

What I'd like to do is...
- create sector by sector images of each drive onto the 1.5TB drive using
Linux or a bootable program
- use a RAID recovery program to try and fix the array images
- OR -
- do a sector by sector copy of each failed array drive to the drives from
the working array
- use a RAID recovery program to try and fix the copied array

The failed array drives are NOT visible to the computer at all because the
Adaptec controller has flagged the array as failed. This means that I need
to connect these drives to a SATA connector on the mainboard. This should
work fine, as I'm doing a sector by sector image, right?

Any comments are welcome. Wish me luck!
 
The failed array drives are NOT visible to the computer at all because the
Adaptec controller has flagged the array as failed. This means that I need
to connect these drives to a SATA connector on the mainboard. This should
work fine, as I'm doing a sector by sector image, right?

Wrong.
- Windows is notorious for writing to media it has no business
writing to. Don't trust it not to do so.
- Most probably, the imaging program you intend to use for your
sector by sector copy will refuse to operate on a partition it
cannot identify. Your part-of-RAID partition is likely to qualify
as such, I'm afraid.
Any comments are welcome.

Do the maths. Is your data valuable ?
If so, hire a professional.
If it isn't, decide whether it is wise to spend more time on it
:-)
Wish me luck!

Good luck !
 
If the imaging program is doing a sector by sector raw image, it should
not matter if there is a partition, only that the drive is accessible.

When I do this, I generally do not have any original hardware involved. I
have the most simple computer setup possible - this minimizes the number
of failure points. I usually have two drives connected to a computer at
one time. One drive is the source drive (RAID element) and the other is
where the image will be stored - in your case this would likely be the
1.5TB drive. Then a byte copy of the source drive with the destination
being a file on the second formatted drive. When everything is done you
will end up with a destination drive containing an image file for each
element of your array.

NOTE: at no time was the RAID controller involved.
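
For anyone who wants to see what the byte copy amounts to, here is a minimal Python sketch of the idea. It assumes a Linux or other Unix-like system where the source drive can be opened as a file (a device path such as /dev/sdb would be an example, not something from this thread) and that the destination is a file on the formatted drive. Unreadable blocks are filled with zeros so that everything after them stays at the right offset. Purpose-built tools such as dd or GNU ddrescue do this job far more robustly; the function names here are made up.

```python
import os


def read_exact(fd, offset, length):
    """Read up to `length` bytes at `offset`, looping over short reads."""
    os.lseek(fd, offset, os.SEEK_SET)
    chunks = []
    remaining = length
    while remaining > 0:
        chunk = os.read(fd, remaining)
        if not chunk:  # unexpected end of device
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def image_device(source_path, image_path, block_size=64 * 1024):
    """Copy source to an image file block by block, read-only on the source.

    Blocks that fail to read are zero-filled. Returns the list of byte
    offsets at which a read error occurred.
    """
    bad_offsets = []
    # O_BINARY only exists on Windows; it is 0 elsewhere.
    src = os.open(source_path, os.O_RDONLY | getattr(os, "O_BINARY", 0))
    try:
        size = os.lseek(src, 0, os.SEEK_END)
        with open(image_path, "wb") as dst:
            offset = 0
            while offset < size:
                want = min(block_size, size - offset)
                try:
                    data = read_exact(src, offset, want)
                except OSError:
                    data = b""
                    bad_offsets.append(offset)
                # Pad failed or short reads so image offsets stay aligned.
                dst.write(data.ljust(want, b"\0"))
                offset += want
    finally:
        os.close(src)
    return bad_offsets
```

The returned list of bad offsets is worth keeping with the image, since it tells the recovery software (or you) which parts of that member cannot be trusted.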

Once this is done the RAID recovery software can operate on the images
which reside on the destination drive. The recovery software will often
de-RAID the individual images into a single unified disk image.
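
As a rough illustration of what "de-RAID" means, here is a Python sketch that turns RAID 5 member images back into one logical image. Everything about the layout is an assumption: it uses the left-symmetric parity rotation (one common layout, not necessarily what the Adaptec card uses), takes the stripe unit size and drive order as given, and assumes the data starts at offset 0 of each member with no controller metadata in front of it. Real recovery tools spend most of their effort working out exactly those parameters. The function names are made up.

```python
def xor_units(units):
    """XOR equal-length byte strings together."""
    out = bytearray(len(units[0]))
    for unit in units:
        for i, byte in enumerate(unit):
            out[i] ^= byte
    return bytes(out)


def destripe_raid5(members, stripe_unit):
    """Rebuild the logical volume from RAID 5 member images.

    members: list of bytes objects in array order; at most one entry may be
             None (a missing drive), which is reconstructed from parity.
    stripe_unit: size in bytes of one chunk on one drive.
    Assumes a left-symmetric layout (an assumption, see the note above).
    """
    n = len(members)
    if sum(m is None for m in members) > 1:
        raise ValueError("RAID 5 cannot be rebuilt with two members missing")
    size = min(len(m) for m in members if m is not None)
    out = bytearray()
    for row in range(size // stripe_unit):
        start = row * stripe_unit
        units = [None if m is None else m[start:start + stripe_unit]
                 for m in members]
        if None in units:
            missing = units.index(None)
            units[missing] = xor_units([u for u in units if u is not None])
        # Parity moves one drive to the left on each row, starting at the
        # last drive; data starts on the drive after parity and wraps around.
        parity_disk = (n - 1) - (row % n)
        for k in range(n - 1):
            out += units[(parity_disk + 1 + k) % n]
    return bytes(out)
```

With a wrong drive order, stripe size or layout the output will simply be scrambled, which is why this should only ever be tried on copies of the images.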

Since you had a failed rebuild, you may have to run some file system
reconstruction tools on the unified image. Once the file system
mounts, you can extract the files from the image and copy them to the new
RAID device, copy the unified image to a standalone hard drive, or
transfer the image to the new RAID device.

This can become a very long and complicated process. If you are not
completely comfortable doing this you will save a lot of time and money
if you talk to a DR company.

Best Regards,
 
Since you had a failed rebuild, you may have to run some file system
reconstruction tools on the unified image. Once the file system
mounts, you can extract the files from the image and copy them to the new
RAID device, copy the unified image to a standalone hard drive, or
transfer the image to the new RAID device.

I expect this. I'm hoping that my data is recoverable. It's not highly
valuable, but it is important to me.
This can become a very long and complicated process. If you are not
completely comfortable doing this you will save a lot of time and money
if you talk to a DR company.

I'm hearing quotes of $6000 to recover my data. That's just insane!
 
I'm hearing quotes of $6000 to recover my data. That's just insane!

Insane - I'm in the data recovery biz and one of my areas of focus is RAID
recoveries ;-) There are a couple things to keep in mind when dealing
with those numbers. First, for every (billable) hour I have spent actually
working on a recovery there are at **least** ten spent learning about how
something works and coming up with a method to successfully recover the
data. That doesn't include overhead or development costs for job specific
software. This is a boutique business and requires a depth of knowledge
requiring many years to learn. I have, for example, worked everywhere from
clean room fabs making ICs, to building and testing the machines that
make DVDs and HDD platters, to doing multi-OS kernel development to
support storage devices as well as developing the file systems stored on
them.

All that said, I can at the same time see how that number can seem insane
to an individual and that brings me to the second point. There is value
for the clients I service. I can help them get information back that will
keep a business in business and give people a chance to recover the things
they have worked hard to build. When you consider $6000 in comparison to a
$1,000,000 company going out of business the number doesn't seem so insane.

Quickly add up the time you have spent on this project. If this is a
business include the downtime and multiply by the burn rate of the
company and the business lost by being down. Is the number bigger than
you thought?
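
The sum described above can be written down as a tiny Python function. All of the figures in the example call are hypothetical placeholders, not numbers from this thread; plug in your own.

```python
def do_it_yourself_cost(hours_spent, hourly_rate, downtime_hours,
                        burn_rate_per_hour, lost_business):
    """Rough cost of handling a recovery in-house.

    hours_spent * hourly_rate            -> value of your own time
    downtime_hours * burn_rate_per_hour  -> what the company spends while down
    lost_business                        -> revenue lost by being down
    """
    own_time = hours_spent * hourly_rate
    downtime = downtime_hours * burn_rate_per_hour
    return own_time + downtime + lost_business


# Hypothetical figures, for illustration only.
total = do_it_yourself_cost(hours_spent=40, hourly_rate=50,
                            downtime_hours=16, burn_rate_per_hour=200,
                            lost_business=1000)
print(f"Estimated cost so far: ${total:,}")
```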

Best of luck,
 
All that said, I can at the same time see how that number can seem insane
to an individual and that brings me to the second point. There is value
for the clients I service. I can help them get information back that will
keep a business in business and give people a chance to recover the things
they have worked hard to build. When you consider $6000 in comparison to a
$1,000,000 company going out of business the number doesn't seem so insane.

Quickly add up the time you have spent on this project. If this is a
business include the downtime and multiply by the burn rate of the
company and the business lost by being down. Is the number bigger than
you thought?

And contrast that to the cost of keeping regular backups :-)
 
Calab said:
I have a major problem here...

Recently, we had a drive fail. As expected, the array still functioned.
We verified that the correct drive was swapped out and then let the
controller start the rebuilding process on the array.

Well, at about 30%, the system reported that the rebuild failed.

I've been working on this problem for a while. Most of the time waiting for
drive backups to complete, etc.

Now I'm wondering about something... The manufacturer's utility
should be able to let me set these drives as READ ONLY. If I do this, will
it make them safe for mounting in Windows, etc? Will they truly be read
only, or will the OS, etc. ignore that flag?

Thanks!
 
I've been working on this problem for a while. Most of the time waiting
for drive backups to complete, etc.

Now I'm wondering about something... The manufacturer's
utility should be able to let me set these drives as READ ONLY. If I do
this, will it make them safe for mounting in Windows, etc? Will they
truly be read only, or will the OS, etc. ignore that flag?

Thanks!

To which utility are you referring? If you are looking for a tool that
will block all writes to a drive no matter the OS it is called a write
blocker and it is an actual piece of hardware. It is typically used in
doing forensic investigations.

If you are familiar with Linux, BSD or some Unix variant you can make
things safe for mounting. Windows and Safe do not go together without a
write blocker.

Best Regards,
 