Problems with Linux software RAID after OS upgrade (long)

  • Thread starter: Mike Tomlinson

Mike Tomlinson

Having some trouble with Linux software RAID after an OS update, and
would be grateful for any insights.

Machine is an AMD 64-bit PC running 32-bit Linux. The machine was
previously running Fedora Core 4 with no problems. Two 500GB hard
drives were added to the onboard Promise controller and the Promise
section of the machine's BIOS configured for JBOD.

On boot, as expected, two new SCSI disk devices could be seen - sda and
sdb. These were partitioned using fdisk, a single partition occupying
the entire disk created, and the partition type set to 0xfd (Linux RAID
autodetect).

mdadm was used to create a RAID1 (mirror) using /dev/sda and /dev/sdb.
I can't remember for certain if I used the raw devices (/dev/sda) or the
partitions (/dev/sda1) to create the array, and my notes aren't clear.
The resulting RAID device, /dev/md0, had an ext3 filesystem created on
it and was mounted on a mount point. /etc/fstab was edited to mount
/dev/md0 on boot.

This arrangement worked well until recently, when the root partition on
the (separate) boot drive was trashed and Fedora Core 6 installed by
someone else, so I have only their version of events to go by. The
array did not reappear after FC6 was installed. The /etc/raidtab and/or
/etc/mdadm.conf files were not preserved, so I am working blind to
reassemble and remount the array.

Now things are confused. The way Linux software RAID works seems to
have changed in FC6. On boot, dmraid is run by rc.sysinit and discovers
the two members of the array OK and mounts it on
/dev/mapper/pdc_eejidjjag, where pdc_eejidjjag is the array's name:

[root@linuxbox root]# dmraid -r
/dev/sda: pdc, "pdc_eejidjjag", mirror, ok, 976562500 sectors, data@ 0
/dev/sdb: pdc, "pdc_eejidjjag", mirror, ok, 976562500 sectors, data@ 0

[root@linuxbox root]# dmraid -ay -v
INFO: Activating mirror RAID set "pdc_eejidjjag"
ERROR: dos: partition address past end of RAID device

[root@linuxbox root]# ls -l /dev/mapper/
total 0
crw------- 1 root root 10, 63 Jul 5 16:59 control
brw-rw---- 1 root disk 253, 0 Jul 6 03:11 pdc_eejidjjag

[root@linuxbox root]# fdisk -l /dev/mapper/pdc_eejidjjag

Disk /dev/mapper/pdc_eejidjjag: 500.0 GB, 500000000000 bytes
255 heads, 63 sectors/track, 60788 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Device Boot Start End Blocks Id System
/dev/mapper/pdc_eejidjjag1 1 60801 488384001 fd Linux raid autodetect

I cannot mount /dev/mapper/pdc_eejidjjag1:

[root@linuxbox root]# mount -v -t auto /dev/mapper/pdc_eejidjjag1 /mnt/test
mount: you didn't specify a filesystem type for /dev/mapper/pdc_eejidjjag1
I will try all types mentioned in /etc/filesystems or
/proc/filesystems
Trying hfsplus
mount: special device /dev/mapper/pdc_eejidjjag1 does not exist

'fdisk -l /dev/mapper/pdc_eejidjjag' shows that one partition of type
0xfd (Linux raid autodetect) is filling the disk. Surely this should be
type 0x83, since the device is the RAIDed disk as presented to the user?
And why does mount say the device /dev/mapper/pdc_eejidjjag1 does not
exist?

This may be due to my unfamiliarity with dmraid. I can find little
about it on the internet. I'm uncertain if it is meant to be used in
conjunction with mdadm, or whether it's either/or. In the past, Linux
software RAID has Just Worked for me using mdadm.

If I disregard dmraid, disabling the array with 'dmraid -an /dev/md0'
and use the more familiar mdadm instead, first checking with fdisk that
the disks have the correct RAID autodetect partitions:

[root@linuxbox root]# fdisk -l /dev/sda

Disk /dev/sda: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Device Boot Start End Blocks Id System
/dev/sda1 1 60801 488384001 fd Linux raid autodetect

[root@linuxbox root]# fdisk -l /dev/sda

Disk /dev/sda: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Device Boot Start End Blocks Id System
/dev/sda1 1 60801 488384001 fd Linux raid autodetect

then try to assemble the RAID with those, it fails:

[root@linuxbox root]# mdadm -v --assemble /dev/md0 /dev/sda1 /dev/sdb1
mdadm: looking for devices for /dev/md0
mdadm: cannot open device /dev/sda1: No such device or address
mdadm: /dev/sda1 has no superblock - assembly aborted

Perhaps I should be using the raw devices?

[root@linuxbox root]# mdadm -v --assemble /dev/md0 /dev/sda /dev/sdb
mdadm: looking for devices for /dev/md0
mdadm: /dev/sda is identified as a member of /dev/md0, slot 0.
mdadm: /dev/sdb is identified as a member of /dev/md0, slot 1.
mdadm: added /dev/sdb to /dev/md0 as 1
mdadm: added /dev/sda to /dev/md0 as 0
mdadm: /dev/md0 has been started with 2 drives.

[root@linuxbox root]# mdadm -E /dev/sda
/dev/sda:
Magic : a92b4efc
Version : 00.90.01
UUID : c4344083:a8d8cf32:3f00e0db:8765b21b
Creation Time : Thu Mar 22 15:26:52 2007
Raid Level : raid1
Device Size : 488386496 (465.76 GiB 500.11 GB)
Array Size : 488386496 (465.76 GiB 500.11 GB)
Raid Devices : 2
Total Devices : 2
Preferred Minor : 0

Update Time : Thu Jul 5 16:58:02 2007
State : clean
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0
Checksum : 864ad759 - correct
Events : 0.4


Number Major Minor RaidDevice State
this 0 8 0 0 active sync /dev/sda

0 0 8 0 0 active sync /dev/sda
1 1 8 16 1 active sync /dev/sdb

[root@linuxbox root]# mdadm -E /dev/sdb
/dev/sdb:
Magic : a92b4efc
Version : 00.90.01
UUID : c4344083:a8d8cf32:3f00e0db:8765b21b
Creation Time : Thu Mar 22 15:26:52 2007
Raid Level : raid1
Device Size : 488386496 (465.76 GiB 500.11 GB)
Array Size : 488386496 (465.76 GiB 500.11 GB)
Raid Devices : 2
Total Devices : 2
Preferred Minor : 0

Update Time : Thu Jul 5 16:58:02 2007
State : clean
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0
Checksum : 864ad76b - correct
Events : 0.4


Number Major Minor RaidDevice State
this 1 8 16 1 active sync /dev/sdb

0 0 8 0 0 active sync /dev/sda
1 1 8 16 1 active sync /dev/sdb

so that looks OK. Let's see what /dev/md0 looks like:

[root@linuxbox root]# fdisk -l /dev/md0

Disk /dev/md0: 500.1 GB, 500107771904 bytes
255 heads, 63 sectors/track, 60801 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Device Boot Start End Blocks Id System
/dev/md0p1 1 60801 488384001 fd Linux raid autodetect

That doesn't look right; I would have expected to see a partition of
type 0x83, since /dev/md0p1 is the RAID as presented to the user
according to fdisk. Trying to mount it anyway:

[root@linuxbox root]# mount -v -t auto /dev/md0 /mnt/test
mount: you didn't specify a filesystem type for /dev/md0
I will try all types mentioned in /etc/filesystems or
/proc/filesystems
Trying hfsplus
mount: you must specify the filesystem type

[root@linuxbox root]# mount -v -t auto /dev/md0p1 /mnt/test
mount: you didn't specify a filesystem type for /dev/md0p1
I will try all types mentioned in /etc/filesystems or
/proc/filesystems
Trying hfsplus
mount: special device /dev/md0p1 does not exist

mdadm --examine /dev/sd* shows both members of the array as correct,
with the same serial number. "cat /proc/mdstat" shows the array as
complete and OK with two members as expected.

/proc/mdstat shows:

[root@linuxbox root]# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sda[0] sdb[1]
488386496 blocks [2/2] [UU]

unused devices: <none>

I'm confused. I can't find much information on dmraid; the man page
seems to imply that it's for use with hardware RAID controllers, and I
don't know if I should be using that or mdadm, or both. Previously I
just used mdadm and everything Just Worked.

I don't know why assembling and starting the array doesn't present the
contents of the md device as expected, and why fdisk shows special
devices in /dev which the mount command says don't exist.

The user of the machine is getting worried as there's a lot of data on
this array, and of course, he has no backup.

I'm at the point of taking the disks out and trying them in a machine
running FC4. Any ideas or suggestions please before I do that?
 
In comp.sys.ibm.pc.hardware.storage Mike Tomlinson said:
Having some trouble with Linux software RAID after an OS update, and
would be grateful for any insights.
Machine is an AMD 64-bit PC running 32-bit Linux. The machine was
previously running Fedora Core 4 with no problems. Two 500GB hard
drives were added to the onboard Promise controller and the Promise
section of the machine's BIOS configured for JBOD.

I assume that is individual disks, instead of the JBOD "RAID" mode?
On boot, as expected, two new SCSI disk devices could be seen - sda and
sdb. These were partitioned using fdisk, a single partition occupying
the entire disk created, and the partition type set to 0xfd (Linux RAID
autodetect).
Ok.

mdadm was used to create a RAID1 (mirror) using /dev/sda and /dev/sdb.
I can't remember for certain if I used the raw devices (/dev/sda) or the
partitions (/dev/sda1) to create the array, and my notes aren't clear.

That is important. With partitions, the RAID would start automatically
because of type 0xfd. With whole drives it would not, and would require
some start script. Also, the partitioning left on the disks if you
used the whole disk will confuse RAID auto-detectors.
The resulting RAID device, /dev/md0, had an ext3 filesystem created on
it and was mounted on a mount point. /etc/fstab was edited to mount
/dev/md0 on boot.
ok.

This arrangement worked well until recently, when the root partition on
the (separate) boot drive was trashed and Fedora Core 6 installed by
someone else, so I have only their version of events to go by. The
array did not reappear after FC6 was installed. The /etc/raidtab and/or
/etc/mdadm.conf files were not preserved, so I am working blind to
reassemble and remount the array.

Should not be a problem. If you try to reassemble, any part not having
a valid RAID signature will be rejected.
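
For reference, a lost mdadm.conf can usually be rebuilt from the superblocks. The sketch below only prints a minimal two-line configuration from the device names and UUID quoted in this thread. The function name is made up for illustration, and on a live system `mdadm --examine --scan` should print an equivalent ARRAY line.

```shell
# Print a minimal mdadm.conf for a two-disk RAID1 built on whole devices.
# $1 = md device, $2 and $3 = member devices, $4 = array UUID (from mdadm -E).
mdadm_conf() {
    printf 'DEVICE %s %s\n' "$2" "$3"
    printf 'ARRAY %s level=raid1 num-devices=2 UUID=%s\n' "$1" "$4"
}

# Values taken from the mdadm -E output in the original post.
mdadm_conf /dev/md0 /dev/sda /dev/sdb c4344083:a8d8cf32:3f00e0db:8765b21b
```

Redirecting that output to /etc/mdadm.conf would be one way to let `mdadm --assemble --scan` find the array at boot, assuming the device names stay the same.
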
Now things are confused. The way Linux software RAID works seems to
have changed in FC6. On boot, dmraid is run by rc.sysinit and discovers
the two members of the array OK and mounts it on
/dev/mapper/pdc_eejidjjag, where pdc_eejidjjag is the array's name:

Hmmm. From what I can see dmraid is not intended for normal
software RAID, but rather for fakeRAID controllers (software
RAID done by BIOS code). It may also be able to handle normal
software RAID, but I have never used it.
[root@linuxbox root]# dmraid -r
/dev/sda: pdc, "pdc_eejidjjag", mirror, ok, 976562500 sectors, data@ 0
/dev/sdb: pdc, "pdc_eejidjjag", mirror, ok, 976562500 sectors, data@ 0
[root@linuxbox root]# dmraid -ay -v
INFO: Activating mirror RAID set "pdc_eejidjjag"
ERROR: dos: partition address past end of RAID device
[root@linuxbox root]# ls -l /dev/mapper/
total 0
crw------- 1 root root 10, 63 Jul 5 16:59 control
brw-rw---- 1 root disk 253, 0 Jul 6 03:11 pdc_eejidjjag
[root@linuxbox root]# fdisk -l /dev/mapper/pdc_eejidjjag
Disk /dev/mapper/pdc_eejidjjag: 500.0 GB, 500000000000 bytes
255 heads, 63 sectors/track, 60788 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Device Boot Start End Blocks Id System
/dev/mapper/pdc_eejidjjag1 1 60801 488384001 fd Linux raid autodetect

I cannot mount /dev/mapper/pdc_eejidjjag1:
[root@linuxbox root]# mount -v -t auto /dev/mapper/pdc_eejidjjag1 /mnt/test
mount: you didn't specify a filesystem type for /dev/mapper/pdc_eejidjjag1
I will try all types mentioned in /etc/filesystems or
/proc/filesystems
Trying hfsplus
mount: special device /dev/mapper/pdc_eejidjjag1 does not exist
'fdisk -l /dev/mapper/pdc_eejidjjag' shows that one partition of type
0xfd (Linux raid autodetect) is filling the disk. Surely this should be
type 0x83, since the device is the RAIDed disk as presented to the user?
And why does mount say the device /dev/mapper/pdc_eejidjjag1 does not
exist?

Because this works differently. The problem is that the check for
partitions is done by the kernel. It seems that it is done before
assembly of the RAID array, and hence no partition discovery is done
for it.
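
The dmraid error can also be checked by hand from the figures quoted in the post. This is only a sketch using those numbers (the function name is hypothetical): it compares the end of the partition, at cylinder 60801 with 16065 sectors per cylinder, against the 976562500 sectors that dmraid reports for the RAID set.

```shell
# Sectors by which a partition ending at cylinder $1 overshoots a device
# of $3 sectors, given $2 sectors per cylinder. Negative means it fits.
overshoot_sectors() {
    echo $(( $1 * $2 - $3 ))
}

# End cylinder and cylinder size from fdisk, set size from "dmraid -r".
overshoot_sectors 60801 16065 976562500
```

A positive result means the partition table describes a partition that runs past the end of the device dmraid built. That would be consistent with the "partition address past end of RAID device" error, and it may be why no pdc_eejidjjag1 node was created.
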
This may be due to my unfamiliarity with dmraid. I can find little
about it on the internet. I'm uncertain if it is meant to be used in
conjunction with mdadm, or whether it's either/or. In the past, Linux
software RAID has Just Worked for me using mdadm.

By all means go back to mdadm. dmraid has no business being run
automatically. The people that configured it that way screwed up IMO.
If I disregard dmraid, disabling the array with 'dmraid -an /dev/md0'
and use the more familiar mdadm instead, first checking with fdisk that
the disks have the correct RAID autodetect partitions:
[root@linuxbox root]# fdisk -l /dev/sda
Disk /dev/sda: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Device Boot Start End Blocks Id System
/dev/sda1 1 60801 488384001 fd Linux raid autodetect
[root@linuxbox root]# fdisk -l /dev/sda
Disk /dev/sda: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Device Boot Start End Blocks Id System
/dev/sda1 1 60801 488384001 fd Linux raid autodetect
then try to assemble the RAID with those, it fails:
[root@linuxbox root]# mdadm -v --assemble /dev/md0 /dev/sda1 /dev/sdb1
mdadm: looking for devices for /dev/md0
mdadm: cannot open device /dev/sda1: No such device or address
mdadm: /dev/sda1 has no superblock - assembly aborted
Perhaps I should be using the raw devices?
[root@linuxbox root]# mdadm -v --assemble /dev/md0 /dev/sda /dev/sdb
mdadm: looking for devices for /dev/md0
mdadm: /dev/sda is identified as a member of /dev/md0, slot 0.
mdadm: /dev/sdb is identified as a member of /dev/md0, slot 1.
mdadm: added /dev/sdb to /dev/md0 as 1
mdadm: added /dev/sda to /dev/md0 as 0
mdadm: /dev/md0 has been started with 2 drives.

So you definitely used the whole devices (a mistake with software RAID
IMO, but you can do it), and the partition tables are only left
because they have not yet been overwritten. They do confuse the
autodetection script, though.
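
The sizes in the post support this. Assuming the usual layout of a version 0.90 md superblock (device size rounded down to a 64 KiB boundary, minus one 64 KiB block reserved for the superblock), the usable size can be computed for both candidates. The function name below is only illustrative.

```shell
# Usable size in KiB of a member carrying a version 0.90 md superblock:
# round the device size (bytes, $1) down to a 64 KiB boundary, then
# subtract the 64 KiB block that holds the superblock.
md090_data_kib() {
    echo $(( ($1 / 65536 * 65536 - 65536) / 1024 ))
}

md090_data_kib 500107862016               # whole disk /dev/sda, bytes from fdisk
md090_data_kib $(( 488384001 * 1024 ))    # partition sda1, 1 KiB blocks from fdisk
```

The first call gives 488386496 KiB, which matches the Device Size shown by mdadm -E. The second gives a smaller figure, so the array cannot have been built on sda1.
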
[root@linuxbox root]# mdadm -E /dev/sda
/dev/sda:
Magic : a92b4efc
Version : 00.90.01
UUID : c4344083:a8d8cf32:3f00e0db:8765b21b
Creation Time : Thu Mar 22 15:26:52 2007
Raid Level : raid1
Device Size : 488386496 (465.76 GiB 500.11 GB)
Array Size : 488386496 (465.76 GiB 500.11 GB)
Raid Devices : 2
Total Devices : 2
Preferred Minor : 0
Update Time : Thu Jul 5 16:58:02 2007
State : clean
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0
Checksum : 864ad759 - correct
Events : 0.4

Number Major Minor RaidDevice State
this 0 8 0 0 active sync /dev/sda
0 0 8 0 0 active sync /dev/sda
1 1 8 16 1 active sync /dev/sdb
[root@linuxbox root]# mdadm -E /dev/sdb
/dev/sdb:
Magic : a92b4efc
Version : 00.90.01
UUID : c4344083:a8d8cf32:3f00e0db:8765b21b
Creation Time : Thu Mar 22 15:26:52 2007
Raid Level : raid1
Device Size : 488386496 (465.76 GiB 500.11 GB)
Array Size : 488386496 (465.76 GiB 500.11 GB)
Raid Devices : 2
Total Devices : 2
Preferred Minor : 0
Update Time : Thu Jul 5 16:58:02 2007
State : clean
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0
Checksum : 864ad76b - correct
Events : 0.4

Number Major Minor RaidDevice State
this 1 8 16 1 active sync /dev/sdb
0 0 8 0 0 active sync /dev/sda
1 1 8 16 1 active sync /dev/sdb
so that looks OK. Let's see what /dev/md0 looks like:
[root@linuxbox root]# fdisk -l /dev/md0
Disk /dev/md0: 500.1 GB, 500107771904 bytes
255 heads, 63 sectors/track, 60801 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Device Boot Start End Blocks Id System
/dev/md0p1 1 60801 488384001 fd Linux raid autodetect

You do not have that partition! Unless you did partition /dev/md0?
If not, this is leftover junk from your first partitioning that you
then did not use. It confuses dmraid and should be removed, see below.
That doesn't look right; I would have expected to see a partition of
type 0x83, since /dev/md0p1 is the RAID as presented to the user
according to fdisk. Trying to mount it anyway:
[root@linuxbox root]# mount -v -t auto /dev/md0 /mnt/test
mount: you didn't specify a filesystem type for /dev/md0
I will try all types mentioned in /etc/filesystems or
/proc/filesystems
Trying hfsplus
mount: you must specify the filesystem type
[root@linuxbox root]# mount -v -t auto /dev/md0p1 /mnt/test
mount: you didn't specify a filesystem type for /dev/md0p1
I will try all types mentioned in /etc/filesystems or
/proc/filesystems
Trying hfsplus
mount: special device /dev/md0p1 does not exist
mdadm --examine /dev/sd* shows both members of the array as correct,
with the same serial number. "cat /proc/mdstat" shows the array as
complete and OK with two members as expected.
/proc/mdstat shows:
[root@linuxbox root]# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sda[0] sdb[1]
488386496 blocks [2/2] [UU]
unused devices: <none>
I'm confused. I can't find much information on dmraid; the man page
seems to imply that it's for use with hardware RAID controllers, and I
don't know if I should be using that or mdadm, or both. Previously I
just used mdadm and everything Just Worked.
I don't know why assembling and starting the array doesn't present the
contents of the md device as expected,

Why, but it does? You said that you created an ext3 on it, so why
not just mount /dev/md0 directly? I think you have indeed gotten a
bit confused (understandably. And maybe a bit panicked too...), and
may have forgotten what you said at the top of this posting ;-)
and why fdisk shows special
devices in /dev which the mount command says don't exist.

The mount command does say they exist. However it cannot ID the
filesystem on them. No wonder, since there isn't one there.
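
If mount cannot identify the filesystem, one way to check is to look for the ext2/ext3 magic number yourself: 0xEF53, stored little-endian at byte offset 1080 from the start of the device. A minimal sketch with a made-up function name follows; it only reads two bytes and writes nothing.

```shell
# Print "ext" if the device or image file $1 carries the ext2/ext3 magic
# number (bytes 53 ef at offset 1080), otherwise print "none".
probe_ext_magic() {
    magic=$(od -An -tx1 -j1080 -N2 "$1" 2>/dev/null | tr -d ' \n')
    if [ "$magic" = "53ef" ]; then
        echo ext
    else
        echo none
    fi
}
```

Running `probe_ext_magic /dev/md0` would be expected to print ext if the ext3 filesystem was made directly on the array. In that case a read-only attempt such as `mount -t ext3 -o ro /dev/md0 /mnt/test` is a cautious first step.
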
The user of the machine is getting worried as there's a lot of data on
this array, and of course, he has no backup.

Well, always the same story. There is no excuse for not having a
backup...
I'm at the point of taking the disks out and trying them in a machine
running FC4. Any ideas or suggestions please before I do that?

Mount /dev/md0 directly. It should have your ext3. However it is
important that you remove the bogus partition table. Easiest way to do
that is as follows:

0. (Optionally) disable the unhelpful dmraid boot script
1. Get the thing to work again, then make a full backup.
2. Degrade the array by setting sdb as faulty
3. Remove sdb from the array
4. Partition sdb with one large partition of type 0xfd
   Reboot if fdisk could not get the kernel to reload the partition table.
5. Make a degraded RAID1 on /dev/sdb1 as md1 (specify the
   second disk as "missing" to mdadm)
6. Make a filesystem on /dev/md1 and copy all data over from /dev/md0
7. Stop /dev/md0, and create a partition on sda similar to the one on sdb
   Reboot if fdisk told you it could not reload the partition table.
8. Add /dev/sda1 to /dev/md1
9. Adjust /etc/fstab as needed

You should now have a partition on sda and one on sdb, both set to be
auto-started as /dev/md1 by the kernel.
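
Steps 2 to 8 could be sketched roughly as below. This is a dry run that only prints commands and executes nothing; the function name and the mount points /mnt/old and /mnt/new are assumptions for illustration, and the fdisk steps are interactive. Check each command against the mdadm man page on the system in question, and do the backup from step 1 first.

```shell
# Print (do not run) the commands for moving a whole-disk RAID1 onto
# partitions. $1 = old array, $2 = new array,
# $3 = disk that stays in the old array at first, $4 = disk moved first.
migration_plan() {
    cat <<EOF
mdadm $1 --fail $4
mdadm $1 --remove $4
fdisk $4    # interactively: one partition, type fd
mdadm --create $2 --level=1 --raid-devices=2 ${4}1 missing
mkfs.ext3 $2
mount -o ro $1 /mnt/old && mount $2 /mnt/new
cp -a /mnt/old/. /mnt/new/
umount /mnt/old
mdadm --stop $1
fdisk $3    # same layout as on $4
mdadm $2 --add ${3}1
EOF
}

migration_plan /dev/md0 /dev/md1 /dev/sda /dev/sdb
```

Printing the plan rather than running it leaves room to review every line, which seems sensible while the only copy of the data is on these two disks.
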

BTW, you can do this whole operation with a Knoppix CD or memory stick,
you just need to load the RAID kernel modules manually.

Arno
 