Cluster crashed, need recovery advice


Tonny van Geloof

Hello everybody

I've got a serious problem with a W2K cluster.
I hope there is someone out there who can offer some advice.

I've got a SAN with 11 logical disks defined, which are accessed by a
cluster of 2 physical nodes.
1 disk is the quorum, the other 10 are data disks.
There are 3 cluster groups:
1: only the quorum drive
2: 5 of the data disks
3: the other 5 data disks
Normally node 1 runs cluster groups 1+2 and node 2 runs cluster group 3,
for load balancing purposes.


A couple of days ago the Cluster Service itself crashed on node 2,
causing a failover of cluster group 3 to node 1.
So far no harm done.
When I noticed this (it happened overnight) I decided to reboot node 2.
Assuming everything would come back online, I would just have to do
a failover of cluster group 3.
To my horror the Cluster Service on node 1 died when node 2 tried to
rejoin the cluster.
All 11 disks were gone.

After a lot of searching I figured out that somehow the administration
in the "Cluster Disk Device Driver" had gotten screwed up.
Disabling that driver gave me back the disks, and all data turned out
to be safe.
After clearing the "Signatures" key in the registry and restarting
clusdisk.sys I still have access to those disks, as long as I do that
on 1 node while the other node is shut down. As soon as the second
node comes back online, all disks immediately disappear.
So I can run on one node only.
When I restart the Cluster Service everything works, but 2 of the
data disks remain in a Failed state in Cluster Administrator, although
those disks show up under "My Computer" and are accessible without any
problems.
Eventually I just copied the content of those 2 disks to other
locations on the other disks and changed the various cluster shares to
point to the new locations.

So I'm back online, with limited performance and no redundancy.

I will have to recover somehow... I came up with the following
strategy:

- First make absolutely sure I have a good backup (duh)
- Remove those 2 failed disks from the Cluster Config.
- Shut the Cluster Service down.
- Remove the disks from the Windows Device Manager.
- Shut down this node. Start up the other node (without starting the
Cluster Service there) and remove the 2 disks there as well.
- Shut down that node.
- On the SAN: low-level format those 2 disks.
- Start up the first cluster node without enabling the Cluster Service.
- Add the disks back to Windows, format, assign drive letters.
- Start up the second cluster node without enabling the Cluster Service.
- Add the disks back to Windows, assign drive letters.
- Restart the Cluster Service on node 1 and re-attach the 2 disks.
- Restart the Cluster Service on node 2.
- Fail over cluster group 3 to node 2.

If anything goes wrong again I should at least be able to get back to
the one-node cluster config I'm running on now.

Any comments, suggestions, bright ideas, advice are welcome.
 
When you "Shut down cluster service" you should also make sure that you
disable the cluster disk driver. You should also make sure that you only
have one node powered on when you're creating the disk. I'd modify your
procedure as follows:

- First make absolutely sure I have a good backup (duh)
- Remove those 2 failed disks from the Cluster Config.
- Shut the Cluster Service down *** and disable the Cluster Disk driver ***
- Remove the disks from the Windows Device Manager.
- Shut down this node. Start up the other node (without starting the Cluster
Service there) and remove the 2 disks there as well.
- *** Disable the Cluster Disk driver and *** shut down that node.
- On the SAN: low-level format those 2 disks.
- Start up the first cluster node without enabling the Cluster Service.
- Add the disks back to Windows, format, assign drive letters.
- *** Re-enable the Cluster Disk driver, then power down the first node ***
- Start up the second cluster node without enabling the Cluster Service.
- Add the disks back to Windows, assign drive letters.
- *** Re-enable the Cluster Disk driver, then shut down this node ***
- Restart the Cluster Service on node 1 and re-attach the 2 disks.
- Restart the Cluster Service on node 2.
- Fail over cluster group 3 to node 2.
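
For the service/driver steps, something like this from a command prompt
should do it. This is only a sketch: it assumes sc.exe is available (it
ships with the OS or the Resource Kit) and that the service names are
ClusSvc and ClusDisk, which is what they are on my boxes; if in doubt, use
the Services and Devices applets instead.

  rem stop the cluster service and disable the cluster disk driver
  net stop clussvc
  rem note the current START_TYPE first so you can restore it later
  sc qc clusdisk
  sc config clusdisk start= disabled

  rem later, when it's time to re-enable the driver and the service
  rem (use whatever START_TYPE "sc qc clusdisk" reported, normally system)
  sc config clusdisk start= system
  net start clussvc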

Regards,
John
 
Once again, John, thank you very much for your input.

I was aware of the Cluster Disk Driver, but I was uncertain of the
proper sequence of doing things. You just cleared that up for me.
(I was going to use the same sequence, since that seems the most
logical approach, but you never know...)

I will boot each node with the recovery console first, and the fibre
channel cables to the SAN disconnected, to make absolutely sure that
the driver and the cluster service are both disabled.
For the second node (that is now powered off) I have to do that anyway,
since I can't recall (it was 4:00 AM) in what state I shut it down.
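
From the Recovery Console that roughly comes down to the commands below.
listsvc shows the current start types, and disable prints the old start
type before changing it, so I can note them down and use enable to put
them back to the same values afterwards.

  listsvc
  disable clussvc
  disable clusdisk

(and later, for example: enable clusdisk SERVICE_SYSTEM_START)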

Regards

Tonny


When you "Shut down cluster service" you should also make sure that you
disable the cluster disk driver. You should also make sure that you only
have one node powered on when you're creating the disk. I'd modify your
procedure as follows:

- First make absolutely sure I have a good backup (duh)
- Remove those 2 fail disks from the Cluster Config.
- Shut the cluster service down *** and disable cluster disk driver ***
- Remove the disks from the Windows device manager.
- shutdown this node. startup the other node (without starting the cluster
service there) and remove the 2 disk there as well.
- *** Disable Cluster Disk driver and *** shutdown that node.
- On the SAN: Lowlevel format those 2 disks.
- Startup the first cluster node without enabling the cluster service
- Add the disks back to Windows, format, assign drive-letter
- *** re-enable cluster disk driver then power down first node ***
- Startup the second cluster node without enabling the cluster service
- Add the disks back to Windows, assign drive-letter
- *** re-enable cluster disk driver then shut down this node ***
- Restart the cluster service on node 1and re-attach the 2 disks.
- Restart the cluster service on node 2
- Failover of cluster group 3 to node 2.

Regards,
John
-- snip snip ----
 
Hi Everybody

Just wanted to let you know how I fared in getting my cluster back in
business.
Might help other people with similar problems.

I eventually managed to determine exactly how things went wrong.
The fix turned out to be quite simple.

What happened was this:

The cluster service on node 2 crashed, causing a failover.
So far no problem.
I rebooted node 2.

On startup the first thing the disk device driver does is check whether
there is a signature on the disk. If it is zero, it will generate a fresh
signature.
This happens even before clusdisk.sys is started and any checks for
concurrent use can be done.

This signature check has a major BUG:
Windows reads the first sector (the MBR) of the disk into a buffer
initialized to zero and then reads the 4-byte signature from that.
This read can return 3 possible states:
OK: The read is valid.
Error: The sector is not readable (e.g. the disk is bad).
Busy: The sector is not readable at this time because the disk, or bus, is
too busy to respond.

This last state (Busy) is not handled correctly by the disk driver. It
assumes that any value other than Error equals OK.
So it will happily use the zero bytes in the buffer as valid data and
will therefore think that the disk has no signature set.
So it will write a new signature to the disk. To make matters worse,
writes usually have priority over reads, so that write will get through.

This is exactly what happened to me: The 2 disks with the heaviest load
turned up as Busy, got assigned a new signature and became invalid for
the other cluster node.

So the fix was fairly simple:
Write the original signatures (obtained from the cluster's
registry hive) back with dumpcfg.exe from the Resource Kit.
Clear the Signatures registry key for clusdisk.sys on both nodes and reboot.
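
In case anyone needs to repeat this, the write-back looks roughly like the
commands below. I'm quoting the dumpcfg switch from memory, so check the
tool's own help for the exact syntax; ABCD1234 and disk number 3 are just
placeholders for your original signature and the physical disk it belongs
on.

  rem list the current signatures and physical disk numbers
  dumpcfg
  rem write the original signature back onto the right physical disk
  dumpcfg -s ABCD1234 3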

Everything back online :-)


Regards

Tonny
 
I have not done much with clustering, but could you explain how you found
out that "the 2 disks with the heaviest load turned up as Busy, got
assigned a new signature and became invalid for the other cluster node"?

And how do you do this: "Write the original signatures (obtained from the
cluster's registry hive) back with dumpcfg.exe from the Resource Kit, and
clear the Signatures registry key for clusdisk.sys on both nodes and
reboot"?
 
sthompson said:
I have not done much with clustering, but could you explain how you found
out that "the 2 disks with the heaviest load turned up as Busy, got
assigned a new signature and became invalid for the other cluster node"?

And how do you do this: "Write the original signatures (obtained from the
cluster's registry hive) back with dumpcfg.exe from the Resource Kit, and
clear the Signatures registry key for clusdisk.sys on both nodes and
reboot"?

How I found out...

The Cluster Service has a separate registry hive, which is in the file
\winnt\cluster\clusdb
When the service is not running you can open this file in regedt32.exe.
When it is running you find the keys under HKEY_LOCAL_MACHINE\Cluster in
the registry.

Under the resources of type "Physical disk" you will find the Signature
for each disk (a 4-byte hex number).

This is the signature that the Cluster wants to see in order to bring
the resource online.

In the Windows 2000 Server Resource Kit there are 2 command-line tools:
- dumpcfg.exe, which shows the current signatures of all disks in the
system and which can set a new signature on a disk.
- dmdiag, which shows a lot of info on all disks in the system,
including how the mappings between the various drivers are organised.

Comparing the cluster values to the dumpcfg output showed me the
discrepancy.
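
If you want to do the same check, it comes down to something like this
(the exact resource key names are GUIDs, so yours will differ):

  rem signatures as Windows currently sees them
  dumpcfg
  rem extra detail on the disks and how the driver mappings are organised
  dmdiag

  rem expected signatures: in regedt32, look under
  rem HKEY_LOCAL_MACHINE\Cluster\Resources\<resource GUID>
  rem each resource of type Physical Disk keeps its Signature under its
  rem Parameters subkey (on my nodes anyway)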

I know that the only way the signature of a disk can change is if
Windows somehow decides it's a new disk.

The company that I work for has access to the W2K sources. Analysing the
relevant parts of the disk-driver showed me the error in handling the
busy condition.

Fixing the signature itself was just a matter of using dumpcfg to write
the original signature back.

The only other thing to do is to clear the locking info kept by clusdisk.sys.

Disable that device driver. Reboot. In the registry, under
CurrentControlSet\ClusDisk, remove the Signatures key.
Re-enable the driver and reboot again.
This has to be done on each node while the other node is down.
After this is done on both nodes you can start 1 node.
Restart the cluster service there.
When it is online you can restart the cluster service on the second node.
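
Put together as a per-node sequence (with the other node powered off) it
looks roughly like this. sc.exe or the Devices applet both work for the
driver; the full key path shown is where the Signatures key lives on my
nodes, and System was clusdisk's original start type here - check with
"sc qc clusdisk" before disabling it if you want to be sure.

  rem 1) disable the cluster disk driver and reboot the node
  sc config clusdisk start= disabled

  rem 2) after the reboot, in regedt32 delete the key
  rem    HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\ClusDisk\Parameters\Signatures

  rem 3) re-enable the driver and reboot once more
  sc config clusdisk start= system

  rem repeat 1-3 on the other node, then start one node only and bring the
  rem cluster service up there; once it is online, do the same on node 2
  net start clussvc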
 