Tonny van Geloof
Hello everybody
I've got a serious problem with a W2K cluster.
I hope there is someone out there who can offer some advice.
I've got a SAN with 11 logical disks defined, which are accessed by a
cluster of 2 physical nodes.
1 disk is the quorum disk, the other 10 are data disks.
3 cluster groups:
1: only the quorum drive
2: 5 of the data disks
3: the other 5 data disks
Normally node 1 runs cluster groups 1 and 2, and node 2 runs cluster
group 3, for load-balancing purposes.
A couple of days ago the Cluster Service itself crashed on node 2,
causing a failover of cluster group 3 to node 1.
So far no harm done.
When I noticed this (it happened overnight) I decided to reboot node 2.
Assuming everything would come back online, I would just have to do
a failover on cluster group 3.
To my horror, the Cluster Service on node 1 died when node 2 tried to
rejoin the cluster.
All 11 disks were gone.
After a lot of searching I figured out that somehow the internal
bookkeeping of the Cluster Disk Device Driver (clusdisk.sys) had become
corrupted. Disabling that driver gave me back the disks, and all the
data turned out to be safe.
Clearing the "Signatures" key in the registry and restarting
clusdisk.sys I still have access to those disks, as long as I do that
on 1 node while the other node is shut down. As soon as the second
node comes back online all disks immediatly disappear.
So I can run on one node only.
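For reference, the workaround looks roughly like this from a command
prompt on the surviving node (the Signatures key path below is what I
found on my W2K nodes; double-check it on yours and export it first
before deleting anything):

  rem Stop the cluster service and the cluster disk driver (I'm assuming
  rem net.exe can stop the ClusDisk driver here; if not, disable it from
  rem Device Manager and reboot instead)
  net stop clussvc
  net stop clusdisk

  rem Back up the recorded disk signatures before clearing them
  regedit /e C:\clusdisk-signatures.reg "HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\ClusDisk\Parameters\Signatures"

  rem Delete the values under that Signatures key by hand in regedt32,
  rem then bring the driver and the cluster service back up
  net start clusdisk
  net start clussvc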
When I restart the Cluster Service everything works, but 2 of the
data disks remain in a Failed state in Cluster Administrator, although
those disks show up under "My Computer" and are accessible without any
problems.
Eventually I just copied the contents of those 2 disks to locations on
the other data disks and changed the various cluster shares to point to
the new locations.
So I'm back online, with limited performance and no redundancy.
I will have to recover somehow. I came up with the following strategy
(a rough command sketch follows the list):
- First make absolutely sure I have a good backup (duh)
- Remove those 2 failed disks from the cluster configuration.
- Shut the cluster service down
- Remove the disks from the Windows Device Manager.
- Shut down this node, start up the other node (without starting the
cluster service there) and remove the 2 disks there as well.
- Shut down that node.
- On the SAN: low-level format those 2 disks.
- Start up the first cluster node without enabling the cluster service
- Add the disks back to Windows, format them, assign drive letters
- Start up the second cluster node without enabling the cluster service
- Add the disks back to Windows, assign drive letters
- Restart the cluster service on node 1 and re-attach the 2 disks.
- Restart the cluster service on node 2
- Failover of cluster group 3 to node 2.
If anything goes wrong again, I should at least be able to get back to
the one-node cluster config I'm running on now.
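For the "without enabling the cluster service" steps above, this is
roughly what I intend to run on each node before and after its reboot
(the Services snap-in works just as well; as far as I know sc.exe is a
Resource Kit tool on W2K, so treat that line as optional):

  rem Before rebooting: stop the cluster service and keep it from starting
  rem at the next boot (set ClusSvc to Manual in the Services snap-in, or
  rem with the Resource Kit's sc.exe: sc config clussvc start= demand)
  net stop clussvc

  rem ...reboot, remove / re-add / format the disks as per the steps above...

  rem Once the disks look good again, set ClusSvc back to Automatic
  rem in the Services snap-in and start it
  net start clussvc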
Any comments, suggestions, bright ideas, or advice are welcome.