Cluster hangs

Matt · Mar 24, 2004

I have a 2 node Win2K SP4 cluster I'm using for file
sharing. The backend is FC SAN storage (IBM FastT).
Occasionally, the one node simply hangs. Not a blue
screen, but it is completely unresponsive to anything
except a ping - even the console is useless.

Somehow the other cluster node can't figure out that the
1st one is hung, so my file shares become unavailable. I
have to hard reset ("white button reboot") the 1st node in
order to get the resources to fail over, and to "fix" the
1st node.

There isn't squat in the system event log. The cluster
log has a couple of messages as follows right ahead of the
failure:

File Share <XXXXXXX>: Share has gone offline, Error=64 !

I've scoured TechNet and Google and haven't been able to
find anything relevant to this message.

Anyone have any ideas where else I can look?

TIA,
Matt

John Toner [MVP] · Mar 24, 2004

Matt,

net helpmsg 64 = "The specified network name is no longer available."
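
For anyone decoding similar cluster log lines away from a Windows console, here is a minimal sketch in Python. It assumes the `Error=NN` format shown in the log line quoted above; the names `KNOWN_ERRORS` and `explain_error` are made up for illustration, and the table lists only a few well-known Win32 codes. `net helpmsg` remains the authoritative lookup.

```python
import re

# A few well-known Win32 error codes; anything else should be
# looked up with `net helpmsg <code>` on a Windows machine.
KNOWN_ERRORS = {
    5: "Access is denied.",
    53: "The network path was not found.",
    64: "The specified network name is no longer available.",
}

# Matches the "Error=64" fragment seen in the cluster log line.
ERROR_RE = re.compile(r"Error=(\d+)")


def explain_error(log_line):
    """Return (code, message) for a log line, or None if no code is present."""
    match = ERROR_RE.search(log_line)
    if match is None:
        return None
    code = int(match.group(1))
    return code, KNOWN_ERRORS.get(code, "unknown; try net helpmsg")


line = "File Share <XXXXXXX>: Share has gone offline, Error=64 !"
print(explain_error(line))
```

Running this over the whole cluster log would show whether error 64 is the only code appearing before each hang.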

What are you pinging? Host IP? How about cluster IP? Can clients still
access the file share resources before the "hard reset"?

This is likely an issue with the host rather than with the cluster, unless
the "hanging" is only occurring in Cluster Administrator. If the whole host
is hung, that points to other issues.

Regards,
John

Guest · Mar 24, 2004

Users are not able to access the file shares until the
failed node is reset. The functioning node seems to have
no idea that the failed node is gone, and therefore does
not assume control of the failed resources until the
failed box is actually powered off.

The IP address that is pingable is the host IP of the
failed node. I did not try pinging the cluster group IP
of one of the failed groups.

I guess what I'm curious about is this: if an entire node
has hung (not just clustering), how can the remaining node
possibly not know about it?

Jack Wang [MSFT] · Mar 26, 2004

Hi

Thank you for the update.

Since the node can still be pinged, I think the other node still assumes
that it is active. I suggest you unplug the network cable of the hung node
to check whether failover works.
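
To illustrate why a ping reply says little about whether a node is actually serving files, here is a rough sketch of a share-level probe in Python. It is not how the Cluster service detects failures; it is only a monitoring idea. The function name `share_responds`, the timeout value, and the example UNC path are all assumptions made for illustration.

```python
import os
import threading


def share_responds(path, timeout_s=5.0):
    """Return True if listing `path` completes within timeout_s seconds.

    A hung file server may accept the request and never answer, so the
    listing runs in a daemon thread that we simply stop waiting for.
    """
    result = {}

    def probe():
        try:
            os.listdir(path)
            result["ok"] = True
        except OSError:
            # Path missing, access denied, network name gone, etc.
            result["ok"] = False

    worker = threading.Thread(target=probe, daemon=True)
    worker.start()
    worker.join(timeout_s)
    # No entry in `result` means the listing is still blocked: treat as down.
    return result.get("ok", False)


# Hypothetical usage against a clustered share:
# share_responds(r"\\fileserver\share", timeout_s=10.0)
```

A probe like this, run from a third machine, would report the share as down while the hung node keeps answering pings.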

Also, when you open the Cluster Administrator console on the working node,
can you see the node that hangs? If so, please check whether you can move
the groups to this node manually.

In addition, you may use the MPSRPT_CLUSTER.EXE tool to collect system
information from the node while it is working.

Microsoft Product Support's Reporting Tools
http://www.microsoft.com/downloads/details.aspx?FamilyID=cebf3c7c-7ca5-408f-88b7-f9c79b7306c0&DisplayLang=en

I am looking forward to your reply!

Sincerely,
Jack Wang, MCSE 2000, MCSA, MCDBA, MCSD
Microsoft Partner Support

