W2k cluster stops serving resources unexpectedly

  • Thread starter: goood

Hello,
we have two HP DL servers running W2k AS. The storage is an XP 512,
also from HP. The cluster hosts only file shares.
The problem is that both cluster nodes unexpectedly stop responding to
client file service requests, at irregular intervals, without failing
over to the other node. Simply "nothing" happens on the cluster. The
Cluster service appears to be running the whole time (according to the
Services snap-in). Typing "cluster res" on the command line returns
normal output: all resources are online and on the correct node. We
have to restart both nodes to get the cluster up and serving files
again.

Here is a part of cluster.log concerning the last "crash-time":

00000874.0000106c::2004/04/02-05:27:37.923 [GUM] GumSendUpdateOnVote: Type=0 Context=12
00000874.0000106c::2004/04/02-05:27:37.923 [GUM] GumSendUpdateOnVote: Collect Vote at Sequence=6119
00000874.0000106c::2004/04/02-05:27:37.923 [GUM] GumVoteUpdate: Dispatching vote type 0 context 12 to node 1
00000874.0000106c::2004/04/02-05:27:37.923 [GUM] GumSendUpdateOnVote: Decision Routine returns=183
00000874.0000106c::2004/04/02-05:27:37.923 [GUM] GumSendUpdateOnVote: Returning status=0
00000874.00000a64::2004/04/02-08:50:17.501 [DM]DmpCheckpointTimerCb- taking a checkpoint
00000874.00000a64::2004/04/02-08:50:17.501 [LM] LogReset entry...
00000874.00000a64::2004/04/02-08:50:17.501 [LM] LogpReset entry...
00000874.00000a64::2004/04/02-08:50:17.517 [LM] LogpCreate : Entry
00000874.00000a64::2004/04/02-08:50:17.517 [LM] LogpMountLog : Entry pLog=0x04511198
00000874.00000a64::2004/04/02-08:50:17.517 [LM] LogpMountLog::Quorumlog File size=0x00000000
00000874.00000a64::2004/04/02-08:50:17.517 [LM] LogpInitLog : Entry pLog=0x04511198
00000874.00000a64::2004/04/02-08:50:17.532 [LM] LogpAppendPage : Writing 1024 bytes to disk at offset 0x00000000
00000874.00000a64::2004/04/02-08:50:17.548 [LM] LogpInitLog : NextLsn=0x00000408 FileAlloc=0x00000800 ActivePageOffset=0x00000400
00000874.00000a64::2004/04/02-08:50:17.548 [LM] LogpCreate : Exit with success
00000874.00000a64::2004/04/02-08:50:17.564 [LM] LogGetLastChkPoint:: Entry
00000874.00000a64::2004/04/02-08:50:17.595 [LM] LogGetLastChkPoint: ChkPt File Q:\MSCS\chk17DC.tmp ChkPtSeq=6108 ChkPtLsn=0x00000408 Checksum=104661
00000874.00000a64::2004/04/02-08:50:17.595 [LM] LogGetLastChkPoint exit, returning 0x00000000
00000874.00000a64::2004/04/02-08:50:17.595 [LM] LogCheckPoint entry
00000874.00000a64::2004/04/02-08:50:17.610 [DM] DmpGetSnapShotCb: DmpGetDatabase returned 0x00000000
00000874.00000a64::2004/04/02-08:50:17.610 [LM] DmpGetSnapshotCb: Checkpoint file name=Q:\MSCS\chk17DC.tmp Seq#=6108
00000874.00000a64::2004/04/02-08:50:17.642 [LM] LogCheckPoint: ChkPtFile=Q:\MSCS\chk17DC.tmp Chkpt Trid=6108 CheckSum=105158
00000874.00000a64::2004/04/02-08:50:17.642 [LM] LogFlush : pLog=0x04511198 writing the 1024 bytes for active page at offset 0x00000400
00000874.00000a64::2004/04/02-08:50:17.642 [LM] LogCheckPoint: EndChkpt written. EndChkPtLsn =0x00000438 ChkPt Seq=6108 ChkPt FileName=Q:\MSCS\chk17DC.tmp
00000874.00000a64::2004/04/02-08:50:17.642 [LM] LogpCheckpoint : Writing 1024 bytes to disk at offset 0x00000000
00000874.00000a64::2004/04/02-08:50:17.657 [LM] LogCheckPoint Exit
00000874.00000a64::2004/04/02-08:50:17.657 [LM] LogGetLastChkPoint:: Entry
00000874.00000a64::2004/04/02-08:50:17.657 [LM] LogGetLastChkPoint: ChkPt File Q:\MSCS\chk17DC.tmp ChkPtSeq=6108 ChkPtLsn=0x00000408 Checksum=105158
00000874.00000a64::2004/04/02-08:50:17.657 [LM] LogGetLastChkPoint exit, returning 0x00000000
00000874.00000a64::2004/04/02-08:50:17.673 [LM] LogpReset exit, returning 0x00000000
00000874.00000a64::2004/04/02-08:50:17.673 [LM] LogReset exit, returning 0x00000000
00000870.0000086c::2004/04/02-10:20:07.734

That's all. Q: is the quorum disk.

Has anybody had the same problem, perhaps with different hardware?

rgds
R.J.
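
When reading a cluster.log excerpt like the one in this post, it can help to split each entry into its fields and list the long silences between entries. The following is only a minimal sketch in Python. It assumes the layout visible in the excerpt (two hexadecimal IDs, which appear to be process and thread, then a timestamp, then an optional bracketed component and message) and assumes that a line without that prefix is the wrapped tail of the previous entry. The function names and the 30-minute threshold are illustrative choices, not anything the Cluster service defines.

```python
import re
from datetime import datetime, timedelta

# Assumed layout: <hex id>.<hex id>::<timestamp> [<component>] <message>
LINE_RE = re.compile(
    r"^(?P<pid>[0-9a-fA-F]{8})\.(?P<tid>[0-9a-fA-F]{8})::"
    r"(?P<ts>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})"
    r"(?:\s*\[(?P<component>\w+)\]\s*(?P<message>.*))?$"
)


def parse_entries(lines):
    """Parse log lines, folding wrapped continuation lines into their entry."""
    entries = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        match = LINE_RE.match(line)
        if match:
            entry = match.groupdict()
            entry["time"] = datetime.strptime(entry.pop("ts"), "%Y/%m/%d-%H:%M:%S.%f")
            entry["message"] = entry["message"] or ""
            entries.append(entry)
        elif entries:
            # No timestamp prefix: treat it as the wrapped tail of the previous entry.
            entries[-1]["message"] = (entries[-1]["message"] + " " + line).strip()
    return entries


def quiet_periods(entries, threshold=timedelta(minutes=30)):
    """Return (start, end) pairs where consecutive entries are further apart than threshold."""
    return [
        (a["time"], b["time"])
        for a, b in zip(entries, entries[1:])
        if b["time"] - a["time"] > threshold
    ]


# A few lines copied from the excerpt above, still in their wrapped form.
sample = """\
00000874.0000106c::2004/04/02-05:27:37.923 [GUM] GumSendUpdateOnVote:
Returning status=0
00000874.00000a64::2004/04/02-08:50:17.501 [DM]DmpCheckpointTimerCb-
taking a checkpoint
00000874.00000a64::2004/04/02-08:50:17.673 [LM] LogReset exit,
returning 0x00000000
00000870.0000086c::2004/04/02-10:20:07.734
""".splitlines()

entries = parse_entries(sample)
for start, end in quiet_periods(entries):
    print(start, "->", end, "gap", end - start)
```

A long gap by itself does not prove a hang, since an idle cluster may simply log little. It only marks the time windows worth comparing against the event logs and the client reports.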
 
We had the same sort of issue. Check whether your event log contains
an error with Event ID 2011 ("IRPStackSize" too small). We called tech
support and they pointed us to article 177078:
http://support.microsoft.com/default.aspx?scid=KB;EN-US;q177078

We were missing the IRPStackSize parameter altogether. We added it to
the registry, rebooted, and the cluster has been stable since. Hope
this helps.

Michael
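
For anyone wanting to try the same change, the registry entry can be written out as a .reg fragment and reviewed before it is imported. The Python sketch below only builds the text; it does not touch the registry. The key path (the Server service's Parameters key), the DWORD type, and the 11 to 50 range are how I understand Microsoft's description of IRPStackSize for Windows 2000, so treat them as assumptions and check them against the KB article. The thread does not say which value was used; the 15 in the example is simply what I believe the documented default to be, and the function name is made up for this illustration.

```python
# Assumed location of the Server service parameters; verify against the KB article.
KEY = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\LanmanServer\Parameters"

# Assumed valid range for IRPStackSize on Windows 2000.
MIN_IRP, MAX_IRP = 11, 50


def irp_stack_size_reg(value):
    """Return the text of a .reg fragment that sets IRPStackSize to value."""
    if not MIN_IRP <= value <= MAX_IRP:
        raise ValueError(f"IRPStackSize {value} is outside {MIN_IRP}..{MAX_IRP}")
    return "\n".join([
        "Windows Registry Editor Version 5.00",
        "",
        f"[{KEY}]",
        # .reg files store DWORD values as eight hexadecimal digits.
        f'"IRPStackSize"=dword:{value:08x}',
        "",
    ])


print(irp_stack_size_reg(15))
```

As noted in the reply above, a reboot followed the registry change before the cluster became stable.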

 
We have had some similar issues. Our environment is 2 x HP DL380 G3s
with 2.5GB RAM in an Active/Passive cluster, an HP (rebadged Brocade)
SAN fabric, and an HP MSA100 SAN, serving 1200 users for file only, no
print. You need to look in the event logs, but we were helped somewhat
by the registry change in Microsoft Knowledge Base article 317249.

You should also look at any post-SP4 KB articles, as there are a
number of other potential causes.

Finally, make sure that XP clients are at least at SP1. There are also
a number of post-SP1 fixes, mainly around SMB.

The bad news is that while our symptoms are relieved somewhat, they
are not resolved. A Premium Support call to MS drew a blank after
extensive investigation.
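
Since both replies start from the event logs, one way to check several nodes quickly is to export each System log to a comma-delimited file and filter it for the IDs of interest. The Python sketch below assumes a CSV export with a header row and an "Event" column; that column layout, the function name, and the sample rows are hypothetical and would need adjusting to the real export. Only Event ID 2011 comes from this thread.

```python
import csv
import io


def find_events(csv_text, wanted_ids):
    """Return the rows whose 'Event' column matches one of wanted_ids."""
    wanted = {str(event_id) for event_id in wanted_ids}
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if row["Event"].strip() in wanted]


# Hypothetical rows for illustration only; they are not taken from the thread.
sample = """Date,Time,Source,Type,Event,Computer
4/2/2004,05:30:00,ExampleSource,Error,2011,NODE1
4/2/2004,08:50:17,ExampleSource,Information,1000,NODE1
4/2/2004,10:20:07,ExampleSource,Error,2011,NODE2
"""

for row in find_events(sample, [2011]):
    print(row["Date"], row["Time"], row["Computer"], row["Event"])
```

Passing more IDs in the list extends the same filter to whatever other events the KB articles mention.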
 
Hi,
My office has similar problems. My environment is 2 x HP LXr8500
(each with 2 PIII 700MHz CPUs), 2GB RAM, and an R/S 12 SEP disk array.
This is happening quite frequently. I hope it can be resolved ASAP.
 