Help! win 2k3 cluster becoming unresponsive EMERGENCY!

  • Thread starter: gpalickar

Hi everybody,

I am experiencing a serious issue with our new Windows Server 2003
Enterprise cluster. At random times, seemingly under higher workloads
(though we never really see even 20% utilization at full tilt), our
cluster stops responding, we lose all our shares, and business grinds
to a halt.

We are running a two-node Microsoft cluster; the servers are Top and
Bot, and the cluster name is HAL. We need to run File Services for
Macintosh (sfmsrv.sys). Top is normally in control of the cluster. When
we have the issue, HAL just goes away: no Windows or Mac shares are
available. Top also becomes unavailable on the network; you can't even
map to the administrative shares on the box. You can get on Top's
console, but you cannot open the Cluster Manager, and you cannot open
the Services console to try to restart the cluster service or any other
service, for that matter.

Now, you would think that the Bot server would take over, but that is
not the case. When you log in to Bot, you still cannot open the
Cluster Manager to fail the cluster over to Bot and get on with
business. The only way we can get HAL back online is to power down Top.
Once Top goes away, you can get Bot to take over and bring HAL back up.

The Mac file service is not supported in the cluster environment, so we
have to manually start the service on Bot to get our Mac shares up. It
seems to take a VERY long time (20 minutes or so) for all the shares to
become available.

Any help at all would be appreciated, as we don't know when it will
fail on us again.
 
I don't know if anyone is looking at this, but I'll tell my story
anyway. It seems that at least some of our meltdowns are directly
related to File Services for Macintosh (sfmsrv.sys). When we take
the cluster offline on the active node, all the shares go away. This
is normal. When we bring the cluster online again, all the PC shares
come right back, almost instantly. The Mac shares are still
unavailable. We can then stop File Services for Mac. When we then look
in Computer Management, it still says that lots of Macs are connected,
although we know that they are not. When we start File Services for
Mac again, the Mac shares take a really long time to come up, and
sometimes the server will bomb before all of them appear.

We have been running for almost a week now, and I really think that you
just cannot restart sfmsrv without making the server unstable. Does
anyone out there have something to add?
 
Are you aware that File Services for Macintosh is not cluster aware?

The reason your Mac services take a long time to come back online is
that somewhere in your process you're effectively crashing the
service and corrupting the Mac volume. When SFM restarts, it detects
the corrupt volume and rebuilds the index. Depending on the number of
files in your volumes and the speed of your server, the indexing can
take a while (I've seen up to an hour for normal Mac volumes with up to
65,000 files).
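
To get a rough feel for how big a rebuild might be, it can help to know
how many files each Mac volume holds. The sketch below is only an
illustration and makes some assumptions: the function name count_files
is made up for this example, and the path you pass in is whatever
folder backs your Mac volume. It simply walks the directory tree and
counts files so the total can be compared with the 65,000-file figure
above. It does not talk to SFM and does not predict how long indexing
will take.

```python
import os


def count_files(root):
    """Count the non-directory entries found under root, recursively.

    This is a plain directory walk; it knows nothing about SFM or its
    volume index. Unreadable subdirectories are silently skipped, which
    is the default behaviour of os.walk.
    """
    total = 0
    for _dirpath, _dirnames, filenames in os.walk(root):
        total += len(filenames)
    return total
```

Calling count_files on the folder behind each Mac volume (for example
from a Python prompt on the server) gives one number per volume. A
volume with far more files than the others is the likely candidate for
the longest wait after a restart.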

If you need cluster-aware Mac file services, I suggest you look at Group
Logic's ExtremeZ-IP. You can download a 30-day trial from
http://www.grouplogic.com.

Hope this helps! bill
 