Ryan Bruins
I have an NLB (Network Load Balancing) cluster set up with two
Windows 2000 servers. Both servers have identical hardware
(Xeon 2.2 GHz, 1 GB RAM, RAID 5, 2x 100 Mb NICs) and software
configurations. The software being run is IIS 5 and the Borland
Socket Service (used to call custom COM+ RPC applications over
the internet).
The cluster ran great for about three months; then suddenly one
of the nodes (the first node in the cluster) started acting
strangely. The cluster balances requests quite evenly, sending
every other request to one server or the other. The symptom we
found is that every other socket connection request was taking
more than 100x longer than normal (i.e. a query that would
normally run in 1 to 2 seconds was taking 30+ seconds on every
other request).
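
The pattern is easy to see with even a crude timing loop against the
cluster's virtual IP. Here is a rough sketch of that kind of test (the
address and port below are placeholders, and it only times a bare TCP
connection to the socket service rather than a full query):

    import socket
    import time

    CLUSTER_VIP = "192.168.1.50"  # placeholder: the NLB cluster's virtual IP
    PORT = 211                    # placeholder: the socket service's listening port

    # Open sequential connections and time each one. On a healthy cluster
    # every connect completes in milliseconds; the symptom here is that
    # roughly every other attempt takes 30+ seconds.
    for i in range(10):
        start = time.time()
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            s.connect((CLUSTER_VIP, PORT))
            print(f"request {i + 1}: connected in {time.time() - start:.2f}s")
        finally:
            s.close()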
We figured out that all of the requests that were taking an
extremely long time were going to Node1. The first thing I
checked was whether any processes on Node1 were taking up CPU or
memory, and there was nothing unusual (under peak load, neither
node currently goes above 3% CPU or 25% physical memory). The
processes running on both servers appear identical, and I
double-checked that the database connections (CA/400 ODBC) are
set up identically and connect without problems from the console.
I initially thought it was a problem with my custom COM+
applications calling the queries, but then I watched the Borland
Socket Service monitor while I made requests to the cluster and
found something interesting. (I ran the test at a time when I was
the only user.) On Node2, a request to the socket service would
open a socket almost instantly; the COM+ application would
process it, send a response, and close the socket in about one
second. On Node1, I would make a request from a client, see a
quick blip of traffic on the cluster NIC of the Node1 server, and
then that NIC would sit with no traffic for almost 30 seconds
before traffic started again and a socket connection appeared on
Node1. The COM+ application ran just as fast as it did on Node2,
but something was delaying the establishment of the socket
connection.
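
To separate the two phases, the connection setup and the request itself
can be timed independently from a client. A minimal sketch, assuming
placeholder address, port, and probe bytes (the real client speaks the
socket service's own protocol, so this only illustrates the measurement):

    import socket
    import time

    SERVER_IP = "192.168.1.50"  # placeholder: the address the client normally hits
    PORT = 211                  # placeholder: socket service port
    PROBE = b"..."              # placeholder: whatever the real client would send

    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(60)

    # Phase 1: how long does it take just to establish the TCP connection?
    t0 = time.time()
    s.connect((SERVER_IP, PORT))
    t1 = time.time()
    print(f"connect took {t1 - t0:.2f}s")

    # Phase 2: how long does the service take to answer once connected?
    s.sendall(PROBE)
    try:
        s.recv(4096)
    except socket.timeout:
        pass  # the placeholder probe may never get a reply; we only want timings
    t2 = time.time()
    print(f"request/response took {t2 - t1:.2f}s")
    s.close()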
It seems like the cluster is deciding that Node1 will take the
request, but for some reason there is a 30+ second delay before
the node responds to it. And the weird thing is, there have been
no software or hardware changes to either server recently; the
change in performance on Node1 appears to be spontaneous. I have
checked and double-checked that everything is set up identically
on both servers.
NOTE: if I make the same socket request to the Node1 server
through its non-cluster NIC, it responds instantly.
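
That difference shows up clearly if the same connect test is pointed at
each interface in turn; a sketch, with both addresses as placeholders:

    import socket
    import time

    # Placeholders: Node1 reached via the cluster address vs. its dedicated NIC
    ADDRESSES = {
        "Node1 via cluster IP": "192.168.1.50",
        "Node1 via dedicated IP": "192.168.2.11",
    }
    PORT = 211  # placeholder: socket service port

    for label, ip in ADDRESSES.items():
        start = time.time()
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            s.connect((ip, PORT))
            print(f"{label}: connected in {time.time() - start:.2f}s")
        except OSError as exc:
            print(f"{label}: connect failed ({exc})")
        finally:
            s.close()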
We have temporarily disabled Node1, as a 30-second delay on every
other request is not acceptable to our users (the delay was
quickly noticed, and complaints were swift and plentiful). While
the single server has more than enough resources to handle all of
the requests on its own with no noticeable effect on performance
for users, we still need to get Node1 back up, since the purpose
of implementing NLB was to provide server redundancy (instant
failover).
If you have heard of this before, or have any suggestions,
please let me know. Any help will be much appreciated.
Ryan Bruins
Programmer/Analyst
IT Applications Team
Deeley Harley-Davidson Canada