NLB - One Node Slowing Down


Ryan Bruins

I have an NLB (Network Load Balancing) cluster set up with
two Windows 2000 servers. Both servers have identical
hardware (2.2 GHz Xeon, 1 GB RAM, RAID 5, 2x 100 Mb NICs)
and software configurations. The software being run is
IIS 5 and the Borland Socket Service (used to call custom
COM+ RPC applications over the Internet).

The cluster ran great for about three months; then
suddenly one of the nodes (the first node in the cluster)
started acting strangely. The cluster balances requests
quite evenly, sending every other request to one or the
other server. The symptom we found is that every other
socket connection request was taking >100x longer than
normal (i.e. a query that would normally run in 1 to 2
seconds would take 30+ seconds on every other request).

We figured out that all of the requests that were taking
an extremely long time were going to Node1. The first
thing I checked was that there were no processes running
on Node1 taking up CPU or memory resources; there was
nothing unusual (under peak load, neither of the nodes
currently goes above 3% CPU or 25% physical memory). All
of the processes running on both servers appear identical.
I also double-checked that the database connections
(CA/400 ODBC) were set up identically, and had no problems
connecting from the console.

I initially thought it was a problem with my custom COM+
applications calling the queries, but then I watched the
Borland Socket Service monitor while I made requests to
the cluster and found something interesting (running the
test at a time when I was the only user). On Node2, the
requests made to the socket service would open a socket
almost instantly when a request was made; the COM+
application would process, send a response and close the
socket in about one second. On Node1, I would make a
request from a client, see a quick blip of traffic on the
cluster NIC on the Node1 server, and then it would sit
there with no traffic on that NIC for almost 30 seconds
before the traffic would start again and a socket
connection would appear on Node1. The COM+ application
would run just as fast as it did on Node2, but something
was delaying the connection of the socket.

It seems like the cluster is deciding that Node1 will take
the request, but for some reason there is a (30+ second)
delay before the node responds to it. And the weird thing
is, there have not been any software or hardware changes
to either of the servers recently; the change in
performance on Node1 appears to be spontaneous. I have
checked, and double-checked, that everything is set up
identically on both servers.

NOTE: if I make the same socket request to the Node1
server through the non-cluster NIC, it responds instantly.

We have temporarily disabled Node1, as a 30 second delay
on every other request is not acceptable to our users (the
delay was quickly noticed and complaints were swift and
plentiful). While the single server has more than enough
resources to handle all of the requests on its own with no
noticeable effect on performance for the users, we still
need to get the Node1 server back up, as the purpose of
implementing NLB was to provide server redundancy (instant
failover).

If you have heard of this before, or have any suggestions,
please let me know. Any help will be much appreciated.

Ryan Bruins
Programmer/Analyst
IT Applications Team
Deeley Harley-Davidson Canada
 
Ryan,

NLB is not application-aware, so it is very unlikely that it is responsible
for the application response time difference you are seeing. Here are
some ideas to get a handle on the cause:
* Make test requests to the virtual IP address locally on Node1 - this
bypasses the NLB driver completely since this is equivalent to
using localhost. If this request is slow then the problem is with the
application, IIS, the socket service, etc.
* Make test requests through a dedicated IP address configured on
the NLB adapter on Node 1 - running this request from a remote
client will traverse the NLB driver, but packets are passed up only
on Node1 and bypass the load-balancing logic. If local requests
are fast but requests to the dedicated IP are slow, I don't have a
theory to explain it...but then again I am not familiar with the
specifics of your situation.
If either of these tests has a long response time, then NLB is cleared of
responsibility for the slowdown through the virtual IP address. If they are
both slow, then we need to dig into this further. (A minimal timing sketch
for running these checks follows below.)
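
As a rough illustration of how to drive these checks, here is a minimal
timing sketch that opens and closes a TCP connection to each target and
reports the elapsed time. The port number and the addresses listed are
placeholders; substitute the cluster (virtual) IP, the dedicated IP on the
NLB adapter, and localhost as appropriate:

```python
# Minimal socket-timing sketch: connect to host:port, close the socket,
# and report how long the open/close round trip took in milliseconds.
# The addresses and port below are placeholders only.
import socket
import time

TARGETS = [
    ("127.0.0.1", 211),    # localhost - bypasses the NLB driver entirely
    ("192.0.2.10", 211),   # placeholder: dedicated IP on the NLB adapter
    ("192.0.2.100", 211),  # placeholder: cluster (virtual) IP
]

def time_connect(host, port, timeout=60.0):
    """Open a TCP connection, close it, and return elapsed milliseconds."""
    start = time.perf_counter()
    sock = socket.create_connection((host, port), timeout=timeout)
    sock.close()
    return (time.perf_counter() - start) * 1000.0

if __name__ == "__main__":
    for host, port in TARGETS:
        try:
            ms = time_connect(host, port)
            print(f"{host}:{port}  open/close took {ms:.0f} ms")
        except OSError as exc:
            print(f"{host}:{port}  failed: {exc}")
```

Running the same loop once locally on Node1 and once from a remote client
shows which of the two cases above applies.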

Post the results of your investigation if you are still stumped. I will
check back.

Cheers,
Chris

This posting is provided "AS IS" with no warranties, and confers no rights.

 
Ryan,

I really thought that local testing would clarify the situation. Instead it
has made it murky. I agree that NLB is still suspect based on the symptoms
described, though I am at a loss for a cause. Note that disabling NLB
(unbinding it from the adapter) is a very big change. From the UI it is
just a matter of clicking on a checkbox. Under the covers all of the
protocol bindings are torn down and rebuilt. So there is more going on than
meets the eye...the only point I want to stress is that this test alone
isn't sufficient to implicate NLB.

Regarding your notation, it appears that a "non-cluster NIC" test is a test
against an IP address bound to another adapter (one that NLB is not bound
to)? If so, note that NLB is not in the software path for handling these
packets. That further indicates that the problem lies elsewhere.

We are at the point where I need the NLB configuration for both nodes to
take this investigation any further. You can generate this output by running
the command 'wlbs display' from the command line on each host and saving the
output in a file. I also need to know which IPs you tested with so that I
can associate the results below with IPs in the configuration output. Do
you mind posting this information? Alternatively, we can take this
investigation offline to alleviate any privacy concerns you might have.
It's your call... I will contact you if you want to take it offline.
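
Just to make that concrete, a small sketch for capturing that output on each
host (assuming wlbs.exe is reachable on the PATH, as it normally is on a
Windows 2000 node with NLB installed; the output filename is an arbitrary
choice):

```python
# Sketch: run 'wlbs display' and save its output to a per-host text file.
# Assumes wlbs.exe is reachable on the PATH; the filename is arbitrary.
import socket
import subprocess

hostname = socket.gethostname()
outfile = f"nlb-config-{hostname}.txt"

result = subprocess.run(["wlbs", "display"], capture_output=True, text=True)
with open(outfile, "w") as f:
    f.write(result.stdout)

print(f"Saved NLB configuration for {hostname} to {outfile}")
```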

Don't give up and rebuild the box yet. We've made progress given the
results below.

Cheers,
Chris

This posting is provided "AS IS" with no warranties, and confers no rights.

--------------------
| Chris, thanks for the suggestion; I got some "interesting"
| results.
|
| What I ended up doing was creating a simple client program
| that only opened and closed a socket connection to one of
| the COM+ components on a user-specified IP address (no RPC
| or query is called or run) and measured how long it took
| to complete the open and close of the socket in
| milliseconds.
|
| I did several tests with this program in 6 different
| configurations:
|
| Test #1 From The Internet:
| Node1 Cluster NIC: 25000 to 40000 ms
| Node2 Cluster NIC: 60 to 70 ms
|
| Test #2 From Node2 Server (in the DMZ):
| Node2 Cluster NIC (cluster IP): 15 to 20 ms
| Node2 Non-Cluster NIC: 15 to 20 ms
| Node1 Non-Cluster NIC: 17000 to 40000 ms
|
| Test #3 From Node1 Server (in the DMZ):
| Node1 Cluster NIC (cluster IP): 15000 to 40000 ms
| Node1 Non-Cluster NIC: 17000 to 40000 ms
| Node2 Non-Cluster NIC: 15 to 20 ms
| So far it sounds like you are right,
| it does not seem to be NLB...BUT:
| Node1 Localhost (127.0.0.1): 15 to 20 ms
|
|
| Test #4 Disabled NLB on Node1's cluster NIC and set a non-
| cluster IP, connecting from Node1 in the DMZ:
| Node1 Cluster NIC (non-cluster IP): 15 to 20 ms
| Node1 Non-Cluster NIC: 15 to 20 ms
|
| Test #5 NLB still disabled on Node1, connecting from Node2
| in the DMZ:
| Node1 Cluster NIC (non-cluster IP): 15 to 20 ms
| Node1 Non-Cluster NIC: 15 to 20 ms
|
| Test #6 NLB re-enabled on Node1, connecting from Node1 in
| the DMZ:
| Node1 Cluster NIC (non-cluster IP): 15000 to 40000 ms
| Node1 Non-Cluster NIC: 15000 to 40000 ms
|
| So it appears to work fine when NLB is disabled. When NLB
| is enabled, the socket connections are delayed for both
| NICs on Node1.
|
| These are 'interesting' results, but I am still stumped
| about the cause of the problem. It seems even more likely
| that something about NLB on Node1 is causing the problem
| (as I suspected), but how or why?
|
| Thanks for your help; if you (or anyone else) have any
| further suggestions it would be appreciated. I'm at the
| point where the only thing I can think of doing is wiping
| the OS off of the server and re-installing Windows from
| scratch on it.
|
| Ryan Bruins
| Programmer/Analyst
| IT Applications Team
| Deeley Harley-Davidson Canada
 
I'd prefer not to post the network settings for our servers
online; if you want to continue helping via e-mail, send a
message to (e-mail address removed).

Knowledge Base article 222079 seems to support the idea
that NLB could affect all of the NICs on the server: "When
you install NLB, the intermediate driver is actually
installed on all network adapters, not just the network
adapter used to provide the service." "Although NLB is
implemented as a driver, it is installed as a service, and
services are system-wide entities."

NLB does use the second NIC for "heartbeat" messages
being sent between the two servers, so it seems reasonable
that disabling NLB will affect both NICs.

Thanks again for your help.

Ryan Bruins
 