ARP issues with Windows 2003 NLB

  • Thread starter: Shan McArthur

Shan McArthur

Hello,

I have been having problems setting up an NLB cluster. All cluster members
have a single network card and are configured in multicast mode. Each
machine has a dedicated IP address as well as a shared cluster address. The
problem is that, quite frequently, one cluster member cannot communicate
with another cluster member using its dedicated IP address. If I sniff the
network while pinging one member from another, I see an ARP request go out,
but no ARP reply. When I look at the ARP cache, the destination member
server is not in it. If I go to the destination server and ping the
original server's dedicated IP address, everything works. From then on,
network communication is possible between the cluster members using the
dedicated IP. This continues to work until the ARP cache entry expires, and
then communication is disrupted again.

Any ideas?

Thanks,
Shan McArthur
 
From the network sniff, can you check whether the destination machine
receives these ARP queries and replies to them? Some media don't support ARP
requests and responses. Do you see any other network device (a switch, for
example) that is filtering ARP? Do you see any proxy ARP happening? If none
of this applies, it is possibly a bug.
Thanks,
 
I can see my ARP request going out, but no reply. The problem is only with
connecting to other cluster members; connecting with other network devices
is working perfectly. There is no ARP filtering, and all network
communication is functioning perfectly with the exception of two cluster
members talking to each other using their dedicated host IP address.

I have also determined that the ARP request that is going out is using the
WRONG source IP address. The ARP request goes out with the "Sender's
Protocol Address" being the shared cluster address. I presume that when
the other cluster machine receives and processes the ARP request, it
detects that the sender address is its own (because the request carries the
cluster address) and does not respond. The end result is that no ARP
response is sent, and all IP communication is interrupted between the two
machines. If I go to the other machine and ping the first machine, the
first machine adds the MAC address to its ARP cache, and then all
communication works until the ARP cache entry expires.

I am very confident that this is a bug in the Windows 2003 NLB
implementation. I have set the IP addresses up in the correct order, and
have configured NLB with the correct dedicated IP address. All other IP
communication goes out with the dedicated host IP address as the sender, but
ARP requests go out with the wrong "Sender's Protocol Address".
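
For anyone checking their own capture for this symptom, here is a minimal
sketch of how the "Sender's Protocol Address" could be read out of a raw ARP
payload and compared against the dedicated IP. It assumes the standard
28-byte ARP layout for IPv4 over Ethernet; the function names and the
addresses (192.0.2.x documentation addresses) are hypothetical and are not
taken from the traces discussed here.

```python
import socket
import struct

# Standard ARP layout for IPv4 over Ethernet: htype, ptype, hlen, plen,
# operation, sender MAC, sender IP, target MAC, target IP (28 bytes).
ARP_FORMAT = "!HHBBH6s4s6s4s"
ARP_REQUEST = 1


def build_arp_request(sender_mac: bytes, sender_ip: str, target_ip: str) -> bytes:
    """Build an ARP request payload (hypothetical helper, for testing only)."""
    return struct.pack(
        ARP_FORMAT,
        1,            # hardware type: Ethernet
        0x0800,       # protocol type: IPv4
        6,            # hardware address length
        4,            # protocol address length
        ARP_REQUEST,
        sender_mac,
        socket.inet_aton(sender_ip),
        b"\x00" * 6,  # target MAC is unknown in a request
        socket.inet_aton(target_ip),
    )


def parse_arp(payload: bytes) -> dict:
    """Decode the fields of an ARP payload that matter for this check."""
    if len(payload) < 28:
        raise ValueError("expected at least 28 bytes of ARP payload")
    _, _, _, _, oper, sha, spa, _, tpa = struct.unpack(ARP_FORMAT, payload[:28])
    return {
        "operation": oper,
        "sender_mac": sha.hex(":"),
        "sender_ip": socket.inet_ntoa(spa),
        "target_ip": socket.inet_ntoa(tpa),
    }


def uses_dedicated_ip(payload: bytes, dedicated_ip: str) -> bool:
    """True if the ARP sender protocol address is the dedicated IP."""
    return parse_arp(payload)["sender_ip"] == dedicated_ip


if __name__ == "__main__":
    # Hypothetical addresses: .10 is the dedicated IP, .100 the cluster IP.
    request = build_arp_request(bytes.fromhex("020000000001"),
                                "192.0.2.100", "192.0.2.11")
    print(parse_arp(request))
    print("sender is dedicated IP:", uses_dedicated_ip(request, "192.0.2.10"))
```

A request that carries the cluster address as its sender, as described
above, would make `uses_dedicated_ip` return False.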

I can send you traces; just send me an email and tell me where you want them
sent.

Thanks,
Shan McArthur
 
Hi Shan,

The ARP is going to use the first bound IP address on that NIC. This is a
configuration issue. You need to make sure that on both NLB servers the
dedicated IP address is listed as the first bound IP address in the TCP/IP
properties.

Thank you,

Alan Wood[MSFT]

This posting is provided "AS IS" with no warranties, and confers no rights.
 
The dedicated IP address is the address that is listed in the TCP/IP
properties page, and in the advanced properties it is the first in the list.
When normal IP packets go out, the dedicated IP is used for the sender IP
address, HOWEVER in ARP packets, the cluster address is used. When I type
IPCONFIG, the cluster address is listed first. In the registry under
tcpip/Parameters/Interfaces/{guid}, the dedicated IP address is the first in
the list of IPAddress.

What else can I do to make the dedicated IP address the primary address? I
am not aware of any other way to influence the order of the addresses. I
don't understand why the cluster address is listed first in the IPCONFIG
results when the dedicated address is first in every other list.

Thanks,
Shan McArthur
 
I have resolved the problem, but I am disturbed by the method I had to use:
I ended up rebooting the server, switching the IP address order, rebooting
again, removing the cluster IP, rebooting again, switching back on the basic
network properties tab, then going to the advanced tab and setting the
secondary.

Finally, my ARP requests are going out with the appropriate source address
and network communication is functioning properly. That said, the cluster
address is still listed first in ipconfig, but I am not going to complain
now that I have this all working.

Microsoft - you probably have a bug somewhere in the NLB code that is
affecting ARPs; this needs to be fixed. Ideally, there should be an easy
way to reorder IP addresses, or to specify the primary address other than
the current mechanism.

Shan
 
Hi Shan,
More likely an issue with the network card you are using. Our IP stack
is designed, as I stated, to use the first bound IP address in the ARP
replies as well. Also note that IPCONFIG parses and displays the
IP address list in reverse order. So, when using IPCONFIG to view the
bindings of the IPs, you have to look from the bottom up.
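
Taking the reverse-order description above at face value, the bottom-up
reading can be sketched as follows. The function names and the addresses are
hypothetical, and the input is simply the address list in the order
IPCONFIG prints it; this is an illustration of the rule, not a parser for
real IPCONFIG output.

```python
def bind_order_from_ipconfig(displayed):
    """Return addresses in bind order, assuming IPCONFIG lists them reversed."""
    return list(reversed(displayed))


def first_bound_address(displayed):
    """Return the first bound address: the LAST one IPCONFIG displays."""
    if not displayed:
        raise ValueError("no addresses displayed")
    return displayed[-1]


if __name__ == "__main__":
    # Hypothetical display order: cluster IP on top, dedicated IP below it.
    shown = ["192.0.2.100", "192.0.2.10"]
    print("bind order:", bind_order_from_ipconfig(shown))
    print("first bound:", first_bound_address(shown))
```

Under that assumption, a cluster address shown first by IPCONFIG is
consistent with the dedicated address being the first bound one.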


Thank you,

Alan Wood[MSFT]

This posting is provided "AS IS" with no warranties, and confers no rights.
 
I used an Intel Pro 1000 network card that has a Windows 2003 Server logo
certification....

I am chalking this up to something spooky related to manually reconfiguring
NLB multiple times without rebooting. It seems to be working now.

Shan
 