DHCP Broken!

Brian Heil

We have been seeing some strange problems with Symantec's Ghost 8.x
Enterprise server and after some extensive testing have come to the
conclusion that both the Windows 2000 and 2003 Server DHCP servers are
at fault.
Our situation is this:
Computer lab with 33 workstations (Dell Optiplex GX270s, on-board Intel
NIC, running Windows XP SP1)
Cisco Catalyst 4006 switch
Windows 2000 DHCP Server - essentially out of the box implementation,
running on a Dell PowerEdge 2500 with dual 1.4 GHz PIIIs, and a gig of
memory. We serve around 700 workstations in the entire building
(there are 760 available addresses and during peak use times, we have
roughly 100-150 addresses un-leased)

What we see with Ghost console sessions in this particular lab (and 2
more identical labs) is the random failure of 15-20% of the clients.
The clients report not being able to find the DHCP server.
Looking at some packet traces we see clients often received different
IP addresses with subsequent DISCOVERs. As many as 4 different IP
addresses were assigned to the same client during a ghost session.
Very often this was due to the clients detecting IP conflicts with
other clients in the lab and DECLINING the offered IP address.
Clients exhibiting this behavior fail with the 'DHCP server not found'
message.

After testing and eliminating wiring as cause, we isolated the room
and set up a non-windows DHCP server (same scope configuration).
Ghost sessions in this setup achieved 100% success rates.
Substituting an out of the box Windows 2000 DHCP server we again saw
the same percentage of client failures. Repeating the test with an
out of the box Windows 2003 DHCP server we saw the same failures
again.
I haven't done a thorough audit of the Windows XP event logs for these
machines, but a spot check revealed that Windows clients were
declining addresses, and seeing IP address conflicts. Eventually
these workstations did get a valid IP address.

It seems like the DHCP server isn't keeping client information or
keeping up with client state changes properly.
Anyone have any clue why the Windows 200x DHCP servers would behave
this way? And more importantly is there a fix, short of replacing our
Windows based DHCP server with a Unix flavor?

Thanks!
 
Brian Heil said:
We have been seeing some strange problems with Symantec's Ghost 8.x
Enterprise server and after some extensive testing have come to the
conclusion that both the Windows 2000 and 2003 Server DHCP servers are
at fault.

While it is possible, you really should direct your
primary concern at the Ghost OR at the DHCP
CONFIGURATION and/or the network hardware.

That is, DHCP works fine for most people.
What we see with Ghost console sessions in this particular lab (and 2
more identical labs) is the random failure of 15-20% of the clients.
The clients report not being able to find the DHCP server.

What are your switch/router hardware connections like?

Usually such problems are a failure of the device to
relay the broadcasts properly....
Looking at some packet traces we see clients often received different
IP addresses with subsequent DISCOVERs. As many as 4 different IP
addresses were assigned to the same client during a ghost session.

How long does a session last? What are your lease periods?

IF the leases are not short (relative to the session) then
it is not likely a DHCP issue but rather something on the
machines RESETTING the NIC, or RELEASING and
renewing the address configuration.
Very often this was due to the clients detecting IP conflicts with
other clients in the lab and DECLINING the offered IP address.

Do you by chance have multiple DHCP servers offering
addresses on the same subnet?

If so, did you (incorrectly) SPLIT the scope addresses
so that they do NOT overlap?

If so, make them overlap and EXCLUDE a portion of the
addresses from each server -- otherwise DHCP-A might
NAK for DHCP-B renewals etc.
Clients exhibiting this behavior fail with the 'DHCP server not found'
message.

That sounds like hardware or router again....

Same subnet? Are the requests (broadcasts) being
forwarded properly to the DHCP Server?
Our situation is this:
Computer lab with 33 workstations (Dell Optiplex GX270s, on-board Intel
NIC, running Windows XP SP1)
Cisco Catalyst 4006 switch
Windows 2000 DHCP Server - essentially out of the box implementation,
running on a Dell PowerEdge 2500 with dual 1.4 GHz PIIIs, and a gig of
memory. We serve around 700 workstations in the entire building
(there are 760 available addresses and during peak use times, we have
roughly 100-150 addresses un-leased)

After testing and eliminating wiring as cause, we isolated the room
and set up a non-windows DHCP server (same scope configuration).
Ghost sessions in this setup achieved 100% success rates.
Substituting an out of the box Windows 2000 DHCP server we again saw
the same percentage of client failures. Repeating the test with an
out of the box Windows 2003 DHCP server we saw the same failures
again.
I haven't done a thorough audit of the Windows XP event logs for these
machines, but a spot check revealed that Windows clients were
declining addresses, and seeing IP address conflicts. Eventually
these workstations did get a valid IP address.

There was such a bug back in WinNT 4.0 but it was long
since fixed and depended on booting a LOT (16/32) of
machines on a gang switch at precisely the same time.
It seems like the DHCP server isn't keeping client information or
keeping up with client state changes properly.

A DHCP server is free to give a client a new address on renewal.

I don't see any indication that you have multiple DHCP servers
but that seems an easy thing to check and if so look to OVERLAP
addresses in the scope with exclusions (to avoid duplicates).
Anyone have any clue why the Windows 200x DHCP servers would behave
this way? And more importantly is there a fix, short of replacing our
Windows based DHCP server with a Unix flavor?

It works fine for (pretty much) everyone else so look to
fixing something else. My vote is the hubs/routers/switches
if any or something the clients are doing.

If it remains a hard problem then I would (we did for that
NT bug which I personally discovered while consulting
for Dell's test lab) put a Sniffer, NetMon, etc. on the line
and watch the DHCP conversations.
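
For anyone who wants to do that from an exported trace rather than by eye, here is a rough Python sketch of the kind of tally that helps. It assumes the capture has already been reduced to (client MAC, message type, IP) tuples; the sample events, MAC addresses and IP addresses below are made up for illustration and are not from the traces discussed in this thread.

```python
from collections import defaultdict

# Hypothetical events in capture order: (client MAC, DHCP message, IP).
# Real input would come from a NetMon or sniffer export.
events = [
    ("00:11:22:33:44:01", "DISCOVER", None),
    ("00:11:22:33:44:01", "OFFER", "10.0.0.21"),
    ("00:11:22:33:44:01", "REQUEST", "10.0.0.21"),
    ("00:11:22:33:44:01", "ACK", "10.0.0.21"),
    ("00:11:22:33:44:01", "DECLINE", "10.0.0.21"),
    ("00:11:22:33:44:01", "DISCOVER", None),
    ("00:11:22:33:44:01", "OFFER", "10.0.0.37"),
    ("00:11:22:33:44:02", "DISCOVER", None),
    ("00:11:22:33:44:02", "OFFER", "10.0.0.22"),
    ("00:11:22:33:44:02", "REQUEST", "10.0.0.22"),
    ("00:11:22:33:44:02", "ACK", "10.0.0.22"),
]

def summarize(events):
    """Collect distinct offered addresses and DECLINE counts per client."""
    offered = defaultdict(list)
    declines = defaultdict(int)
    for mac, msg, ip in events:
        if msg == "OFFER" and ip not in offered[mac]:
            offered[mac].append(ip)
        elif msg == "DECLINE":
            declines[mac] += 1
    return offered, declines

def suspects(events):
    """Clients that were offered more than one address or sent a DECLINE."""
    offered, declines = summarize(events)
    macs = set(offered) | set(declines)
    return sorted(m for m in macs if len(offered[m]) > 1 or declines[m] > 0)

offered, declines = summarize(events)
for mac in suspects(events):
    print(mac, "offered:", offered[mac], "declines:", declines[mac])
```

Clients that show up in this list are the ones whose full conversations are worth reading packet by packet.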
 
Oh, and the NT4 DHCP bug was on the CLIENTS,
so switching servers would not have helped even
then.
 
Herb Martin said:
If so, did you (incorrectly) SPLIT the scope addresses
so that they do NOT overlap?

If so, make them overlap and EXCLUDE a portion of the
addresses from each server -- otherwise DHCP-A might
NAK for DHCP-B renewals etc.

I like reading your posts and learn from them,...but could you explain what
you mean here? I suspect it is the same thing I tell people, but I'm just
not sure what you mean by it.
 
Phillip Windell said:
I like reading your posts and learn from them,...but could you explain what
you mean here? I suspect it is the same thing I tell people, but I'm just
not sure what you mean by it.

Thank you and certainly -- I didn't give the long
explanation since he may not even have the situation
much less that particular problem.

If multiple DHCP servers have scopes configured for
the same SUBNET we used to teach and configure that
the two scopes should NOT be overlapped.

That is wrong.

The two scopes should FULLY overlap the range of IP
addresses that each server considers to be within the
scope.

This of course would cause duplicate IP assignments
were the servers to actually distribute the same addresses,
so we must also exclude the complementary portions of
the addresses from each server, so that neither server
distributes the same addresses as the other.

We do it this way because having the addresses IN THE
SCOPE (even though excluded) tells that server NOT
to NAK renewals and requests from clients who receive
an address from the 'other' DHCP server.

Otherwise one server will possibly NAK (negatively
acknowledge) clients of the other server before the
server which actually distributes that address can
acknowledge or confirm the lease (renewal.)

It usually doesn't cause a catastrophic failure since the
clients will go back to discovering, requesting etc.

But it does cause unnecessary delays, extra network
traffic, and more frequent client address changes.
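
To make the overlap-plus-exclusion layout concrete, here is a small Python sketch that works out the complementary exclusion ranges for two servers sharing one scope. The address range and the 80/20 split are assumptions picked for the example, and `split_scope_exclusions` is a hypothetical helper, not part of any DHCP tool.

```python
import ipaddress

def split_scope_exclusions(first, last, percent_a=50):
    """Return the exclusion range for each of two servers sharing one scope.

    Both servers define the same full range first..last. Server A leases
    the lower part and excludes the upper part; server B does the reverse.
    """
    start = int(ipaddress.IPv4Address(first))
    end = int(ipaddress.IPv4Address(last))
    total = end - start + 1
    count_a = total * percent_a // 100      # addresses server A may lease
    if not 0 < count_a < total:
        raise ValueError("each server needs at least one address")
    boundary = start + count_a              # first address leased by server B
    ip = ipaddress.IPv4Address
    return {
        "A": (str(ip(boundary)), str(ip(end))),        # A excludes B's part
        "B": (str(ip(start)), str(ip(boundary - 1))),  # B excludes A's part
    }

# Hypothetical scope: 200 addresses, 80% leased by A and 20% by B.
exclusions = split_scope_exclusions("192.168.0.10", "192.168.0.209", percent_a=80)
for server, (low, high) in exclusions.items():
    print(f"server {server}: scope 192.168.0.10-192.168.0.209, exclude {low}-{high}")
```

Each server sees every address as in scope, so neither one NAKs the other's clients, while the exclusions keep the two leased pools from colliding.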
 
Herb Martin said:
Thank you and certainly -- I didn't give the long
explanation since he may not even have the situation
much less that particular problem.

------ snip to save space----------

Ok. Yes that is exactly what I always tell people. I always say configure
the scopes in each server identically and use Exclusions to prevent
duplicate addresses being given out.
 
Ok. Yes that is exactly what I always tell people. I always say configure
the scopes in each server identically and use Exclusions to prevent
duplicate addresses being given out.

Correct.

And you almost always have to explain why because
it both sounds goofy and disagrees with advice that
was usually given circa NT 4 and earlier.
 
While it is possible, you really should direct your
primary concern at the Ghost OR at the DHCP
CONFIGURATION and/or the network hardware.

I guess that's what I'm asking! What stuff might we need to reconfigure on the
DHCP server that would alleviate the problem?
That is, DHCP works fine for most people.

And for our Windows clients it appears to, but I suspect the way they are used
might mask what is really going on.
What are your switch/router hardware connections like?

Not sure exactly what you are asking here. There is fiber from the campus
backbone, to our switches, which are then 100BaseT (CAT5) to the workstations.
Usually such problems are a failure of the device to
relay the broadcasts properly....


How long does a session last? What are your lease periods?
The sessions (at least up to the point where the clients should be connected)
are about 5 minutes. A session goes something like this:
- console sends command to reboot to DOS client
- DOS client reboots and requests IP - sometimes several times, and does not get
the same IP number on subsequent requests (or in fact the same as the IP it had
while booted to Windows)
- DOS client downloads virtual partition
- DOS client reboots and requests IP - again sometimes more than once, and does
not get the same IP on subsequent requests

Leases are 4 hours. The scope is a supernet of 3 class c subnets (netmask
255.255.252.0)
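
As a quick sanity check on that mask, here is a short Python sketch of the arithmetic. The 10.0.0.0 base address is a placeholder assumption, since the real network address is not given here. Note that 255.255.252.0 is a /22, which spans four class C sized blocks, so a scope built from three of them would be using only part of that range.

```python
import ipaddress

# Placeholder network address; only the mask comes from the description above.
net = ipaddress.ip_network("10.0.0.0/255.255.252.0")

prefix = net.prefixlen                         # prefix length of the mask
blocks = list(net.subnets(new_prefix=24))      # class C sized pieces
total = net.num_addresses                      # all addresses in the range

print(f"prefix length: /{prefix}")
print(f"/24 blocks covered: {len(blocks)}")
print(f"addresses in range: {total}")
```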
IF the leases are not short (relative to the session) then
it is not likely a DHCP issue but rather something on the
machines RESETTING the NIC, or RELEASING and
renewing the address configuration.


Do you by chance have multiple DHCP servers offering
addresses on the same subnet?

This one I can definitely answer no to as we had an isolated situation with
known equipment (the ghost server, a DHCP server, a switch, the packet sniffer,
and the clients).
Same subnet? Are the requests (broadcasts) being
forwarded properly to the DHCP Server?
As far as we can see the broadcasts are making it to the server.
It works fine for (pretty much) everyone else so look to
fixing something else. My vote is the hubs/routers/switches
if any or something the clients are doing.
This was my feeling too, but if it's not the Windows DHCP server specifically
broken, then it seems to be an interaction with the switch that the ISC DHCP
server doesn't exhibit.

The tests we ran used identical equipment except for the DHCP servers. Windows
2000, Windows 2003, and the latest stable release from ISC running on a Linux
distribution. The only times we saw strange DHCP behavior was when we had the
Windows DHCP server in the mix.
 
Brian Heil said:
I guess that's what I'm asking! What stuff might we need to reconfigure on the
DHCP server that would alleviate the problem?

Probably nothing on the server if it is a single
server unless it is giving out addresses already
assigned manually, by another server, RAS server,
etc.
Not sure exactly what you are asking here. There is fiber from the campus
backbone, to our switches, which are then 100BaseT (CAT5) to the
workstations.

That you have switches -- do they separate the troublesome
DHCP client(s) and the DHCP server?

Are you satisfied they are either relaying or passing the
DHCP broadcasts?

(NetMon, Sniffer etc if necessary)

The sessions (at least up to the point where the clients should be connected)
are about 5 minutes. A session goes something like this:
- console sends command to reboot to DOS client
- DOS client reboots and requests IP - sometimes several times, and does not get
the same IP number on subsequent requests (or in fact the same as the IP it had
while booted to Windows)

And that is perfectly normal. DHCP makes NO guarantee
that you will get the same IP on subsequent requests nor
even on a renewal, although getting the same one is common.
- DOS client downloads virtual partition
- DOS client reboots and requests IP - again sometimes more than once, and does
not get the same IP on subsequent requests

Leases are 4 hours. The scope is a supernet of 3 class c subnets (netmask
255.255.252.0)

So it is not a discrepancy between lease and session,
but rather reboots that show this?

That sounds (reasonably) irrelevant to the problem.
This one I can definitely answer no to as we had an isolated situation with
known equipment (the ghost server, a DHCP server, a switch, the packet sniffer,
and the clients).

What about misconfigured manual clients, using
addresses assigned by DHCP?

Even if those clients are rogues etc....
As far as we can see the broadcasts are making it to the server.
This was my feeling too, but if it's not the Windows DHCP server specifically
broken, then it seems to be an interaction with the switch that the ISC DHCP
server doesn't exhibit.

? hmm.
The tests we ran used identical equipment except for the DHCP servers. Windows
2000, Windows 2003, and the latest stable release from ISC running on a Linux
distribution. The only times we saw strange DHCP behavior was when we had the
Windows DHCP server in the mix.

Can you confirm that the address range was precisely the
same and that the other DHCP server is on the exact
same connection of the switches? If not, then I am still
voting for the Switch.

Else...

Ok, at that point we are about where I was in the Dell lab.

The NetMon traces showed that other problem clearly (but that
was Windows clients and fixed long ago.)
 