W2K + IIS Loses Network Connection when Under Load


David Morgan

Hello

This is a bit of a long one, but I would appreciate it if you would stick
with me and offer any advice you might consider useful.

We have a W2K Server being used purely as a web/database server using IIS
and SQL 2K. The server is fully patched. It is an IBM dual Xeon 2.4GHz with
1GB RAM and plenty of disk in a RAID 1 and RAID 5 configuration. There are two
websites running on the machine, one of which is quite busy, ~15,000
visitors per day. The website consists of ASPs, Images and Movies.

The server has an Intel dual port 100Mbit NIC installed. The server is
hosted at an ISP and is protected by their firewall. The server is based in
the Netherlands while the target audience is in the UK.

As the number of visitors to the website started to grow we started to have
problems receiving pages that output a lot of HTML. The output would start
to be received and then stop. If you hit the stop button, partial content
would be displayed. The amount of content received always varied. ASPs with
small output continue to work fine, and often hitting refresh would allow
you to then receive the whole output from large files. Accessing movie
files over HTTP or FTP would show the same problem; for example,
downloading a movie file via HTTP would get to two percent and then stop.
Hitting stop and start in WMP controls would often allow you to receive a
bit more and so on through to completion. Similarly, when receiving files
over FTP the content would just stop coming.
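For anyone wanting to reproduce the stalls, here is a minimal sketch in Python of how they could be timed from a client. It is not something we ran at the time. The function names, chunk size and thresholds are all assumptions made for illustration. It records when each chunk of a download arrives and then reports the gaps longer than a threshold.

```python
import http.client
import time
import urllib.request


def find_stalls(arrivals, threshold=5.0):
    """Return (start, gap) pairs where consecutive chunk arrival
    times are more than `threshold` seconds apart."""
    stalls = []
    for prev, cur in zip(arrivals, arrivals[1:]):
        gap = cur - prev
        if gap > threshold:
            stalls.append((prev, gap))
    return stalls


def timed_download(url, chunk_size=16384, timeout=60.0):
    """Download `url` and return (bytes_received, arrivals, error).

    `arrivals` holds seconds since the request started, one entry per
    chunk received. `error` is None on a clean finish, otherwise a
    string describing the failure (for example a read timeout when the
    content stops coming).
    """
    start = time.monotonic()
    arrivals = [0.0]
    received = 0
    error = None
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            while True:
                chunk = resp.read(chunk_size)
                if not chunk:
                    break
                received += len(chunk)
                arrivals.append(time.monotonic() - start)
    except (OSError, http.client.HTTPException) as exc:
        error = repr(exc)
    return received, arrivals, error


# Hypothetical usage (the URL is a placeholder, not our real site):
#   received, arrivals, error = timed_download("http://example.com/movie.wmv")
#   print(received, error, find_stalls(arrivals, threshold=5.0))
```

Running this from a UK connection and from a host in the Netherlands at the same time would show whether the stalls really are limited to one side.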

We did a fair amount of network analysis on this problem and found nothing
untoward. Everything appeared to be ok, nothing weird logged in Event Logs,
IIS or in firewall logs.

I know we're not allowed to go on hunches, but I think the problem is worse
when the server is under more network load, as overnight the problematic
pages would often download first time. The problem has _never_ affected our
ISP or the other people they have asked to test it (most likely based in
the Netherlands). It seems to affect only people in the UK, on all
networks: BT, NTL etc.

Using PingPlotter (www.pingplotter.com), we could see that there were no
real problems getting to the server.

So, suspecting some sort of IIS corruption we re-installed the machine from
scratch, formatted both logical drives, re-installed W2K, SQL2K etc. The
problem still occurred.

Suspecting then that the Dual Port network card was too clever for its own
good, we swapped it for a standard 3Com card. The problem still occurred.

We used Performance Monitor to monitor some things, but as envisaged the
server didn't appear to be under any great load and the disk queues and
pages/sec were ok. Thinking that maybe the server was a bit low on RAM, SQL
Server was consuming 800Mb of 1Gb, we decided to move SQL Server and the
database on to another machine connected via one of the Dual Port
interfaces. This, believe it or not, has made the problem worse.

As a last resort, we have plugged the unused Dual Port interface directly
into our ISP's switch, bypassing the firewall. So to summarise, we have a
connection to the Internet via a firewall using a 3Com card, a direct
crossover connection to a SQL Server using one of the ports of the dual port
card, and the other dual port interface connected directly to the Internet
with no firewall.

Now this is where it gets interesting. We immediately switched the URL for
movie downloads to the non-firewall internet connection and everything was
rosy, movies could finally be downloaded by everyone to completion first
time.

So we decided to change the DNS for our busy website to another IP Address
configured on the non-firewall internet connection. As the DNS change
started to take effect, we noticed the site kept going down and then coming
back up again. Sure enough, PingPlotter was telling us that every 5 minutes
or so, the server's network card was not responding for about 1 minute. The
interval between outages increased as we got later into the night, then in
the morning, started to become more frequent.
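The pattern of outages above could also be logged with a small script rather than read off PingPlotter by eye. The following is only a sketch under assumptions: the host, port and probe interval are hypothetical, and it tests a TCP connect to the web port rather than ICMP ping, so it measures something slightly different from what PingPlotter showed us.

```python
import socket


def probe(host, port=80, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def summarise_outages(samples):
    """Collapse (timestamp, ok) samples into (start, duration) outages.

    An outage starts at the first failed sample and ends at the next
    successful one. An outage still open at the end of the data is
    reported up to the last sample seen.
    """
    outages = []
    down_since = None
    for ts, ok in samples:
        if not ok and down_since is None:
            down_since = ts
        elif ok and down_since is not None:
            outages.append((down_since, ts - down_since))
            down_since = None
    if down_since is not None:
        outages.append((down_since, samples[-1][0] - down_since))
    return outages


# Hypothetical usage, probing every 10 seconds for an hour:
#   import time
#   samples = []
#   for _ in range(360):
#       samples.append((time.time(), probe("www.example.com")))
#       time.sleep(10)
#   print(summarise_outages(samples))
```

Running the same probe from a host on the server's subnet and from outside would put numbers on the difference between the two views.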

Needless to say we changed the DNS back to point our busy website's domain
at the dual port interface via the firewall. People are still downloading
Movies ok and we're still having periodic outages on the non-firewall
interface but only 2 or 3 per day.

What's interesting about the outages on the non-firewall interface is that
another host on that same subnet continued to have successful pings even
though people accessing via the Internet did not.

People accessing the website are still having problems with ASP files that
output large amounts of HTML and they're still refreshing to try and get the
whole file which sometimes works.

Unsurprisingly I have a couple of questions.

Has anyone experienced anything like this before?

Does anyone suspect, as I do, that the two different symptoms (via
firewall and not via firewall) could actually be the same problem,
behaving differently because of the firewall or something?

Whilst the usage of CPU, Memory etc. on the server seems to be quite low,
could the networking subsystem be having a problem that is not being
reported anywhere? Could moving SQL Server on to a different machine and
then connecting via network have accentuated the problem as the symptoms
suggest?

Are we expecting too much out of our server? Here are our usage graphs, which
may mean more to you than me, but you will see the drop in usage as we have
switched to the other NIC interface and back again.
http://cobalt01.open-doors.nl/01/81.23.232.254_24.html

I should point out that our ISP is managing many hosts all with high
bandwidth utilisation including a very popular dating site and some audio
streaming sites. No problems of this nature have been reported to them other
than this one which they have finally given up on.

If I phone IBM or Microsoft surely they're just going to play one off
against the other... where do I go next?

Many thanks for reading this.

David M
 
Some people I have been speaking to have suggested that there is a problem
with the TCP/IP stack.

The fact that the host on the same subnet can still reach the server during
an outage for the rest of us implies that the server cannot talk to the
gateway.

They allege that this kind of thing has been very much improved in Windows
2003, but of course I need evidence of this and of whether the problem is
TCP related.

Does anyone know what Performance Monitor counters I should be using to trap
problems of this nature with TCP?

Thanks

David
 