file system index corruption problem

ajsmith9870

Windows 2000 server (SP4) -

HP Proliant ML370
Smart array 6400 controller
2 x 36 GB 15k SCSI drives, mirrored
Dual Xeon processors
4 GB RAM
Sophos AV
Around 30 concurrent users

The server is running in terminal services application mode and has
Citrix Presentation Server 4.0 (Standard) installed.

I have been experiencing a problem for months that I have been
struggling to get to the bottom of. At first the server began to
bluescreen daily. I changed the processors one at a time, and then
swapped out all the memory. Then, when we moved from Symantec AV to
Sophos the bluescreens stopped, but another problem arose. Symptoms
are that the server either stops accepting new logons, fails to load
users' default profiles at logon, or reports a virtual memory problem
for users already logged on. Any one of these, or any combination, can
occur. A reboot always sorts the problem out, but only for half a day
to a day. In the end I discovered that running chkdsk /f every night
would pretty much allow a full working day of uptime.

I think I now know what the problem is but am unable to get to the
cause. I noticed that when the server begins to exhibit the problems
mentioned above, running chkdsk cleans up index entries in the file
system and more often than not we can continue to work without a reboot
and chkdsk /f. I have changed the array controller and one of the
disks from the mirrored set but this has not fixed the problem. I have
now noticed that every time I run chkdsk it always deletes index
entries in Index $I30. That discovery led me to this article
http://support.microsoft.com/kb/885871 which got me questioning whether
the problem was a Windows 2000 issue rather than hardware.

Has anybody experienced similar problems / can anyone give me some
advice on how to troubleshoot before I spend any more money on
hardware?

Many thanks in advance,

Anthony
 

I have experienced somewhat similar problems - they were caused
by a corruption of the file system after a crash. Chkdsk found lots
of errors and fixed them but the problem persisted. Rebuilding the
file system fixed the problems permanently.

Here is what I would do:
1. Temporarily install an 80 GByte IDE disk.
2. Partition & format it the same way as your RAID array.
3. Boot the machine with a Bart PE boot CD.
4. Use xcopy.exe to copy the existing disk to the new disk.
Make sure to copy system files, hidden files, attributes and ACLs.
5. Disconnect the RAID array.
6. Boot the machine with the IDE disk.
7. Run it like this for a week.
8. If all is well, reverse the process, starting by repartitioning
and formatting the RAID array.

This is a zero-risk method: Nothing is lost if the IDE disk won't work.
 
Hi Pegasus,

Thanks for your post. As you say it's a zero risk option, I'll give it
a try in the next couple of days.

Anthony
 
Update on the problem.

I struggled to make a bootable copy of the system due to the hardware
config (RAID 1 mirrored disks, hardware RAID), and so I have rebuilt
the server from scratch using a new set of hard drives, also a zero
risk option as I had the old drives to fall back on. Everything seemed
peachy for a day or two, but the corruption has now appeared again.
Example of chkdsk output -

Deleting index entry Default Outlook Profile.NK2 in index $I30 of file 53431.
Deleting index entry Default Outlook Profile.xml in index $I30 of file 53431.
Deleting index entry DEFAUL~1.NK2 in index $I30 of file 53431.
Deleting index entry DEFAUL~1.XML in index $I30 of file 53431.
Deleting index entry OutlPrnt in index $I30 of file 53431.
Index verification completed.

Errors found. CHKDSK cannot continue in read-only mode.

There is much more. The affected entries change each time I run chkdsk,
but the corruption is always in index $I30 and always involves
temporary files, so I think this is enough to give a flavour.
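
Since the list of deleted entries changes on every run, it may help to tally it rather than read it by eye. Below is a minimal Python sketch that counts "Deleting index entry" messages per file number and per index name. It assumes the chkdsk output has been saved to a text file and that the messages follow the wording shown above; the function name `summarize_deleted_entries` is made up for illustration and is not part of chkdsk.

```python
import re
from collections import Counter

# Pattern for one chkdsk message as quoted in this thread (assumed wording).
ENTRY_RE = re.compile(
    r"Deleting index entry (?P<name>.+?) in index (?P<index>\$\w+)"
    r" of file (?P<file>\d+)\."
)


def summarize_deleted_entries(chkdsk_text):
    """Return (per_file, per_index) Counters of deleted index entries."""
    # Collapse all whitespace so messages that wrapped across lines
    # (as they do when pasted into a post) still match the pattern.
    flat = " ".join(chkdsk_text.split())
    per_file = Counter()
    per_index = Counter()
    for match in ENTRY_RE.finditer(flat):
        per_file[int(match.group("file"))] += 1
        per_index[match.group("index")] += 1
    return per_file, per_index
```

Feeding each night's saved output through this and comparing the per-file counts would show whether the same file numbers (that is, the same directories) are hit every time or whether the damage moves around the volume.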

So far I have changed processors, RAID controller, and hard drives and
done a complete system rebuild. I found the following article, but I
am unsure if this is the cause of my trouble -

http://support.microsoft.com/kb/885871

If it is then I guess I'm pretty much sc**wed. Our main data storage
server is a Proliant DL380 with an MSA30 attached, running Windows
Server 2003 Appliance Edition. Specs for the server with the problem
are listed in the first post. I wonder whether the only option I have
left now is to move to Windows Server 2003, something I was trying to
avoid due to the TS CAL licensing requirements.

ANY ideas welcome at this stage.

Anthony
 

I tried to duplicate the phenomenon described in KB885871 on
a Win2000 PC but was unable to. While I could generate some
file names containing double-byte (DBCS) characters, chkdsk did
not report any of the problems mentioned in the KB article.

If this was my server then I would do this:
- Run a recursive directory command to ascertain whether there
are file names containing double-byte (DBCS) characters.
- If there are no such characters, look elsewhere for a cause
for this problem.
- If there are such names, write a renaming tool (probably in
C++ for good speed) that renames these files to a unique
string containing ASCII characters only.
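
The scan and rename steps above can be sketched roughly as follows. This is Python rather than the C++ suggested, purely to keep the example short, and it assumes an interpreter is available on a machine that can see the volume. It treats any character outside 7-bit ASCII as suspect, which is broader than double-byte characters alone. All function names and the "renamed_" prefix are made up for illustration.

```python
import os
import uuid


def has_non_ascii(name):
    """True if the name contains any character outside 7-bit ASCII."""
    return any(ord(ch) > 127 for ch in name)


def find_non_ascii_names(root):
    """Yield paths under root whose file or directory name is not pure ASCII."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            if has_non_ascii(name):
                yield os.path.join(dirpath, name)


def ascii_replacement(name):
    """Build a unique ASCII-only name, keeping the extension when it is ASCII."""
    _stem, ext = os.path.splitext(name)
    safe_ext = "" if has_non_ascii(ext) else ext
    return "renamed_" + uuid.uuid4().hex + safe_ext


def rename_non_ascii_files(root, dry_run=True):
    """Rename files (not directories) with non-ASCII names.

    Returns a list of (old_path, new_path) pairs. With dry_run=True
    nothing on disk is changed; the pairs only show what would happen.
    """
    renamed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if has_non_ascii(name):
                old_path = os.path.join(dirpath, name)
                new_path = os.path.join(dirpath, ascii_replacement(name))
                if not dry_run:
                    os.rename(old_path, new_path)
                renamed.append((old_path, new_path))
    return renamed
```

Running `find_non_ascii_names` on its own covers the first two bullets: if it yields nothing, there is nothing to rename and the cause lies elsewhere. The rename step defaults to a dry run so the old-to-new mapping can be reviewed and saved before any file is touched.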
 