screwed-up-man
Hi there,
We are having very annoying sporadic problems with a server we bought
recently.
The server hangs every few days, and after a power cycle it won't
complete POST or boot for a few minutes (it hangs; in some cases the
screen has no signal).
I think this might be caused by an earlier overheat (in December), but
the symptoms are strange, so please tell me if you have an alternative
diagnosis or a suggestion to fix it.
Unfortunately we bought this server without on-site assistance, so a
call-out is not free. Also, since the problems happen very sporadically
and last only a few minutes, it is hard to get a technician to see them.
The server was assembled by the shop. It is a 24-disk rackmount with two
big turbines on the side and two 60mm fans next to the RAM modules
(16x2GB); the mainboard is a Tyan i5000PW. The two small fans were
mounted to push air out, as the turbines do (air comes in through the
disks at the front). However, the negative pressure created by the
turbines inside the chassis almost stopped the air flow from the small
fans, and I believe the RAM modules overheated, or something else in
that area. I think I received some "CRITICAL OVERHEAT FB-DIMM" error
messages from the Linux kernel once in December. We were using it in a
normal room at normal temperature; it had not yet been moved to the
air-conditioned server room.
When this happened I realized that the RAM modules were overheating and
I turned the small 60mm fans around to push air towards the inside. Now
the RAM modules are cool enough: max 67 degrees Celsius (lm-sensors
i5k_amb) at normal room temperature. Then we shut down the server for
the remainder of December.
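To keep an eye on those readings over time, I could parse the output of
`sensors` with something like the sketch below. This is only a rough
idea: the line shape in the comment is what I assume i5k_amb prints, the
low/high values in it are invented for illustration, and `parse_temps`
and `hottest` are names I made up.

```python
import re

# Assumed shape of an i5k_amb reading as printed by `sensors`, e.g.:
#   Ch. 0 DIMM 0:  +67.0°C  (low = +105.0°C, high = +124.0°C)
# (the low/high numbers here are invented for illustration)
TEMP_LINE = re.compile(
    r"^(?P<label>[^:]+):\s*\+?(?P<temp>-?\d+(?:\.\d+)?)\s*°?C"
)


def parse_temps(sensors_text):
    """Return {label: degrees Celsius} for every line that looks like a reading."""
    temps = {}
    for line in sensors_text.splitlines():
        match = TEMP_LINE.match(line.strip())
        if match:
            temps[match.group("label").strip()] = float(match.group("temp"))
    return temps


def hottest(temps):
    """Return (label, temperature) of the hottest reading, or None if empty."""
    if not temps:
        return None
    return max(temps.items(), key=lambda item: item[1])
```

The text could come from
`subprocess.run(["sensors"], capture_output=True, text=True).stdout`,
assuming lm-sensors is installed and the i5k_amb module is loaded.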
In January I ran memtest86+ v2.01 for a full day without a single
error! (That's why I am not fully convinced it is a RAM problem.) Is
memtest86+ able to properly disable ECC on the 5000P Blackford? Would
you trust memtest results 100%?
Still, yesterday the computer hung again, and when I power-cycled it,
it wouldn't boot or even POST (no screen signal) for 5 minutes, until I
opened the case (it probably cooled down a bit, or was it some spurious
state in the silicon that cleared?) and then it booted again. Then I
closed it and it is still running today!!
What the heck?
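To find out what the machine was doing right before the next hang, I am
thinking of running a small heartbeat logger from cron, along these
lines. It is just a sketch: the function names and the note format are
my own invention, and the fsync is there so the last line has a chance
of reaching the disk before a hard hang.

```python
import os
import time


def append_heartbeat(path, note=""):
    """Append one timestamped line and force it to disk before returning."""
    line = "%s %s\n" % (time.strftime("%Y-%m-%d %H:%M:%S"), note)
    with open(path, "a") as log:
        log.write(line)
        log.flush()
        os.fsync(log.fileno())  # ask the kernel to write the data out now
    return line


def last_heartbeat(path):
    """Return the last non-empty line of the log, or None if there is none."""
    try:
        with open(path) as log:
            lines = [l.rstrip("\n") for l in log if l.strip()]
    except FileNotFoundError:
        return None
    return lines[-1] if lines else None
```

After a hang, `last_heartbeat` on that file would show roughly when the
machine stopped and whatever I put in the note (for example the hottest
DIMM temperature at that moment).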
Frankly, we do have an additional identical mainboard. I might swap it
in (supposing the current one is broken, which I am not sure of), but if
I return the "broken" one to the shop they probably won't replace it,
because it runs correctly 99% of the time. This is a problem for us
because we would like to keep a spare mainboard for the time when
nothing compatible is in production anymore. We are a research entity
and the funding for that project has already ended, so we cannot buy
another one and would prefer not to waste a mainboard. OK, OK, if I
cannot get rid of these hangs I will eventually change the mainboard and
see.
Another thing I could do is turn the fans around again and let it
overheat, so that it (hopefully) fails predictably, but I fear I would
damage something useful (e.g. we do not have spare RAM modules).
I am also thinking about BIOS settings. Is there any setting that could
make it behave like this? That would be the best news.
Yesterday I changed these settings:
Installed OS: Win2K/XP ---> Other (we use Linux, Ubuntu 8.04)
Large Disk Access Mode: DOS ---> Other (maybe I did this wrong
according to the manual)
SERR signal condition: Single bit ---> Both
System Event Logging: ---> Reset Log (it wasn't showing anything... but
it still doesn't)
Parallel Port: Enabled ---> Disabled
Then in Linux I disabled these modules, which were continuously
reporting spurious errors (it has always been like this, also on an
identical machine we have in another building):
i5000_edac
edac_core
Thank you for any help. As you can imagine, this was supposed to be our
main server for the foreseeable future, and we don't have much money
since we are a research entity and that funding has ended. We really
need to have this server working... :-(
I might follow up with more information slowly in this thread (even
across a few weeks) when new things happen.