[Dirvish] USB drives causing server lockups. (somewhat on-topic)
dirvish at rain4us.net
Thu Aug 21 04:51:54 UTC 2008
While I've seen my dirvish banks running on reiserfs formatted drives
get corrupt and lock up a server, I had never seen it with ext2/3
drives. I THOUGHT I had just run across an ext2/3 file system
corruption = server hang but now I'm beginning to wonder.
The dirvish restore to the new server hardware went smoothly (mostly
without hiccups; there were a few drivers I had to compile for the new
SCSI card and NICs) and I was looking forward to smooth sailing.
Unfortunately that hasn't been the case. I have been having issues on
this new hardware. Attempting heavy write access to a USB drive
containing one of my banks causes the server to lock up. The file
system contained errors, and since I had a backup copy of the vaults in
that bank I decided to back up my vault configs and reformat the drive
fresh. I had previously reformatted reiserfs filesystems to 'fix'
corruption that caused lockups and I was surprised that I seemed to have
the same issue with ext2/3. When I kicked off a reformat on the dirvish
bank drive, the server wrote about 147 of its inode allocations and
then just paused. At first the server was still pingable,
but that quickly deteriorated.
The numlock worked but the console was unresponsive. Use of the Magic
SysRq commands allowed me to Sync, Unmount and reBoot the server mostly
gracefully but now I am wondering what technical situations could lead
to a server hanging on USB disk access.
The 2.4.20 kernel that is running was stable on the old hardware (yes, I
know... that was the OLD hardware). I fear that a kernel upgrade will be
necessary on this new hardware, but I'm hoping someone else on the list
has seen a problem similar to this one and can offer suggestions. I am
not looking forward to getting an upgraded kernel patched,
compatible and ready to run Dead-Gateway-Detection (DGD), mppe, and uml
processes only to find that the problem is hardware related, BIOS
setting related or some other such cause.
Troubleshooting steps taken so far include removing the add-on USB PCI
card and disabling SMP in the kernel (so that processes on the server
would quit going into uninterruptible sleep (state D)).
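
For anyone who wants to watch for processes stuck in uninterruptible
sleep while reproducing this, here is a rough sketch of how the D state
can be spotted by reading /proc/<pid>/stat. It assumes a Python 3
interpreter is available on the box and that /proc is mounted; the
function names parse_stat_line and find_uninterruptible are my own for
illustration and are not part of dirvish or any existing tool.

```python
import os


def parse_stat_line(line):
    """Split one /proc/<pid>/stat line into (pid, command, state)."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split on the first "(" and the last ")".
    open_paren = line.index("(")
    close_paren = line.rindex(")")
    pid = int(line[:open_paren].strip())
    comm = line[open_paren + 1:close_paren]
    # The single-letter process state is the first field after the name.
    state = line[close_paren + 1:].split()[0]
    return pid, comm, state


def find_uninterruptible(proc_root="/proc"):
    """Return (pid, command) pairs for processes currently in state D."""
    blocked = []
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return blocked  # /proc not mounted or not readable
    for entry in entries:
        if not entry.isdigit():
            continue  # only numeric directories are processes
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                pid, comm, state = parse_stat_line(handle.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited mid-scan or the line was unreadable
        if state == "D":
            blocked.append((pid, comm))
    return blocked


if __name__ == "__main__":
    for pid, comm in find_uninterruptible():
        print(pid, comm)
```

Running this in a loop from a second console while kicking off the heavy
USB write should show which processes pile up in D state before the
machine stops responding. It only reports a snapshot, so a process that
is briefly in D during normal disk I/O will also show up.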