[Dirvish] USB drives causing server lockups.(somewhat ontopic)

Richard dirvish at rain4us.net
Thu Aug 21 04:51:54 UTC 2008

While I've seen my dirvish banks on reiserfs-formatted drives become 
corrupted and lock up a server, I had never seen it with ext2/3 
drives.  I THOUGHT I had just run across a case of ext2/3 file system 
corruption causing a server hang, but now I'm beginning to wonder.

The dirvish restore to the new server hardware went smoothly (mostly 
without hiccups -- there were a few drivers I had to compile for the new 
SCSI card and NICs) and I was looking forward to smooth sailing.  
Unfortunately that hasn't been the case.  I have been having issues on 
this new hardware.  Attempting heavy write access to a USB drive 
containing one of my banks causes the server to lock up.  The file 
system contained errors, and since I had a backup copy of the vaults in 
that bank I decided to back up my vault configs and reformat the drive 
fresh.  I had previously reformatted reiserfs filesystems to 'fix' 
corruption that caused lockups, and I was surprised that I seemed to 
have the same issue with ext2/3.  When I kicked off a reformat on the 
dirvish bank drive, the server wrote about 147 of its inode allocations 
and then just paused.  At first the server was still pingable, but that 
quickly deteriorated.

The Num Lock key still worked but the console was unresponsive.  Using 
the Magic SysRq commands allowed me to Sync, Unmount and reBoot the 
server mostly gracefully, but now I am wondering what technical 
situations could lead to a server hanging on USB disk access.

The 2.4.20 kernel that is running was stable on the old hardware (yes, I 
know... that was the OLD hardware).  I fear that a kernel upgrade will 
be necessary on this new hardware, but I'm hoping someone else on the 
list has seen a problem similar to this one and can offer suggestions.  
I am not looking forward to getting an upgraded kernel patched, 
compatible and ready to run Dead-Gateway-Detection (DGD), MPPE, and UML 
processes only to find that the problem is hardware related, BIOS 
setting related or due to some other such cause.

Troubleshooting steps taken so far include removing the add-on USB PCI 
card and disabling SMP in the kernel (so that processes on the server 
would quit going into uninterruptible sleep (state D)).
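
For anyone who wants to check for the same symptom, here is a rough 
sketch of how processes stuck in uninterruptible sleep could be listed 
by reading /proc/<pid>/stat, where the field after the parenthesised 
command name is the process state.  This is not something I ran on the 
affected box; it assumes a Python 3 interpreter is available, and the 
function names (parse_state, d_state_processes) are made up for 
illustration.

```python
import os


def parse_state(stat_line):
    """Return (pid, comm, state) parsed from one /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so locate the LAST closing parenthesis.
    open_idx = stat_line.index("(")
    close_idx = stat_line.rindex(")")
    pid = int(stat_line[:open_idx].strip())
    comm = stat_line[open_idx + 1:close_idx]
    # The first field after the command name is the one-letter state.
    state = stat_line[close_idx + 1:].split()[0]
    return pid, comm, state


def d_state_processes(proc_root="/proc"):
    """List (pid, comm) for every process currently in state D."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_state(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited, or the line was not parseable
        if state == "D":
            found.append((pid, comm))
    return sorted(found)


if __name__ == "__main__":
    for pid, comm in d_state_processes():
        print(pid, comm)
```

Running this in a loop while the USB drive is under heavy write load 
would show which processes pile up in state D before the box stops 
responding, assuming the machine stays alive long enough to run it.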


