Introduction
Liverpool recently updated its cluster to SL6. In doing so, a problem occurred whereby the kernel would lock up during normal operations. The signs are unresponsiveness, drop-outs in Ganglia and (later) many "task...blocked for 120 seconds" messages in /var/log/m.. and dmesg.
Description
Kernels in the range 2.6.32-431* exhibited a type of deadlock when run on certain hardware with BIOS dated after 8th March 2010.
This problem occurred on Supermicro hardware, on several main board models.
Notes:
1) No hardware with BIOS dated 8th March 2010 or before showed this defect, even on the same board type.
2) The oldest kernel in the 2.6.32-358 range is solid; this is corroborated by operational experience with that range.
3) All current kernels in the 2.6.32-431 range exhibited the problem on our newest hardware, and on a few older nodes that had had unusual BIOS updates.
Testing
The lock-ups are hard to reproduce, but after a great deal of trial and error, a roughly 90% effective predictor was found.
The procedure is to:
- Build the system completely new in the usual way.
- When Yaim gets to "config_users", use a script (stress.sh) to run 36 threads of gzip and one of iozone (a sketch of such a script is given below).
On a susceptible node, this is reasonably certain to make it lock up after a minute. The signs are the same as above: unresponsiveness and (later) "task...blocked for 120 seconds" messages in /var/log/m.. and dmesg.
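For reference, a minimal sketch of the kind of load stress.sh generates is shown below. The thread bodies, data sizes and iozone options are illustrative assumptions, not a copy of the original script.

#!/bin/bash
# Hypothetical stress.sh-style load generator: 36 gzip workers plus one
# iozone run. Sizes and options are assumptions; runs until interrupted.

# 36 gzip threads, each compressing pseudo-random data in a loop.
for i in $(seq 1 36); do
    ( while true; do
          head -c 100M /dev/urandom | gzip > /dev/null
      done ) &
done

# One iozone pass doing sequential write (-i 0) and read (-i 1) tests.
iozone -s 1g -r 64k -i 0 -i 1 -f /tmp/iozone.scratch > /tmp/iozone.log &

wait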
I observed that if the procedure is not followed "exactly", it is
unreliable as a predictor. In particular, if you stop Yaim and try
again, the predictor is useless.
To test that, I isolated the
config_users script from Yaim, and ran it separately along with the
stress.sh script. Result: useless - no lock-ups were seen.
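In outline, the isolated test looked something like the following; the script names and the way config_users was extracted are assumptions based on the description above, since Yaim normally calls config_users internally.

# Hypothetical outline of the isolated test: run the stress load in the
# background, then the extracted user-creation step on its own.
./stress.sh &
STRESS_PID=$!
./config_users.sh        # extracted copy of the Yaim config_users step
kill "$STRESS_PID"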
Note: This result was rather unexpected because the isolated
config_users.sh script works in the same way as the original.
Unsuccessful Theories
A great many theories were tested and rejected or not pursued further (APIC problems, disk problems, BIOS differences, various kernels, examination of kernel logs, much googling, etc.). Eventually, a seemingly successful theory was stumbled upon, which I describe below.
The Successful Theory
All our nodes had unusual vm settings:
# grep dirty /etc/sysctl.conf
vm.dirty_background_ratio = 100
vm.dirty_expire_centisecs = 1800000
vm.dirty_ratio = 100
These custom settings facilitate the storage of ATLAS "short files" in RAM. Essentially, they keep files in the page cache and off disk for a long time (1800000 centiseconds is five hours), allowing very fast access.
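The values live in /etc/sysctl.conf and can be loaded and checked without a reboot, for example:

# Load the custom values from /etc/sysctl.conf and confirm the live settings.
sysctl -p
sysctl vm.dirty_background_ratio vm.dirty_expire_centisecs vm.dirty_ratio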
The modification had been tested almost exhaustively for several years on earlier kernels, but perhaps some change (or latent bug?) in the newer kernel had invalidated it somehow.
We came up with the idea that the issue originates in the memory operations that occur prior to Yaim/config_users. This would explain why anything but the exact activity created by the procedure might well fail to trigger the defect. We thought this could tally with the ATLAS "short file" modifications in sysctl.conf. The theory is that these mods set up the problem during the memory read/write operations (i.e. the asynchronous loading and flushing of the page cache by the OS).
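One way to watch the behaviour this theory points at is to monitor the kernel's dirty-page counters while the workload runs; under the custom settings the Dirty figure can grow very large before writeback kicks in. A minimal example (not part of the original test procedure):

# Watch the kernel's dirty and writeback page counters once per second.
watch -n 1 'grep -E "^(Dirty|Writeback):" /proc/meminfo'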
To test this, I used the predictor on susceptible nodes, but without applying the ATLAS "short file" patch. Default vm settings were adopted instead.
Result
Very satisfying at last: absolutely no sign of the defect. As the ATLAS "short file" patch is not very beneficial given the current data traffic, we have decided to go back to the default "vm.dirty" settings and monitor the situation carefully.
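Reverting means removing the three custom lines from /etc/sysctl.conf, or setting them back to the stock values. For reference, the usual SL6 defaults are quoted below; these are assumed stock values and should be verified against an unmodified node.

# Stock SL6 defaults (assumed; verify on a freshly installed node)
vm.dirty_background_ratio = 10
vm.dirty_expire_centisecs = 3000
vm.dirty_ratio = 20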