Wednesday 29 April 2009

NFS bug in older SL5 kernel

As mentioned previously ( http://northgrid-tech.blogspot.com/2009/03/replaced-nfs-servers.html ) we have recently upgraded our NFS servers and they now run on SL5. Shortly after going into production all LHCb jobs stalled at Manchester and we were blacklisted by the VO.

We were advised that it may be a lockd error, and asked to use the following python code to diagnose this:

-------------------------------------------------------------------
import fcntl
fp = open("lock-test.txt", "a")
fcntl.lockf(fp.fileno(), fcntl.LOCK_EX|fcntl.LOCK_NB)
-------------------------------------------------------------------


The code did not give any errors and we therefore discounted this as the problem. Wind the clock on a fortnight (including a week's holiday over Easter) and we still have not found the problem so I tried the above code again, and bingo lockd was the problem. A quick search of the SL mailing list pointed me to this kernel bug
https://bugzilla.redhat.com/show_bug.cgi?id=459083

A quick update of the kernel and reboot and the problem was fixed.

No comments: