Monday 18 June 2007

Flatline!


Last week was moderately annoying for the Lancaster CE, with hundreds of jobs failing immediately on WNs due to skewed clocks. The ntpd service was running correctly, so we were in the dark about the cause. After trying to re-sync manually with ntpdate, it became apparent that something was wrong with the university NTP server: it only responded to a fraction of requests. The problem turned out to be with "ntp.lancs.ac.uk", which is an alias for these machines:
ntp.lancs.ac.uk has address 148.88.0.11
ntp.lancs.ac.uk has address 148.88.0.8
ntp.lancs.ac.uk has address 148.88.0.9
ntp.lancs.ac.uk has address 148.88.0.10

Only 148.88.0.11 is responding, so I raised a ticket with ISS and look forward to a fix. In the meantime, the server has been changed to 148.88.0.11 in the ntp.conf file managed by cfengine, and the change has rolled out without a problem.
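
For anyone wanting to repeat the "which address actually answers?" test without ntpdate, here is a rough Python sketch that sends a bare SNTP client packet to each address and reports whether a reply comes back. It is only an illustration: the function names (build_request, parse_transmit_time, probe, check_all) and the 2 second timeout are my own choices, not anything from our setup, and a single lost UDP packet will also show up as "no reply", so run it a few times before blaming a server.

```python
import socket
import struct

NTP_PORT = 123
# Seconds between the NTP epoch (1900-01-01) and the Unix epoch (1970-01-01).
NTP_EPOCH_OFFSET = 2208988800


def build_request():
    """Return a minimal 48-byte SNTP client request (LI=0, VN=3, Mode=3)."""
    return b"\x1b" + 47 * b"\0"


def parse_transmit_time(packet):
    """Extract the server transmit timestamp (bytes 40-47) as Unix seconds."""
    if len(packet) < 48:
        raise ValueError("short NTP packet")
    seconds, fraction = struct.unpack("!II", packet[40:48])
    return seconds - NTP_EPOCH_OFFSET + fraction / 2**32


def probe(address, timeout=2.0):
    """Return the server's time, or None if it does not answer in time."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_request(), (address, NTP_PORT))
            reply, _ = sock.recvfrom(512)
            return parse_transmit_time(reply)
        except (OSError, ValueError):
            # Timeouts, unreachable hosts and malformed replies all count as failures.
            return None


def check_all(addresses, timeout=2.0):
    """Probe every address once and map it to its reply time (or None)."""
    return {address: probe(address, timeout) for address in addresses}
```

Calling check_all(["148.88.0.8", "148.88.0.9", "148.88.0.10", "148.88.0.11"]) returns a dictionary in which the unresponsive servers map to None.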

Just to stick the boot in, an unrelated issue caused our job slots to be completely vacated over the weekend, and we've started to fail Steve's Atlas tests. This is due to a bad disk on a node, which went read-only. I need to find the exact failure mode in order to write yet another WN health check, since this one slipped past the existing checks. :-( I'm currently at Michigan State Uni (Go Spartans!) for the DZero workshop, and the crippled wireless net makes debugging painful.
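
Since the exact failure mode is still unknown, the following is only one possible shape for such a health check, sketched in Python. It assumes the problem shows up as a filesystem listed with the "ro" option in /proc/mounts, which may not be what actually happened on this node. The function names and the expected_ro parameter (for mounts that are read-only on purpose) are made up for illustration.

```python
def read_only_mounts(mounts_text, expected_ro=()):
    """Return mount points whose option list contains 'ro'.

    mounts_text is the content of /proc/mounts; expected_ro lists mount
    points that are deliberately read-only and should be ignored.
    """
    bad = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point = fields[1]
        options = fields[3].split(",")
        # Match the exact option "ro", not substrings like "errors=remount-ro".
        if "ro" in options and mount_point not in expected_ro:
            bad.append(mount_point)
    return bad


def node_is_healthy(mounts_path="/proc/mounts", expected_ro=()):
    """Return True if no unexpected read-only filesystem is mounted."""
    with open(mounts_path) as handle:
        return not read_only_mounts(handle.read(), expected_ro)
```

A check like this would only catch the case where the kernel has already remounted the filesystem read-only; a disk that is failing in some other way would need a different test.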
