Trying to stabilize Sheffield cluster.
After the scheduled power outage the nodes didn't restart properly and some of the old jobs needed to be cleaned up. After that the cluster was ok apart from the BDII dropping out. We have applied the famous Kostas patch
https://savannah.cern.ch/bugs/?16625
which is getting into the release after 1.5 years Hurray!!!
The stability of the BDII has improved and DPM seems stable. The SAM tests have been stable over the weekend and the today Steve Atlas tests showed a 96% availability which is a big improvement. However the cluster filled up this morning and the instability reappeared, sign that there is still something to fix on the worker nodes and in the scheduling. Added a reservation for ops and looking at the WNs some of which were re-imaged this morning.
No comments:
Post a Comment