Monday, 17 November 2014

Condor Workernode Heath Script

This is a script that makes some checks on the worker node and "turns it off" if it fails any of them. To implement this, I made use a a Condor feature; startd_cron jobs. I put this in my /etc/condor_config.local file on my worker nodes.
PERSISTENT_CONFIG_DIR = /etc/condor/ral
STARTD_ATTRS = $(STARTD_ATTRS) StartJobs, RalNodeOnline
StartJobs = False
RalNodeOnline = False

I use the prefix "Ral" here because I inherited some of this material from Andrew Lahiffe at RAL! Basically, it's just to de-conflict names. I should have used "Liv" right from the start, but I'm not changing it now. Anyway, the first section says to keep a persistent record of configuration settings; it adds new configuration settings called "StartJobs" and “RalNodeOnline”; it's sets them initially to False; and it makes the START configuration setting dependant upon them both being set. Note: the START setting is very important because the node won't start jobs unless it is True. I also need this. It tells the system (startd) to run a cron script every three minutes.

# Make sure values get over
The script looks like this:


/usr/libexec/condor/scripts/ > /dev/null 2>&1

if [ $STATUS != 0 ]; then
  MESSAGE=`grep ^[A-Z0-9_][A-Z0-9_]*=$STATUS\$ /usr/libexec/condor/scripts/ | head -n 1 | sed -e "s/=.*//"`
  if [[ -z "$MESSAGE" ]]; then

if [[ $MESSAGE =~ ^OK$ ]] ; then
  echo "RalNodeOnline = True"
  echo "RalNodeOnline = False"
echo "RalNodeOnlineMessage = $MESSAGE"

echo `date`, message $MESSAGE >> /tmp/testnode.status
exit 0

This just wraps an existing script which I reuse from out TORQUE/MAUI cluster. The existing script just returns a non-zero code if any error happens. To add a bit of extra info, I also lookup the meaning of the code. The important thing to notice is that it echoes out a line to set the RalNodeOnline setting to false. This is then used in the setting of START. Note: on TORQUE/MAUI, the script ran as “root”; here it runs as “condor”. I had to use sudo for some of the sections which (e.g.) check disks etc. because condor could not get smartctl settings etc. Right, so I think that's it. When a node fails the test, START goes to False and the node won't run more jobs. Oh, there's another thing to say. I use two settings to control START. As well as RalNodeOnline, I have the StartJobs setting. I can control this independently, so I can turn a node offline whether or not it has an error. This is useful for stopping the node to (say) rebuild it. It's done on the server, like this.
condor_config_val -verbose -name r21-n01 -startd -set "StartJobs = false"
condor_reconfig r21-n01
condor_reconfig -daemon startd r21-n01