ENABLE_PERSISTENT_CONFIG = TRUE PERSISTENT_CONFIG_DIR = /etc/condor/ral STARTD_ATTRS = $(STARTD_ATTRS) StartJobs, RalNodeOnline STARTD.SETTABLE_ATTRS_ADMINISTRATOR = StartJobs StartJobs = False RalNodeOnline = FalseI use the prefix "Ral" here because I inherited some of this material from Andrew Lahiffe at RAL! Basically, it's just to de-conflict names. I should have used "Liv" right from the start, but I'm not changing it now. Anyway, the first section says to keep a persistent record of configuration settings; it adds new configuration settings called "StartJobs" and “RalNodeOnline”; it's sets them initially to False; and it makes the START configuration setting dependant upon them both being set. Note: the START setting is very important because the node won't start jobs unless it is True. I also need this. It tells the system (startd) to run a cron script every three minutes.
STARTD_CRON_JOBLIST=TESTNODE STARTD_CRON_TESTNODE_EXECUTABLE=/usr/libexec/condor/scripts/testnodeWrapper.sh STARTD_CRON_TESTNODE_PERIOD=300s # Make sure values get over STARTD_CRON_AUTOPUBLISH = If_ChangedThe testnodeWrapper.sh script looks like this:
#!/bin/bash MESSAGE=OK /usr/libexec/condor/scripts/testnode.sh > /dev/null 2>&1 STATUS=$? if [ $STATUS != 0 ]; then MESSAGE=`grep ^[A-Z0-9_][A-Z0-9_]*=$STATUS\$ /usr/libexec/condor/scripts/testnode.sh | head -n 1 | sed -e "s/=.*//"` if [[ -z "$MESSAGE" ]]; then MESSAGE=ERROR fi fi if [[ $MESSAGE =~ ^OK$ ]] ; then echo "RalNodeOnline = True" else echo "RalNodeOnline = False" fi echo "RalNodeOnlineMessage = $MESSAGE" echo `date`, message $MESSAGE >> /tmp/testnode.status exit 0This just wraps an existing script which I reuse from out TORQUE/MAUI cluster. The existing script just returns a non-zero code if any error happens. To add a bit of extra info, I also lookup the meaning of the code. The important thing to notice is that it echoes out a line to set the RalNodeOnline setting to false. This is then used in the setting of START. Note: on TORQUE/MAUI, the script ran as “root”; here it runs as “condor”. I had to use sudo for some of the sections which (e.g.) check disks etc. because condor could not get smartctl settings etc. Right, so I think that's it. When a node fails the test, START goes to False and the node won't run more jobs. Oh, there's another thing to say. I use two settings to control START. As well as RalNodeOnline, I have the StartJobs setting. I can control this independently, so I can turn a node offline whether or not it has an error. This is useful for stopping the node to (say) rebuild it. It's done on the server, like this.
condor_config_val -verbose -name r21-n01 -startd -set "StartJobs = false" condor_reconfig r21-n01 condor_reconfig -daemon startd r21-n01