This morning I came in to find we were again failing the SAM tests, this time with the ever-so-helpful
"Cannot plan: BrokerHelper: no compatible resources"
which points to a problem deep in the batch system. Looking at our queues (via `showq`), there were a lot of Idle jobs, yet more than enough CPUs. The PBS logs revealed a new error message,
"Cannot execute at specified host because of checkpoint or ..."
for two of the jobs, and I eventually managed to track the problem down to a node. Seeing as there was no longer any sign of the job file, and PBS was refusing to re-run the job on another node, I had to resort to the trusty `qdel`. After the system thought about it for the barest of moments, all of the Idle jobs woke up and started running.
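For anyone hitting the same thing, here is a rough sketch of how the offending jobs could be picked out of the PBS server log rather than hunted down by hand. It is not what I actually ran. It assumes the usual semicolon-separated log layout (timestamp; event code; daemon; object type; object id; message), which may differ on your installation. The function name `stuck_job_ids` is made up for illustration, and the sample lines and host names are invented.

```python
# Assumption: each server log line looks like
#   timestamp;event code;daemon;object type;object id;message
# Adjust the field positions if your PBS/Torque version logs differently.
ERROR_TEXT = "Cannot execute at specified host because of checkpoint"


def stuck_job_ids(log_lines, needle=ERROR_TEXT):
    """Return job IDs, in first-seen order, whose message contains needle."""
    seen = []
    for line in log_lines:
        # Split into at most six fields so semicolons in the message survive.
        fields = line.rstrip("\n").split(";", 5)
        if len(fields) < 6 or fields[3] != "Job":
            continue  # not a job record, or not the layout we assumed
        job_id, message = fields[4], fields[5]
        if needle in message and job_id not in seen:
            seen.append(job_id)
    return seen


# Invented sample lines, purely to show the shape of the input.
sample = [
    "04/01/2008 08:58:01;0008;PBS_Server;Job;101.ce.example.org;Job Run at request of root",
    "04/01/2008 09:00:12;0080;PBS_Server;Job;102.ce.example.org;" + ERROR_TEXT,
    "04/01/2008 09:00:40;0080;PBS_Server;Job;102.ce.example.org;" + ERROR_TEXT,
    "04/01/2008 09:01:05;0080;PBS_Server;Job;104.ce.example.org;" + ERROR_TEXT,
    "04/01/2008 09:02:00;0002;PBS_Server;Svr;PBS_Server;some unrelated server message",
]

# Print the commands instead of running them, so they can be checked first.
for job_id in stuck_job_ids(sample):
    print("qdel", job_id)
```

In practice you would feed it the real log file (for example `stuck_job_ids(open(path))`, with `path` pointing at wherever your server logs live) and look over the printed `qdel` lines before running any of them.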
Just for some gratuitous cross-linking, Steve Traylen appears to have provided a solution over at the Scotgrid blog.