Thursday 7 April 2011

Check BDII script updated

Yesterday it was the top BDII that stopped working, rather than the site BDII. It had crashed: the pid file was still there but the process was not running.

So I adjusted the script to use a different query that works at all levels of BDII (resource, site, top): it now looks for o=infosys rather than o=grid and some specific attribute.
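
A minimal sketch of what such a level-independent probe could look like, not the actual contents of testbdii.sh. The function name bdii_responds and the BDII_HOST/BDII_PORT variables are made up for illustration; 2170 is the usual BDII port but your setup may differ.

```shell
#!/bin/sh
# Hypothetical defaults, override from the environment if needed.
BDII_HOST="${BDII_HOST:-localhost}"
BDII_PORT="${BDII_PORT:-2170}"

# Returns 0 if slapd answers a base-scope search on o=infosys, non-zero otherwise.
bdii_responds() {
    # -x    anonymous simple bind
    # -LLL  plain LDIF output without comments
    # -l 10 give up after 10 seconds
    # 1.1   ask for no attributes, only the entry's dn
    ldapsearch -x -LLL -l 10 \
        -H "ldap://${BDII_HOST}:${BDII_PORT}" \
        -b "o=infosys" -s base 1.1 2>/dev/null \
        | grep -qi '^dn: *o=infosys'
}
```

Something like "bdii_responds || echo 'BDII not answering'" is then enough for a cron job, and the same call works unchanged on a resource, site or top BDII.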

I also looked at the bdii startup script. Its stop function does a good job of cleaning up processes and lock/pid files, so I now just use service bdii restart whether the process is there or not. Only the alert differs between the two cases.

New version is still in

http://www.sysadmin.hep.ac.uk/svn/fabric-management/processes/monitoring/testbdii.sh

Monday 4 April 2011

Sharing scripts

In my Northgrid talk at GridPP I pointed out that we all do the same things but in slightly different ways, so I thought it'd be good to resume the thread on sharing management/monitoring tools. I always thought building a repository was a good thing and I still do.

I think the tools should be as generic as possible but do not need to be perfect. Of course it's a bonus if scripts work out of the box, but they might also be useful for improving local tools with additional checks one might not have thought of.

I'll start with a couple of scripts I rewrote last Monday to make them more robust:

-- Check the BDII:

http://www.sysadmin.hep.ac.uk/svn/fabric-management/processes/monitoring/testbdii.sh

The original script checked that a network connection existed; if it didn't, it restarted the bdii service.

The new version checks that slapd is responsive. If it isn't, it checks whether there is a hung process: if there is, it kills it and restarts the bdii; if there isn't, it just restarts the bdii.
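
The decision flow just described can be sketched roughly as below. This is only an illustration, not the script in the repository: check_bdii is a made-up name, the probe is passed in as an arbitrary command, and I am assuming the hung process is called slapd.

```shell
#!/bin/sh
# Usage: check_bdii <probe command...>
# The probe is any command that exits 0 when the BDII answers.
check_bdii() {
    # Healthy: nothing to do.
    if "$@" >/dev/null 2>&1; then
        return 0
    fi

    # Not responsive: is a slapd process still hanging around?
    if pgrep -x slapd >/dev/null 2>&1; then
        echo "BDII not responding: killing hung slapd, then restarting"
        pkill -x slapd
    else
        echo "BDII not running: restarting"
    fi

    service bdii restart
}
```

It could be called as, for example, check_bdii ldapsearch -x -H ldap://localhost:2170 -b o=infosys -s base, with the echo lines replaced by whatever alerting you use locally.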

-- Check Host Certificate End Date:

http://www.sysadmin.hep.ac.uk/svn/fabric-management/certificates/x509/check-host-cert-date.sh

The old version just checked whether the certificate had expired and sent an alert. Not very useful in itself, as it picks up the problem when the damage is already done.

The new version still checks that, because it might be useful if machines have been down for a while, but it also starts sending alerts 30 days before the expiration date. Finally, if the certificate is not there it asks the obvious question: should you be running this script on this machine?
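
For anyone who wants to bolt the same three checks onto a local tool, here is a rough sketch using openssl's -checkend option. It is not the script from the repository: the function name check_host_cert and the return codes are invented for the example, and the certificate path in the usage note is only the usual grid location.

```shell
#!/bin/sh
# Usage: check_host_cert <certificate file> <warning days>
# Returns 0 = fine, 1 = expires soon, 2 = already expired, 3 = no certificate.
check_host_cert() {
    cert="$1"
    days="$2"

    if [ ! -f "$cert" ]; then
        echo "No certificate at $cert: should this script be running on this machine?"
        return 3
    fi

    # -checkend N exits non-zero if the certificate expires within N seconds,
    # so N=0 means "has it already expired?".
    if ! openssl x509 -checkend 0 -noout -in "$cert" >/dev/null 2>&1; then
        echo "Certificate $cert has EXPIRED"
        return 2
    fi

    if ! openssl x509 -checkend $((days * 86400)) -noout -in "$cert" >/dev/null 2>&1; then
        echo "Certificate $cert expires within $days days: $(openssl x509 -enddate -noout -in "$cert")"
        return 1
    fi

    return 0
}
```

A daily cron job could then run something like check_host_cert /etc/grid-security/hostcert.pem 30 and mail any output to the admins.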