In my Northgrid talk at GridPP I pointed out that we all do the same things, but in slightly different ways. I thought it would be good to resume the thread on sharing management and monitoring tools. I have always thought building a repository was a good thing, and I still do.
I think the tools should be as generic as possible, but they do not need to be perfect. Of course it's a bonus if scripts work out of the box, but even when they don't they can be useful for improving local tools with additional checks one might not have thought about.
I'll start with a couple of scripts I rewrote last Monday to make them more robust:
-- Check the BDII:
http://www.sysadmin.hep.ac.uk/svn/fabric-management/processes/monitoring/testbdii.sh
The original script checked that a network connection existed; if it didn't, it restarted the bdii service.
The new version checks that slapd is responsive. If it isn't, it looks for a hung process: if there is one, it kills it and restarts the bdii; if there isn't, it just restarts the bdii.
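For anyone who wants to adapt the idea without reading the script first, the decision flow can be sketched roughly as below. This is not the contents of testbdii.sh, only a minimal illustration under my own assumptions: the BDII answers LDAP queries on localhost port 2170 with base o=grid, the service is called bdii, and any slapd process still alive after a failed query is treated as hung. The function names are made up for the example.

```shell
#!/bin/sh
# Illustrative sketch only, not the contents of testbdii.sh.
# Assumptions: BDII LDAP on localhost:2170 with base o=grid, a "bdii" init
# service, and any slapd still alive after a failed query counted as hung.

BDII_PORT=${BDII_PORT:-2170}

# Succeeds only if slapd answers a base-scope query within the time limits.
slapd_responds() {
    ldapsearch -x -LLL -H "ldap://localhost:${BDII_PORT}" \
        -b "o=grid" -s base -l 10 -o nettimeout=10 >/dev/null 2>&1
}

# Prints the PIDs of any slapd processes (empty output if there are none).
find_hung_slapd() {
    pgrep -x slapd || true
}

# Kept as separate functions so a site can swap in its own commands.
kill_pids() {
    kill -9 "$@"
}

restart_bdii() {
    service bdii restart >/dev/null 2>&1
}

# Prints "ok", "killed-and-restarted" or "restarted" depending on the path taken.
check_bdii() {
    if slapd_responds; then
        echo "ok"
        return 0
    fi
    pids=$(find_hung_slapd)
    if [ -n "$pids" ]; then
        # $pids is deliberately unquoted: one argument per PID.
        kill_pids $pids
        restart_bdii
        echo "killed-and-restarted"
    else
        restart_bdii
        echo "restarted"
    fi
}
```

Calling check_bdii from cron and mailing the output whenever it is not "ok" would be one way to use it.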
-- Check Host Certificate End Date:
http://www.sysadmin.hep.ac.uk/svn/fabric-management/certificates/x509/check-host-cert-date.sh
The old version just checked whether the certificate had expired and sent an alert. That is not very useful in itself, as it picks up the problem when the damage is already done.
The new version still checks that, because it might be useful if machines have been down for a while, but it also starts sending alerts 30 days before the expiration date. Finally, if the certificate is not there, it asks the obvious question: should you be running this script on this machine?
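The three cases (missing, already expired, expiring soon) can be sketched with openssl's -checkend option, which exits non-zero if the certificate expires within the given number of seconds. Again, this is a hedged illustration and not the script in the repository: the function name cert_status and the status words are invented for the example, and the certificate path in the usage line is only the conventional grid location, assumed here.

```shell
#!/bin/sh
# Illustrative sketch only, not the contents of check-host-cert-date.sh.
# cert_status CERT_FILE WARN_DAYS
# Prints one of: missing, expired, expiring, ok.
cert_status() {
    cert=$1
    warn_days=$2
    if [ ! -f "$cert" ]; then
        # No certificate: probably the wrong machine for this check.
        echo "missing"
    elif ! openssl x509 -noout -checkend 0 -in "$cert" >/dev/null 2>&1; then
        # Already past its end date (an unreadable file also lands here).
        echo "expired"
    elif ! openssl x509 -noout -checkend $((warn_days * 86400)) \
            -in "$cert" >/dev/null 2>&1; then
        # Still valid, but expires within the warning window.
        echo "expiring"
    else
        echo "ok"
    fi
}
```

For example, cert_status /etc/grid-security/hostcert.pem 30 would give the status with a 30-day warning window; an alert could be mailed for anything other than "ok".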