- Monitoring improvements - I've configured Nagios and John Bland is rolling out Ganglia, both of which have already proved very useful. We're also continuing to improve environmental monitoring here, particularly for detecting failures in the water-cooling system (see the sketch after this list).
- Significant hardware maintenance, including replacing two failed Dell PowerConnect 5224 switches in a couple of the water-cooled racks with new HP ProCurve 1800s - more difficult than it should have been due to the water-cooling design - and numerous node repairs.
- Network topology improvements, including installation of a new firewall/router.
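On the environmental monitoring side, the plan is to hang simple checks for the water-cooling system off Nagios. The sketch below shows roughly what such a plugin looks like; the sensor path and temperature thresholds are made up for illustration and are not our actual configuration.

```python
#!/usr/bin/env python
# Minimal sketch of a Nagios-style temperature check for a water-cooled rack.
# The sensor file and thresholds below are assumptions, not our real setup.
import sys

SENSOR = "/sys/class/hwmon/hwmon0/temp1_input"  # hypothetical sensor, millidegrees C
WARN_C = 35.0
CRIT_C = 45.0

def main():
    try:
        with open(SENSOR) as f:
            temp_c = int(f.read().strip()) / 1000.0
    except (IOError, ValueError) as err:
        # Nagios treats exit code 3 as UNKNOWN
        print("UNKNOWN - could not read sensor: %s" % err)
        return 3
    if temp_c >= CRIT_C:
        print("CRITICAL - rack temperature %.1fC" % temp_c)
        return 2
    if temp_c >= WARN_C:
        print("WARNING - rack temperature %.1fC" % temp_c)
        return 1
    print("OK - rack temperature %.1fC" % temp_c)
    return 0

if __name__ == "__main__":
    sys.exit(main())
```

Nagios only cares about the exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and the first line of output, so the same pattern works for any environmental sensor we can read from the node.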
Most of this week was spent trying to identify why Steve Lloyd's ATLAS tests were mostly being aborted and why large numbers of ATLAS production jobs were failing here, mostly with the EXECG_GETOUT_EMPTYOUT error. I eventually traced the main problem to the existing ssh configuration on our batch cluster, where a number of host keys for worker nodes were missing from the CE. This (along with a couple of other issues) has now been fixed, and hopefully we'll see a large improvement in site efficiency as a result.
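For reference, the missing-key problem is easy to check for and fix with ssh-keyscan. The sketch below assumes a plain (unhashed) ssh_known_hosts file and a hypothetical one-hostname-per-line node list; both paths would need adjusting for the real CE.

```python
#!/usr/bin/env python
# Sketch: find worker-node host keys missing from the CE's known_hosts,
# and fetch them with ssh-keyscan. Paths and file format are assumptions.
import subprocess

KNOWN_HOSTS = "/etc/ssh/ssh_known_hosts"
NODE_LIST = "/etc/nodes"   # hypothetical: one worker-node hostname per line

def known_host_names(path):
    names = set()
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            # first field is a comma-separated list of host names/addresses
            # (this assumes the file is not using hashed host names)
            for name in line.split()[0].split(","):
                names.add(name)
    return names

def main():
    known = known_host_names(KNOWN_HOSTS)
    with open(NODE_LIST) as f:
        nodes = [l.strip() for l in f if l.strip()]
    for node in nodes:
        if node in known:
            continue
        print("missing host key for %s" % node)
        # ssh-keyscan prints known_hosts-format lines; append them after review
        subprocess.call(["ssh-keyscan", "-t", "rsa", node])

if __name__ == "__main__":
    main()
```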
While investigating this, I also noticed a large number of defunct tar processes left over on multiple nodes by the atlasprd user, which had been there for up to 16 days. We're not sure what caused these processes to fail to exit, so any insights on that would be welcome.
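In the meantime, something along the lines of the sketch below could be used to spot (and eventually cull) the stale tar processes. The age threshold is arbitrary, and note that a truly defunct (zombie) process can't be killed directly; it has to be reaped by, or cleared by killing, its parent.

```python
#!/usr/bin/env python
# Sketch: report stale atlasprd tar processes by reading /proc directly.
# The user name and age threshold are assumptions; killing is left commented out.
import os
import pwd

USER = "atlasprd"
MAX_AGE_DAYS = 2
CLK_TCK = os.sysconf("SC_CLK_TCK")

def uptime_seconds():
    with open("/proc/uptime") as f:
        return float(f.read().split()[0])

def main():
    uid = pwd.getpwnam(USER).pw_uid
    up = uptime_seconds()
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            if os.stat("/proc/" + pid).st_uid != uid:
                continue
            with open("/proc/%s/stat" % pid) as f:
                data = f.read()
        except (OSError, IOError):
            continue  # process exited while we were looking
        comm = data[data.index("(") + 1:data.rindex(")")]
        fields = data[data.rindex(")") + 2:].split()
        state, ppid, starttime = fields[0], fields[1], int(fields[19])
        age_days = (up - starttime / float(CLK_TCK)) / 86400.0
        if comm == "tar" and age_days > MAX_AGE_DAYS:
            if state == "Z":
                # zombies can't be killed; their parent (ppid) has to reap them
                print("zombie tar pid %s, age %.1fd, parent %s" % (pid, age_days, ppid))
            else:
                print("stale tar pid %s, age %.1fd" % (pid, age_days))
                # os.kill(int(pid), 15)  # uncomment to actually cull (SIGTERM)

if __name__ == "__main__":
    main()
```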
Finally, Paul Trepka has been bringing up a new deployment system for the LCG racks - see him for details.
1 comment:
We see the atlasprd rogue tar processes too; you may want to implement a culling procedure as described here: Processes_On_Batch_Nodes