Wednesday, 17 December 2008

Manchester General Changes

* Enabled users pilots for Atlas and Lhcb. Currently Lhcb is running a lot of jobs and although most is production many are from their generic users. Atlas instead seems to have almost disappeared.

* Enabled NGS VO and passed the first tests. Currently in the conformance test week.

* Enabled one shared queue and completely phased out the VO queues. This has required a transition period for some VOs to give the time to clear the jobs from the old queue and or to reconfigure their tools. This has greatly simplified the maintainance.

* Installed a top-level BDII and reconfigured the nodes to query the local top level BDII instead of the RAL one. This was actually quite easy and we should have done it earlier.

* Cleaned up old parts of cfengine that were causing the servers to be overloaded, not serve the nodes correctly and fire off thousands of emails a day. Mostly this was due to an overlap in the way cfexecd was run both as a cron job and as a daemon. However we also increased TimeOut and SplayTime values and introduced explicitely the schedule parameter in cfagent.conf. Since then cfengine hasn't had anymore problems.

* Increased usage of YAIM local/post functions to apply local overrides or minor corrections to yaim default. Compared to inserting the changes in cfengine this method has the benefit of being integrated and predictable. When we run yaim the changes are applied immediately and don't get overridden.

* New storage: our room is full and when the clusters are loaded we hit the power/cooling limit and we risk to draw power from other rooms, due to this problem the CC people don't want us to switch on new equipment without switching off some old one to maitain the balance in the power consumption. So eventually we have bought 96 TB of raw space to get us going. The kit has arrived yesterday and needs to be installed in the rack we have and power measures need to be taken to avoid switching off more than the necessary amount of nodes. Luckily it will not be many anyway (even taking the nominal value on the back of the new machines it would be 8 nodes but with better power measures it could be as many as 4) because the new machines consume much less then the DELL nodes that are now 4 years old. However buying new CPUs/storage cannot be done without switching off a significant fraction of the current CPUs before switching on the new kit and requires working in tight cooperation with CCS people which has now been agreed after a meeting I had last week with them and their management.

No comments: