Last Wednesday was a scheduled downtime for Lancaster to do the SL4 upgrade on the WNs, along with some assorted spring cleaning of other services. We can safely say this was our worst upgrade experience so far: a single day turned into a three-day downtime. Fortunately for most, this was self-inflicted pain rather than a middleware issue. The fabric stuff (PXE, kickstart) went fine; our main problem was getting consistent users.conf and groups.conf files for the YAIM configuration, especially with the pool sgm/prd accounts and the DNS-style VO names such as supernemo.vo.eu-egee.org. The latest YAIM 3.1 documentation provides a consistent description, but our CE still used the 3.0 version, so a few tweaks were needed (YAIM 3.1 has since been released for gLite 3.0). Another issue was our wise old CE (lcg-CE) carrying a lot of crust from previous installations, in particular some environment variables that affected the YAIM configuration so that the newer vo.d/ files were not considered. Finally, we needed to ensure the new sgm/prd pool groups were added to the torque ACLs, but YAIM does a fine job of this should you choose to use it along with the _GROUP_ENABLE variables.
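For anyone facing the same users.conf/groups.conf puzzle, a quick cross-check along these lines might have saved us some time. It is only a minimal sketch, not part of YAIM: it assumes the colon-separated layouts as we understand them from the YAIM documentation (users.conf as UID:LOGIN:GIDs:GROUPs:VO:FLAG: and groups.conf as "FQAN":GROUP:GID:FLAG:[VO]), and the account names, UIDs and GIDs in the sample are made up. It reports every groups.conf mapping whose flag (sgm, prd, or none for ordinary pool accounts) has no matching account for that VO in users.conf.

```python
from collections import defaultdict


def parse_users(text):
    """Return {vo: set of flags} from users.conf-style text.

    Assumed layout (not verified against every YAIM release):
    UID:LOGIN:GID1[,GID2]:GROUP1[,GROUP2]:VO:FLAG:
    """
    flags = defaultdict(set)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        if len(fields) < 6:
            continue  # not a complete entry, skip it
        flags[fields[4]].add(fields[5])
    return flags


def parse_groups(text):
    """Return [(fqan, flag, vo)] from groups.conf-style text.

    Assumed layout: "FQAN":GROUP:GID:FLAG:[VO]
    If the VO field is empty, the VO is taken from the first FQAN component.
    """
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith('"'):
            continue  # blank line, comment, or unquoted entry
        end = line.find('"', 1)
        if end == -1:
            continue  # unterminated quote
        fqan = line[1:end]
        rest = line[end + 1:].split(":")  # ['', GROUP, GID, FLAG, VO]
        flag = rest[3] if len(rest) > 3 else ""
        vo = rest[4] if len(rest) > 4 else ""
        if not vo:
            vo = fqan.strip("/").split("/")[0]
            if vo.startswith("VO="):  # older "/VO=name/GROUP=..." style
                vo = vo[3:]
        entries.append((fqan, flag, vo))
    return entries


def missing_accounts(users_text, groups_text):
    """List (vo, flag, fqan) for mappings with no matching account flag."""
    have = parse_users(users_text)
    return [
        (vo, flag, fqan)
        for fqan, flag, vo in parse_groups(groups_text)
        if flag not in have.get(vo, set())
    ]


# Hypothetical sample data: plain and sgm accounts exist, prd ones do not.
USERS = """\
20001:snemo001:2000:snemo:supernemo.vo.eu-egee.org::
20050:snemosgm01:2001,2000:snemosgm,snemo:supernemo.vo.eu-egee.org:sgm:
"""

GROUPS = """\
"/supernemo.vo.eu-egee.org/ROLE=lcgadmin":::sgm:
"/supernemo.vo.eu-egee.org/ROLE=production":::prd:
"/supernemo.vo.eu-egee.org"::::
"""

for vo, flag, fqan in missing_accounts(USERS, GROUPS):
    print(f"{vo}: no '{flag or 'plain'}' account for {fqan}")
```

With the sample above, the only complaint is about the production role, since no prd account is defined for the VO. This checks flags only; it does not verify that the GIDs in users.conf actually exist on the nodes or in the torque ACLs.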
Anyway, things look good again, with many biomed jobs, some atlas, dzero, hone and even a moderate number of lhcb jobs, which were supposed to have issues with SL4.
On the whole, the YAIM configuration went well, although the VO information at CIC could still be improved by including the mapping requirements from VOMS groups to GIDs. LHCb provide a good example to other VOs, with explanations.