Thursday 3 July 2014

APEL EMI-3 upgrade

Here are some notes from Manchester upgrade to EMI-3 APEL. The new APEL is much simpler as it is a bunch of python scripts with a couple of key=value configuration files, rather than java scripts with XML files. It doesn't have YAIM to configure it but since it is much easier to install and configure it doesn't really matter anymore. As an added bonus I found that it's also much faster when it publishes and doesn't require any tedious tuning of how many records at the time to publish.

So Manchester starting point to upgrade was
  • EMI-2 APEL node
  • EMI-2 APEL parsers on EMI-3 cream CEs
    • We have 1 batch system per CE so I haven't tried a configuration in which there is only 1 batch system and multiple CEs
  • In few months we may move to ARC-CE so configuration was done mostly manually
I didn't preserve the old local APEL database since all the records are in the central APEL one anyway.  So the steps to carrie out were the following:
  1. Install a new EMI-3 APEL node
  2. Configure it 
  3. Upgrade the CEs parsers to EMI-3 and point them the new node
  4. Disable the old EMI-2 APEL node and backup its DB
  5. Run the parsers and fill the new APEL node DB
  6. Publish all records for the previous month from the new APEL machine

Install a new EMI-3 APEL node

Installed a vanilla VM with
  • EMI-3 repositories
  • Mysql DB
  • Host certificates
  • ca-policy-egi-core
I did this with puppet since all the bits and pieces were already there for other type of services I just put together the profile for this machine. Then manually I've installed the rpms for APEL
  • yum install --nogpg emi-release
  • yum install apel-ssm apel-client apel-lib

Configure EMI-3 APEL node

I followed the instructions on the official EMI-3 APEL server guide.

There are no tips here I've only changed the obvious fields Like site_name and password plus few others like the top BDII because we have a local one and the location of the hostcertificate because we have a different name.

I didn't install install the publisher cron job at this stage because the machine was not ready yet to publish

Upgrade the CEs parsers to EMI-3 and point them the new node

The CEs as I said are already on EMI-3, only the APEL parsers were still EMI-2 so I disabled the EMI-2 cron job
  • rm /etc/cron.d/glite-apel-pbs-parser  
Installed the EMI-3 APEL  parsers rpm
  • yum install apel-parser
Configured the parsers following the instructions on the official EMI-3 APEL parser guide setting the obvious parameters and installing also the cron job after a trial parsing test.

NOTE: the parser configuration file for me is a bit confusing regarding the batch system name it states

# Batch system hostname.  This does not need to be a definitive hostname,
# but it should uniquely identify the batch system.
# Example: pbs.gridpp.rl.ac.uk
lrms_server =


It seems you can use any name. You are of course better off using your batch system server name. We have one for each CE so the configuration file on each contains that. In the database this will identify the records from each machine CE. I'm not sure about what happens with 1 batch system and several CEs. Following literally one should put only the batch system but then there is no distinction between CEs.

Disable the old EMI-2 APEL node and backup its DB

Just removed the old cron job the machine is still running but it isn't doing anything while waiting to be decomissioned.

Run the parsers and fill the new APEL node DB

You will need to publish an entire month prior to when you are installing. For example for us it was publish all the June records, but since I didn't want to republish everything we had in the log files I moved the batch system and blah log files prior to mid May to a backup subdirectory and parsed only the log files for end of May June. May days were needed because some jobs that finished in June early days had started in May and one wants the complete record. The first jobs to finish in June in Manchester started on the 25th of May so you may want to go back a bit with the parsing.

Publish all records for the previous month from the new APEL machine

Finally on the new machine now filled with the June records plus some May I've done a bit of DB clean up as suggested by the APEL team. If you don't do this step the APEL team will do it centrally before stitching the old EMI-2 record and the new ones 
  • Delete from JobRecords where EndTime<"2014-06-01";
  • Delete from SuperSummaries where Month="5";
After all this I modified the configuration file (/etc/apel/client.cfg) to publish a gap from the 25th of May until the day before I published i.e. 1st of July. I then modified again to put back "latest". I finally installed the cron job also on the new APEL to publish regularly every day.