Wednesday 30 November 2011

DPM upgrade 1.7.4 -> 1.8.2 (glite 3.2)

Last week I upgraded our DPM installation. It was a major change because I upgraded not only the DPM version but also the hardware and the backend mysql version.

I didn't take any measurements before and after this time. I knew that becoming an alpha site in ATLAS was taking its toll on the old hardware, and many of the timeouts came from gridftp, but the MySQL timeouts I talked about in previous posts had reappeared, to the point that even restarting the service was hard.

[ ~]# service mysqld restart
Timeout error occurred trying to stop MySQL Daemon.

Stopping MySQL: [FAILED]

Timeout error occurred trying to start MySQL Daemon.


So I decided that the situation had become unsustainable and it was time to move to better hardware and software versions.

* Hardware: 2 cpu, 4GB mem, 2x250 GB raid1 -> 4 cores (HT on = 8 job slots), 24GB mem, 2x2TB raid1

There is no "why" here: the old machine was OK when we had limited access, but the recent load was really too much for it, even with all the tuning. Bad blocks on the disks are a possibility, but the machine showed no red LEDs and reported no hardware errors.

* Mysql: 5.0.77 -> 5.5.10

Why MySQL 5.5? Because InnoDB is the default engine and its performance and instrumentation have been improved, on top of other things that we might actually start to use. A good blog article about the reasons to move is this one: 5 good reasons to upgrade to mysql 5.5.

MySQL 5.5 is not in EPEL yet, but I found this CentOS community site that has the rpms and the instructions to install them.

After the installation I also optimized the database, partly by reapplying what I had already done in July and partly by running a handy script, mysqltuner.pl. The script helps with variables you might not even know about, and for those you do know it tells you if they are set too small. You need to be patient and let a few hours pass before running it again.
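
The wait before re-running mysqltuner.pl can be checked rather than guessed. The following is only a minimal sketch: the helper names uptime_seconds and ready_for_tuner and the 6 hour threshold are assumptions made for illustration, and the only thing it relies on is the "Uptime:" field printed by mysqladmin status.

```shell
# uptime_seconds: read `mysqladmin status` output on stdin and print
# the number of seconds that follows "Uptime:".
uptime_seconds() {
    sed -n 's/.*Uptime:[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# ready_for_tuner MIN_HOURS: succeed only if the uptime read from stdin
# is at least MIN_HOURS hours (MIN_HOURS is an arbitrary, site-chosen value).
ready_for_tuner() {
    secs=$(uptime_seconds)
    [ -n "$secs" ] && [ "$secs" -ge $(( $1 * 3600 )) ]
}

# Hypothetical usage (the 6 is an assumption, not a recommendation):
#   mysqladmin -u root -p status | ready_for_tuner 6 && perl mysqltuner.pl
```

The check is deliberately kept separate from the tuner run, so it can also be used in a cron job or a monitoring probe.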

* DPM: 1.7.4 -> 1.8.2

Why DPM 1.8.2 from glite 3.2? I would have gone for the UMD release or even the EMI one, but glite 3.2 was moved to production earlier than those, and since I had been waiting for this release since at least April I didn't think twice when I saw the escape route. It was really good timing too, as it happened when I really couldn't postpone an upgrade any longer. You can find more info in the release notes. Among other reasons to upgrade: srmv2.2 in 1.7.4 has a memory leak which wasn't noticeable while the load was contained, but for us it exploded in October and is the reason I had to restart it every two days in the past few weeks.

Below are the steps I took to reinstall the head node.

On the old head node

* Set the site in downtime, drain the queues and kill all the remaining jobs.

* Turn off all the dpm and bdii services on the old head node

* Make a dump of the current database for backup

mysqldump -C -Q -u root -p -B dpm_db cns_db > dpm.sql-20111125.gz
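
One thing worth knowing about the command above: -C (--compress) only compresses the traffic between the client and the server, so the file written here is plain SQL despite its .gz name, which is why it can later be fed straight to the mysql client. A minimal sketch for checking what a dump file really contains follows; the helper names is_gzip and dump_reader are made up for illustration.

```shell
# is_gzip FILE: succeed if FILE starts with the gzip magic bytes 1f 8b.
is_gzip() {
    magic=$(head -c 2 "$1" | od -An -tx1 | tr -d ' \n')
    [ "$magic" = "1f8b" ]
}

# dump_reader FILE: print the dump as plain SQL on stdout,
# decompressing it only if it really is gzip data.
dump_reader() {
    if is_gzip "$1"; then
        gzip -dc "$1"
    else
        cat "$1"
    fi
}

# Hypothetical usage when reloading:
#   dump_reader dpm.sql-20111125-v2.gz | mysql -u root -p
```

With a reader like this the reload step works the same whether the dump was piped through gzip or not.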

* Download dpm-drop-requests-tables.sql supplied by Jean Philippe last July

wget http://www.sysadmin.hep.ac.uk/svn/fabric-management/dpm/dpm-drop-requests-tables.sql

* Drop the requests tables. This step is really useful to avoid painful reload times, as I said in this other post about DPM optimization, and because it drastically reduces the size of ibdata1 when you reload, which has benefits of its own (my ibdata1 was reduced from 26GB to 1.7GB). You still need to plan for it because it might take a few hours depending on the system. On my old hardware it took around 7 hours.

mysql -p < dpm-drop-requests-tables.sql

* Dump reduced version of the database

mysqldump -C -Q -u root -p -B dpm_db cns_db > dpm.sql-20111125-v2.gz


* Copy both to a web server from which they can be downloaded at a later stage.
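
Since the dumps travel through a web server before being reloaded, it can be worth recording a checksum next to each file and verifying it after the download. This is a sketch only and not something I did at the time; the helper names write_checksum and verify_checksum are hypothetical, and it assumes sha256sum from coreutils is available on both machines.

```shell
# write_checksum FILE: store the SHA-256 of FILE in FILE.sha256,
# in the same directory, using the bare file name inside the checksum file.
write_checksum() {
    ( cd "$(dirname "$1")" && sha256sum "$(basename "$1")" > "$(basename "$1").sha256" )
}

# verify_checksum FILE: succeed only if FILE still matches FILE.sha256.
verify_checksum() {
    ( cd "$(dirname "$1")" && sha256sum -c "$(basename "$1").sha256" > /dev/null 2>&1 )
}

# Hypothetical usage:
#   old head node:  write_checksum dpm.sql-20111125-v2.gz   (copy both files to the web server)
#   new head node:  verify_checksum dpm.sql-20111125-v2.gz || echo "dump is corrupted"
```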

* Update the local repository for DPM head node and DPM disk servers. Since it is still glite I just had to rsync the latest mirror to the static area.

On the new head node

* Install the new machine with a DPM head node profile. This was again easy: since it is still glite, no changes were required in cfengine.

* Most of the following is not standard, so I put it in a script. If you have problems with user IDs created by the avahi packages, you can uninstall them with yum, removing all the dependencies, and let them be reinstalled by the bdii dependency chain. Uninstalling them with rpm -e --nodeps should also work; this leaves redhat-lsb (which is what the bdii depends on) untouched, but I haven't tried this last method. Here are the commands I executed:

# Get the dpm DB file
rm -rf dpm.sql-20111125-v2.gz*
wget http://ks.tier2.hep.manchester.ac.uk/T2/tmp/dpm.sql-20111125-v2.gz


# Install mysql5.5
rpm -Uvh http://repo.webtatic.com/yum/centos/5/latest.rpm
yum -y remove libmysqlclient5 mysql 'mysql-*'
yum -y clean all

yum -y install mysql55 mysql55-server libmysqlclient5 --enablerepo=webtatic

service mysqld stop

rm -rf /var/lib/mysql/*

# Get the local my.cnf
cfagent -vq

service mysqld start


# Install the DPM rpms
yum -y remove cups avahi avahi-compat-libdns_sd avahi-glib
yum -y install glite-SE_dpm_mysql lcg-CA


# Modify sql scripts for mysql5.5

cd /opt/lcg/share/DPM/
for a in create_dp*.sql; do sed -i.old 's/TYPE/ENGINE/g' "$a"; done
grep ENGINE *


# Run YAIM and upload old DB

cd

/opt/glite/yaim/bin/yaim -c -s /opt/glite/yaim/etc/site-info.def -n glite-SE_dpm_mysql


mysql -u root -p -C < /root/dpm.sql-20111125-v2.gz


# NECESSARY FOR THE FINAL UPDATES

/opt/glite/yaim/bin/yaim -c -s /opt/glite/yaim/etc/site-info.def -n glite-SE_dpm_mysql
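
A note on the sed step in the script above: it replaces every occurrence of TYPE, which is fine as long as the string only appears in the table option that MySQL 5.5 no longer accepts. A more conservative variant only touches a standalone TYPE followed by "=". This is a sketch, assuming GNU sed and that the scripts spell the option as "TYPE = <engine>"; the function name fix_engine_clause is made up for illustration.

```shell
# fix_engine_clause FILE: rewrite the table option "TYPE = <engine>" to
# "ENGINE = <engine>", leaving identifiers that merely contain TYPE
# (e.g. a column called R_TYPE) alone. The original is kept as FILE.old.
fix_engine_clause() {
    sed -E -i.old 's/(^|[^A-Za-z0-9_])TYPE([[:space:]]*=)/\1ENGINE\2/g' "$1"
}

# Hypothetical usage, from the directory holding the create scripts:
#   for a in create_dp*.sql; do fix_engine_clause "$a"; done
#   grep -n 'ENGINE' create_dp*.sql
```

Comparing each file with its .old copy using diff afterwards shows exactly which lines were changed.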


* You will need to install the dpm-contrib-admintool rpm separately because it is not in the glite repository; it might be in the EMI one. Last I heard it had made it to ETICS. If you can't find it, there's still the sysadmin repo version and related notes on the GridPP wiki (Sam or Wahid, you are welcome to leave an update on this one).

* To upgrade the disk servers I just updated the repository, upgraded the rpms and reran YAIM.
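
The disk server step could look roughly like the sketch below. It is an illustration under assumptions rather than the exact commands I ran: it assumes the disk servers use the glite-SE_dpm_disk YAIM node type and the same site-info.def path as the head node, and the run wrapper and DRY_RUN variable are made-up names that make it possible to print the commands before executing anything.

```shell
# DRY_RUN=1 (the default here) only prints the commands;
# set DRY_RUN=0 to actually execute them.
DRY_RUN=${DRY_RUN:-1}

# run CMD [ARGS...]: print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# upgrade_disk_server: refresh yum metadata, upgrade the rpms, rerun YAIM.
upgrade_disk_server() {
    run yum -y clean all
    run yum -y update
    run /opt/glite/yaim/bin/yaim -c -s /opt/glite/yaim/etc/site-info.def -n glite-SE_dpm_disk
}

# Hypothetical usage:
#   upgrade_disk_server              # print what would be done
#   DRY_RUN=0; upgrade_disk_server   # really do it
```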