Thursday, 9 September 2010

Manchester new hardware

We are in the process of installing the new hardware. I knew it was going to be compact, but it is one thing to read on paper that a 2U unit has 48 CPUs and can replace 24 of the old 1U machines, and another to see it. Half of the old cluster's 900-machine grandiosity is gone: 20 of the new machines replace 450 of the old ones in terms of cores. Our new little jewels. :)

The first 3 rows in each rack are the computing nodes; the machines at the bottom are the storage units. The storage has also become unbelievably compact and cheap. When we bought the DELL cluster, 500TB was an enormity and extremely expensive if organised in proper data servers, which is why we tried to use the WNs' disks. The new storage is 540TB of usable space, fits in nine 4U machines and is considered commodity computing nowadays. Well... almost. ;)







Saturday, 4 September 2010

How to enable Atlas squid monitoring

Atlas has started to monitor squids with mrtg.

http://frontier.cern.ch/squidstats/indexatlas.html

mrtg uses snmp, so to enable the monitoring you need your squid instance compiled with --enable-snmp. The CERN binaries are already compiled with that option; the default squid coming with the SL5 OS might not be, and neither might your site's centralised squid service. You don't need snmpd or snmptrapd (net-snmp rpm) running to make it work.
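A quick way to check is to look at the configure options the binary reports (assuming squid is on your PATH; frontier-squid installs may keep it elsewhere, e.g. under the install prefix's sbin directory):

squid -v | grep -- '--enable-snmp'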

Once you are sure the binary is compiled with the right options and that port 3401 is not blocked by any firewall you need to add these lines to squid.conf

acl SNMPHOSTS src 128.142.202.0/24 localhost
acl SNMPMON snmp_community public
snmp_access allow SNMPMON SNMPHOSTS
snmp_access deny all
snmp_port 3401


Again, if you are using the CERN rpms and the default frontier configuration you might not need to do this, as there are already ACL lines for the monitoring.

Reload the configuration

service squid reload

Test it

snmpwalk -v2c -Cc -c public localhost:3401 .1.3.6.1.4.1.3495.1.1

you should get something similar to this:

SNMPv2-SMI::enterprises.3495.1.1.1.0 = INTEGER: 206648
SNMPv2-SMI::enterprises.3495.1.1.2.0 = INTEGER: 500136
SNMPv2-SMI::enterprises.3495.1.1.3.0 = Timeticks: (23672459) 2 days, 17:45:24.59


snmpwalk is part of net-snmp-utils rpm.

It takes a while for the monitoring to catch up. Don't expect an immediate response.

Additional information on squid/snmp can be found here http://wiki.squid-cache.org/Features/Snmp

NOTE: If you are also upgrading pay attention to the fact that in the latest CERN rpms the init script fn-local-squid.sh might try to regenerate your squid.conf.

Tuesday, 31 August 2010

Atlas jobs in Manchester

August has seen a really notable increase in Atlas user pilot jobs: over 34000 jobs, of which more than 12000 just in the last 4 days. Plotting the number of jobs since the beginning of the year, there has been an inversion between production and user pilots.



The trend in August was probably helped by moving all the space to the DATADISK space token and attracting more interesting data. LOCALGROUP is also heavily used in Manchester.

In the past 4 days we have also applied the XFS file system tuning suggested by John, which solves the load on the data servers experienced since upgrading to SL5. The tweak has notably increased the data throughput and reduced the load on the data servers practically to zero, allowing us to increase the number of concurrent jobs. This has allowed a bigger job throughput and has clearly improved job efficiency, leaving the very short jobs (<10 mins CPU time) as the most inefficient, and even for those the improvement is notable, as can be seen from the plots below.


Before applying the tweak




After applying the tweak




This also means we can keep on using XFS for the data servers, which currently gives more flexibility as far as partition sizes are concerned.

Tuning Areca RAID controllers for XFS on SL5

Sites (including Liverpool) running DPM on pool nodes running SL5 with XFS file systems have been experiencing very high load (up to multiple 100s Load Average and close to 100% CPU IO WAIT) when a number of analysis jobs were accessing data simultaneously with rfcp. The exact same hardware and file systems under SL4 had shown no excessive load, and the SL5 systems had shown no problems under system stress testing/burn-in. Also, the problem was occurring from a relatively small number of parallel transfers (about 5 or more on Liverpool's systems were enough to show an increased load compared to SL4).

Some admins have found that using ext4 at least alleviates the problem, although apparently it still occurs under enough load. Migrating production servers with TBs of live data from one FS to another isn't hard but would be a drawn out process for many sites.

The fundamental problem for either FS appears to be IOPS overload on the arrays rather than sheer throughput, although why this is occurring so much under SL5 and not under SL4 is still a bit of a mystery. There may be changes in controller drivers, XFS, kernel block access, DPM access patterns or default parameters. When faced with an IOPS overload (that results in well below the theoretical throughput of the array), one solution is to make each IO operation access more bits from the storage device so that you need to make fewer but larger read requests.

This leads to the actual fix (we have been doing this by default on our 3ware systems but we just assumed the Areca defaults were already optimal):

blockdev --setra 16384 /dev/$RAIDDEVICE

This sets the block device read ahead to (16384/2)kB (8MB). We have previously (on 3ware controllers) had to do this to get the full throughput from the controller. The default on our Areca 1280MLs is 128 (64kB read ahead). So when lots of parallel transfers are occurring our arrays have been thrashing spindles pulling off small 64kB chunks from each different file. These files are usually many hundreds or thousands of MB, where reading MBs at a time would be much more efficient. The mystery for us is more why the SL4 systems *don't* overload rather than why SL5 does, as the SL4 systems use the exact same default values.

Here is a ganglia plot of our pool nodes under about as much load as we can put on them at the moment. Note that previously our SL5 nodes would have LAs in the 10s or 100s under this load or less.

http://hep.ph.liv.ac.uk/~jbland/xfs-fix.html

Any time the systems go above 1 LA now is when they're also having data written at a high rate. On that note, we also hadn't configured our Arecas to have their block max sector size aligned with the RAID chunk size with

echo "64" > /sys/block/$RAIDDEVICE/queue/max_sectors_kb

although we don't think this had any bearing on the overloading and it might not be necessary.

We expect the tweak to also work for systems running ext4, as the underlying hardware access would still be a bottleneck, just at a different level of access. Note that this 'fix' doesn't fix the even more fundamental problem, as pointed out by others, that DPM doesn't rate limit connections to pool nodes. All this fix does is (hopefully) push the current limit where overload occurs above the point that our WNs can pull data. There is also a concern that using a big read ahead may affect small random (RFIO) access, although sites can tune this parameter very quickly to get optimum access. 8MB is slightly arbitrary but 64kB is certainly too small for any sensible access I can envisage to LHC data. Most access is via full file copy (rfcp) reads at the moment.
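One caveat: blockdev --setra is not persistent across reboots, so the setting has to be reapplied at boot. A minimal sketch (the device name is just a placeholder for your array device) is to append it to rc.local:

echo 'blockdev --setra 16384 /dev/sda' >> /etc/rc.local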

Wednesday, 18 August 2010

ext4 vs ext3 round(1)

I started yesterday to look at ext4 vs ext3 performance with iozone. I installed two old Dell WNs with the same file system layout and the same RAID level 0, but one with ext3 and one with ext4 on / and on /scratch, the directories used by the jobs. Both machines have the default mount options and the kernel is 2.6.18-194.8.1.el5.

I performed the tests on the /scratch partition, writing the log in /. I did it twice: once mounting and unmounting the file system at each test, so as to delete any trace of the data from the buffer cache, and once leaving the file system mounted between tests. Tests were automatically repeated for file sizes from 64kB to 4GB and record lengths between 4kB and 16384kB. Iozone automatically doubles the previous size at each test (4GB is the largest size smaller than the 5GB file size limit I set).
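For reference, this kind of sweep can be driven with a command along these lines (a sketch only; the flags shown are the standard iozone ones for an automatic run capped at 5GB, write/read tests, Excel output and a remount between tests, not necessarily the exact invocation used):

iozone -a -g 5g -i 0 -i 1 -f /scratch/iozone.tmp -U /scratch -b ext-results.xls

Dropping -U gives the variant that leaves the file system mounted between tests (the -U mount point needs an /etc/fstab entry for the remount to work).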

From the numbers, ext4 performs much better in writing, while reading is basically the same if not slightly worse for smaller files. There is a big drop in performance for both file systems at the 4GB size.

What I find confusing, however, is that when I repeated the tests with the maximum file size set to 100MB and only write tests enabled, ext3 took less time (22s vs 44s in this case): despite the numbers saying that ext4 writing is almost 40% faster, something slows the tests down (deleting?). The test speeds become similar for sizes >500MB; they both decrease steadily until they finally drop at 4GB for any record length in both file systems.

Below are some results, mostly with the buffer cache, because removing it mostly affects ext3 at small file sizes and record lengths, as shown in the first graph.

EXT3: write (NO buffer cache)
==================



EXT3: write (buffer cache)
==================




EXT4: write (buffer cache)
==================




EXT3: read (buffer cache)
==================




EXT4: read (buffer cache)
==================


Wednesday, 4 August 2010

Biomed VOMS server CA DN has changed

Biomed is opening GGUS tickets for non working sites. Apparently they are getting organised and they have someone to do some sort of support now.

They opened one for Manchester too - actually we were slightly flooded with tickets, so we must have some decent data on the storage.

The problem turned out to be that on 18/06/2010 the biomed VOMS server CA DN changed. If you find messages like this (if you google for them you get only source code entries) in your DPM srmv2.2 log files:

08/03 11:14:30 4863,0 srmv2.2: SRM02 - soap_serve error : [::ffff:134.214.204.110] (kingkong.creatis.insa-lyon.fr) : CGSI-gSOAP running on bohr3226.tier2.hep.manchester.ac.uk reports Error retrieveing the VOMS credentials


then you know you must update the configuration on your system, replacing

/C=FR/O=CNRS/CN=GRID-FR

with

/C=FR/O=CNRS/CN=GRID2-FR

in

/etc/grid-security/vomsdir/biomed/cclcgvomsli01.in2p3.fr.lsc
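For context, an .lsc file contains the DN of the VOMS server's host certificate on the first line, followed by the DN of the CA that issued it. Only the CA line changes here, so the updated file should look like this (keep the host DN line exactly as you already have it):

<existing host DN of cclcgvomsli01.in2p3.fr, unchanged>
/C=FR/O=CNRS/CN=GRID2-FR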


Note: don't forget to update YAIM too if you don't want your change overridden at the next reconfiguration. I updated the YAIM configuration on the GridPP wiki:

http://www.gridpp.ac.uk/wiki/GridPP_approved_VOs#IN2P3_VOMS_server_VOs

Tuesday, 27 July 2010

Moving APEL to SL5

We have moved APEL from the SL4 MON box to the SL5 version that works standalone without RGMA (finally!). The site BDII has also been transferred to this machine from the MON box. This is how it is set up.

*) Request a certificate for the machine if you don't have one already.
*) Kickstart a vanilla SL5 machine with two disks in RAID1.
*) Install mysql-server-5.0.77-3.el5 (it's in the SL5 repository)
*) Remove /var/lib/mysql and recreate it empty (you can skip this, but I had messed around with it earlier and needed a clean dir).
*) Start mysqld

service mysqld start

It will tell you at this point to create the root password.

/usr/bin/mysqladmin -u root password 'pwd-in-site-info.def'
/usr/bin/mysqladmin -u root -h <machine-fqdn> password 'pwd-in-site-info.def'


*) Install the certificate (we have it directly in cfengine).
*) Setup the yum repositories if your configuration tool doesn't do it already

cd /etc/yum.repos.d/
wget http://grid-deployment.web.cern.ch/grid-deployment/glite/repos/3.2/glite-APEL.repo


*) Install glite-APEL

yum install glite-APEL

*) Run yaim: it sets up the database and, most of all, the ACLs. If you have more than one CE to publish you need to run it for each CE, changing the name of the CE in site-info.def, or, if you are comfortable with SQL, you can grant each CE write access yourself (a sketch is shown after the yaim command below).

/opt/glite/yaim/bin/yaim -s /opt/glite/yaim/etc/site-info.def -c -n glite-APEL
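If you take the SQL route for the extra CEs, the grant looks roughly like this (a sketch only: the DB user, password and CE hostname are placeholders; check what YAIM actually created for your first CE with SHOW GRANTS before copying it):

mysql -u root -p -e "GRANT ALL ON accounting.* TO 'accounting'@'ce02.example.org' IDENTIFIED BY 'the-apel-db-password';"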

*) BUG: APEL still uses Java. Any time it is run it creates a Java key store with all the CAs and the host certificate added to it. On your machine you may end up with both the OS Java version and the one you installed (normally 1.6). The tool used to create the keystore file is called by a script without setting the path, so if you have both versions of the command it is likely that the OS one is called, because it resides in /usr/bin. Needless to say, the OS version is older and doesn't have all the options used in the APEL script. There are a number of ways to fix this: I modified the script to insert the absolute path, but you can also change the link target in /usr/bin or add a modified PATH to the APEL cron job. The culprit script is this:

/opt/glite/share/glite-apel-publisher/scripts/key_trust_store_maker.sh

and belongs to

glite-apel-publisher-2.0.12-7.rpm
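Whichever way you do it, the fix boils down to making sure the 1.6 tools are found first. One minimal variant (the JDK path here is an assumption, adjust it to wherever your Java 1.6 actually lives) is to add near the top of that script:

export PATH=/usr/java/latest/bin:$PATH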

The problem is known and apparently a fix is in certification. My ticket is here

https://gus.fzk.de/ws/ticket_info.php?ticket=60452

*) Register the machine in GOCDB making sure you tick glite-APEL and not APEL to mark it as a service.

*) BUG: UK host certificates have an email attribute, and this email has a different format in the output of different clients. When you register the machine, put the host DN in as it is, then open a GGUS ticket for APEL so they can change it internally. This is also known and tracked in the savannah bug below, but at the moment they have to change it manually.

https://savannah.cern.ch/bugs/?70628


*) Dump the DB on the old MON box with mysqldump. I thought I could just tar it up, but it didn't like that, so I used this instead:

mysqldump -C -Q -u root -p accounting | gzip -c > accounting.sql.gz

*) Copy it to the new machine and reload it:

zcat accounting.sql.gz | mysql -u root -p accounting

*) Run APEL manually and see how it goes (command is in the cron job).

If you are happy with it go on with the last two steps, otherwise you have found an additional problem I haven't found.

*) Disable the publisher on the old machine, i.e. remove the cron job.

*) Modify parser-config-yaim.xml for all the CEs so they point to the new machine. The line to modify is

<DBURL>jdbc:mysql://<new-machine-fqdn>:3306/accounting</DBURL>
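On each CE a quick sed does it (a sketch: the hostnames are placeholders and the path shown is the usual location of the PBS parser configuration, adjust it for your setup):

sed -i 's|mysql://old-mon-fqdn:3306|mysql://new-machine-fqdn:3306|' /opt/glite/etc/glite-apel-pbs/parser-config-yaim.xml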

SWITCHING OFF RGMA

When I was happy with the new APEL machine I turned off RGMA and removed it from the services published by the BDII and the GOCDB. This caused the GlueSite object to disappear from our site BDII. You need to have the site BDII in the list of services published before you remove RGMA.

Tuesday, 29 June 2010

Occupying Lancaster's new data centre



Lancaster had a scheduled downtime yesterday to relocate half our storage and CPU to the new High End Computing data centre. The move went very smoothly and both storage and CPU are back online running ATLAS (and other) jobs. The new HEC facility is a significant investment from Lancaster University with the multi million pound building housing central IT facilities, co-location, and HEC data centres.

Lancaster's GridPP operations have a large stake in HEC with Roger Jones (GridPP/ATLAS) taking directorship of the facility. Our future hardware procurement will be located in this room, which has a 35-rack capacity using water-cooled Rittal racks. Below are some photographs of the room as it looks today. Ten racks are in place with two being occupied by GridPP hardware and the remaining eight to be populated in July with £800k worth of compute and storage resource.





Tuesday, 20 April 2010

Scaling, capacity publishing and accounting

Introduction

Our main cluster at Liverpool is homogeneous at present, but that will change shortly. This document is part of our preparation for supporting a heterogeneous cluster. It will help us to find the scaling values and HEPSPEC06 power settings for correct publishing and accounting in a heterogeneous cluster that uses TORQUE/Maui without sub-clustering. I haven't fully tested the work, but I think it's correct and I hope it's useful for sysadmins at other sites.

Clock limits

There are two types of time limit operating on our cluster; a wall clock limit and a CPU time limit. The wall clock limit is the duration in real time that a job can last before it gets killed by the system. The CPU time limit is the amount of time that the job can run on a CPU before it gets killed. Example: Say you have a single CPU system, with two slots on it. When busy, it would be running two jobs. Each job would get about half the CPU. Therefore, it would make good sense to give it a wall clock limit of around twice the CPU limit, because the CPU is divided between two tasks. In systems where you have one job slot per CPU, wall and CPU limits are around the same value.
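In TORQUE terms, the limits in that example would be set per queue with something like this (a sketch; the queue name and the exact values are placeholders):

qmgr -c "set queue long resources_max.cput = 24:00:00"
qmgr -c "set queue long resources_max.walltime = 48:00:00"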

Notes:

  1. In practice, a job may not make efficient use of a CPU – it may wait for input or output to occur. A factor is sometimes applied to the wall clock limit to try to account for CPU inefficiencies. E.g. on our single CPU system, with two job slots on it, we might choose a wall clock limit of twice the CPU limit, then add some margin to the wall clock limit to account for CPU inefficiency.

  2. From now on, I assume that multi-core nodes have a full complement of slots, and are not over-subscribed, i.e. a system with N cores has N slots.

Scaling factors

A scaling factor is a value that makes systems comparable. Say you have systemA, which executes 10 billion instructions per second. Let’s say you have a time limit of 10 hours. We assume one job slot per CPU. Now let us say that, after a year, we add a new system to our cluster, say systemB, which runs 20 billion instructions per second.

What should we do about the time limits to account for the different CPU speeds? We could have separate limits for each system, or we could use a scaling factor. The scaling factor could be equal to 1 on systemA and 2 on systemB. When deciding whether to kill a job, we take the time limit, and divide it by the scaling factor. This would be 10/1 for system A, and 10/2 for systemB. Thus, if the job has been running on systemB for more than 5 hours, it gets killed. We normalised the clock limits and made them comparable.

This can be used to expand an existing cluster with faster nodes, without requiring different clock limits for each node type. You have one limit, with different scaling factors.

Another use for the scaling factor is for accounting

The scaling factor is applied on the nodes themselves. The head node, which dispatches jobs to the worker nodes, doesn't know about the normalised time limits; it regards all the nodes as having the same time limits. The same applies to accounting: the accounting system doesn't know the real length of time spent on a particular system, even though 2 hours on systemB is worth as much work as 4 hours on systemA. It is transparent.

Example: The worker nodes also normalise the accounting figures. Let’s assume a job exists that takes four hours of CPU on systemA. The calculation for the accounting would be: usage = time * scaling factor, yielding 4 * 1 = 4 hours of CPU time (I’ll tell about the units used for CPU usage shortly.) The same job would take 2 hours on systemB. The calculation for the accounting would be: usage = time * scaling factor, yielding 2 * 2 = 4 hours, i.e. though the systemB machine is twice as fast, the figures for accounting still show that the same amount of work is done to complete the job.

The CPUs at a site

The configuration at our site contains these two definitions:

CE_PHYSCPU=776

CE_LOGCPU=776

They describe the number of CPU chips in our cluster and the total number of logical CPUs. The intent here is to model the fact that, very often, each CPU chip has multiple logical “computing units” within it (although not on our main cluster, yet). This may be implemented as multiple-cores or hyperthreading or other things. But the idea is that we have silicon chips with other units inside that do data processing. Anyway, in our case, we have 776 logical CPUs. And we have the same number of physical CPUs because we have 1 core per CPU. We are in the process of moving our cluster to a mixed system, at which time the values for these variables will need to be worked out anew.

Actual values for scaling factors

Now that the difference between physical and logical CPUs is apparent, I'll show ways to work out actual scaling values. The simplest example consists of a homogeneous site without scaling, where some new nodes of a different power need to be added.

Example: Let's imagine a notional cluster of 10 systems with 2 physical CPUs with two cores in each (call this typeC) to which we wish to add 5 systems with 4 physical CPUs with 1 core in each (call these typeD). To work out the scaling factor to use in the new nodes, to make them equivalent to the existing ones, we would use some benchmarking program, say HEPSPEC06, to obtain a value of the power for each system type. We then divide these values by the number of logical CPUs in each system, yielding a “per core” HEPSPEC06 value for each type of system. We can then work out the scaling factor for the new system:

scaling_factor = type_d_per_core_hs06 / type_c_per_core_hs06
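As a purely numerical illustration (these per-core figures are made up, not measured values), per-core strengths of 9.90 HS06 for typeD and 5.32 HS06 for typeC would give:

echo "scale=2; 9.90 / 5.32" | bc    # -> 1.86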

The resulting value is then used in the /var/spool/pbs/mom_priv/config file of the new worker nodes, e.g.

$cpumult 1.86

$wallmult 1.86

These values are taken from an earlier cluster set-up at our site that used scaling. As a more complex example, it would be possible to define some notional reference strength to a round number and scale all CPUs to that value, using a similar procedure as above, but including all the nodes in the cluster, i.e. all nodes would have scaling values. The reference strength would be abstract.

The power at a site

The following information is used in conjunction with the number of CPUs at a site, to calculate the full power.

CE_OTHERDESCR=Cores=1,Benchmark=5.32-HEP-SPEC06

This has got two parts. The first, Cores=, is the number of cores in each physical CPU in our system. In our case, it's exactly 1. But if you have, say, 10 systems with 2 physical CPUs with 2 cores in each, and 5 systems with 4 physical CPUs with 1 core in each, the values would be as follows:

CE_PHYSCPU=(10 x 2) + (5 x 4) = 40

CE_LOGCPU=(10 x 2 x 2) + (5 x 4 x 1) = 60

Therefore, at this site, Cores = CE_LOGCPU/CE_PHYSCPU = 60/40 = 1.5

The 2nd part is Benchmark=. In our case, this is the HEP-SPEC06 value from one of our worker nodes. The HEP-SPEC06 value is an estimate of the CPU strength that comes out of a benchmarking program. In our case, it works out at 5.32, and it is an estimate of the power of one logical CPU. This was easy to work out in a cluster with only one type of hardware. But if you have the notional cluster described above (10 systems with 2 physical CPUs with 2 cores in each, and 5 systems with 4 physical CPUs with 1 core in each) you’d have to compute it more generically, like this:

Using the benchmarking program, find the value for the total HEP-SPEC06 for each type of system in your cluster. Call these the SYSTEM-HEP-SPEC06 for each system type (alternatively, if the benchmark script allows it, compute the PER-CORE-HEP-SPEC06 value for each type of system, and compute the SYSTEM-HEP-SPEC06 value for that system type by multiplying the PER-CORE-HEP-SPEC06 value by the number of cores in that system).

For each type of system, multiply the SYSTEM-HEP-SPEC06 value by the number of systems of that type that you have, yielding SYSTEM-TYPE-HEP-SPEC06 value – the total power of all of the systems of that type that you have in your cluster.

Add all the different SYSTEM-TYPE-HEP-SPEC06 values together, giving the TOTAL-HEP-SPEC06 value for your entire cluster.

Divide the TOTAL-HEP-SPEC06 by the CE_LOGCPU value, giving the average strength of a single core. This is the value that goes in the Benchmark= variable mentioned above. Rationale: this value can be multiplied by CE_LOGCPU to give the full strength, independently of the types of node.
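Putting made-up numbers on the notional cluster above, just to show the arithmetic (these are not real HEPSPEC06 measurements): say typeC systems score 32 HS06 each and typeD systems score 40 HS06 each. Then:

echo "scale=2; (10*32 + 5*40) / 60" | bc    # -> 8.66, the value that would go in Benchmark=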

Stephen Burke explains

Stephen explained how you can calculate your full power, like this:

The installed capacity accounting will calculate your total CPU capacity as LogicalCPUs*Benchmark, so you should make sure that both of those values are right, i.e. LogicalCPUs should be the total number of cores in the sub cluster, and Benchmark should be the *average* HEPSPEC for those cores. (And hence Cores*Benchmark is the average power per physical CPU, although the accounting doesn’t use that directly.) That should be the real power of your system, regardless of any scaling in the batch system.

This means that, if we have the logical CPUs figure right, and the benchmark figure right, then the installed capacity will be published correctly. Basically, Cores * PhysicalCPUs must equal LogicalCPUs, and Cores * Benchmark gives the power per physical CPU.

Configuration variables for power publishing and accounting

I described above how the full strength of the cluster can be calculated. We also want to make sure we can calculate the right amount of CPU used by any particular job, via the logs and the scaled times. The relevant configuration variables in site-info.def are:

CE_SI00=1330

CE_CAPABILITY=CPUScalingReferenceSI00=1330

On our site, they are both the same (1330). I will discuss that in a moment. But first, read what Stephen Burke said about the matter:

If you want the accounting to be correct you .. have to .. scale the times within the batch system to a common reference. If you don’t … what you are publishing is right, but the accounting for some jobs will be under-reported. Any other scheme would result in some jobs being over-reported, i.e. you would be claiming that they got more CPU power than they really did, which is unfair to the users/VOs who submit those jobs.

In this extract, he is talking about making the accounting right in a heterogeneous cluster. We don’t have one yet, but we’ll cover the issues. He also wrote this:

You can separately publish the physical power in SI00 and the scaled power used in the batch system in ScalingReferenceSI00.

Those are the variables used to transmit the physical power and the scaled power.


Getting the power publishing right without sub-clustering

First, I’ll discuss CE_SI00 (SI00). This is used to publish the physical computing power. I’ll show how to work out the value for our site.

Note: Someone has decreed that (for the purposes of accounting) 1 x HEPSPEC06 is equal to 250 x SI00 (SpecInt2k). SpecInt2k is another benchmark program, so I call this accounting unit a bogoSI00, because it is distinct from the real, measured SpecInt2k and it is directly related to the real, HEPSPEC06 value.

I want to convert the average, benchmarked HEPSPEC06 strength of one logical CPU (5.32) into bogoSI00s by multiplying it by 250, giving 1330 on our cluster. As shown above, this value is a physical measure of the strength of one logical CPU, and it can be used to get a physical measure of the strength of our entire cluster by multiplying it by the total number of logical CPUs. The power publishing will be correct when I update the CE_SI00 value in site-info.def, and run YAIM etc.

Getting the accounting right without sub-clustering

Next, I’ll discuss getting the accounting right, which involves the CE_CAPABILITY=CPUScalingReferenceSI00 variable. This value will be used by the APEL accounting system to work out how much CPU has been provided to a given job. APEL will take the duration of the job, and multiply it by this figure to yield the actual CPU provided. As I made clear above, some worker nodes use a scaling factor to compensate for differences between worker nodes, i.e. a node may report adjusted CPU times, such that all the nodes are comparable. By scaling the times, the logs are adjusted to show the extra work that has been done. If the head node queries the worker node to see if a job has exceeded its wall clock limit, the scaled times are used, to make things consistent. This activity is unnoticeable in the head node and the accounting system.

There are several possible ways to decide the CPUScalingReferenceSI00 value, and you must choose one of them. I’ll go through them one at a time.

  • Case 1: First, imagine a homogeneous cluster, where all the nodes are the same and no scaling on the worker nodes takes place at all. In this case, the CPUScalingReferenceSI00 is the same as the value of one logical CPU, measured in HEPSPEC06 and expressed as bogoSI00, i.e. 1330 in our case. The accounting is unscaled, all the logical CPUs/slots/cores are the same, so the reference value is the same as a physical one.

Example: 1 hour of CPU (scaled by 1.0) multiplied by 1330 = 1330 bogoSI00_hours delivered.

  • Case 2: Next, imagine the same cluster, with one difference: you use scaling on the worker nodes to give each node a notional reference strength (e.g. to get round numbers). I might want all my nodes to have their strength normalised to 1000 bogoSI00s (4 x HEPSPEC06). I would use a scaling factor on the worker nodes of 1.33. The time reported for a job would be real_job_duration * 1.33. Thus, I could then set CPUScalingReferenceSI00=1000 to still get accurate accounting figures.

Example: 1 hour of CPU (scaled by 1.33) multiplied by 1000 = 1330 bogoSI00_hours delivered, i.e. no change from case 1.

  • Case 3: Next, imagine the cluster in case 1 (homogeneous cluster, no scaling), where I then add some new, faster nodes, making a heterogeneous cluster. This happened at Liverpool, before we split the clusters up. We dealt with this by using a scaling factor on the new worker nodes, so that they appeared equivalent in terms of CPU delivery to the existing nodes. Thus, in that case, no change was required to the CPUScalingReferenceSI00 value: it remained at 1330.

    Example: 1 hour of CPU (scaled locally by a node-dependent value to make each node equivalent to 1330 bogoSI00) multiplied by 1330 = 1330 bogoSI00_hours delivered. No change from case 1. The scaling is done transparently on the new nodes.

  • Case 4: Another line of attack in a heterogeneous cluster is to use a different scaling factor on each of the different node types, to normalise the different nodes to the same notional reference strength. As the examples above show, whatever reference value is selected, the same number of bogoSI00_hours is delivered in all cases, if the scaling values are applied appropriately on the worker nodes to make them equivalent to the reference strength chosen.

APEL changes to make this possible

Formerly, APEL looked only at the SI00 value. If you scaled in a heterogeneous cluster, you could choose to have your strength correct or your accounting correct but not both.

So APEL has been changed to alter the way the data is parsed out. This allows us to pass the scaling reference value in a variable called CPUScalingReferenceSI00. You may need to update your MON box to make use of this feature.

Written by Steve with help from Rob and John, Liverpool

19 April 2010

Friday, 19 March 2010

Latest HC tests in Manchester

While waiting for the storage we are buying with the current tender, which will bring us to 320TB of usable space, we are fixing the configuration to optimise access to the current 80TB.
So we have cabled the 4 data servers in the configuration that they and their peers will eventually have. The last test

http://gangarobot.cern.ch/hc/1203/test

was showing some progress.

We ran it for 12 hours and had 99% overall efficiency. In particular, compared to test

http://gangarobot.cern.ch/hc/991/test.

the other metrics look slightly better. The most noticeable thing, rather than the plain mean values, is the histogram shape of cpu/wall clock time and events/wall clock: they are much healthier, with a bell shape instead of a U one (i.e. especially in cpu/wall clock we have more predictable behaviour; in this test the tail of jobs towards zero is drastically reduced). This is only one test and we are still affected by a bad distribution of data in DPM, as the data are still mostly concentrated on 2 servers out of 4. There are also other things we can tweak to optimise access. The next steps to do with the same test (for comparison) are:

1) Spread the data more evenly on the data servers if we can; se04 was hammered for a good while and had a load of 80-100 for a few hours according to nagios.

2) Increase the number of jobs that can run at the same time

3) Look at the distribution of jobs on the WNs. This might be useful to know how to do when we have 8 cores rather than two.

4) Look at the job distribution in time.

Thursday, 15 October 2009

Squid cache for atlas on a 32bit machine

I installed the squid cache for atlas on an SL5 32-bit machine. There are no 32-bit rpms from the project. There is a default OS squid rpm, but it is apparently buggy and the request is to install a 2.7.STABLE7 version. So I got the source rpm from here

https://twiki.cern.ch/twiki/bin/view/PDBService/SquidRPMsTier1andTier2

rpmbuild --rebuild frontier-squid-2.7.STABLE7-4.sl5.src.rpm

This will compile squid for your system and create a binary rpm in

/usr/src/redhat/RPMS/i386/frontier-squid-2.7.STABLE7-4.sl5.i386.rpm

rpm -ihv /usr/src/redhat/RPMS/i386/frontier-squid-2.7.STABLE7-4.sl5.i386.rpm

It will install everything in /home/squid - apparently it is relocatable but I don't mind the location so I left it.

Edit /home/squid/etc/squid.conf

Not everything you find in the BNL instructions is necessary. Here is my list of changes

acl SUBNET-NAME src SUBNET-IPS
<---- there are different ways to express this
http_access allow SUBNET-NAME
hierarchy_stoplist cgi-bin ?

cache_mem 256 MB
maximum_object_size_in_memory 128 KB

cache_dir ufs /home/squid/var/cache 100000 16 256

maximum_object_size 1048576 KB

update_headers off

cache_log /home/squid/var/logs/cache.log
cache_store_log none

strip_query_terms off

refresh_pattern -i /cgi-bin/ 0 0% 0

cache_effective_user squid
<--- the default is nobody, which doesn't have access to /home/squid
icp_port 0


Edit /home/squid/sbin/fn-local-squid.sh
Add these two lines

# chkconfig: - 99 21
# description: Squid cache startup script


then

ln -s /home/squid/sbin/fn-local-squid.sh /etc/init.d/squid
chkconfig --add squid
chkconfig squid on


Write to Rod Walker to authorise your machine in the gridka Frontier server (until RAL is up, that's the server for Europe). If you can set up an alias for the machine, do it before writing to him.

To test the setup

wget http://frontier.cern.ch/dist/fnget.py
export http_proxy=http://YOUR-SQUID-CACHE:3128
python fnget.py --url=http://atlassq1-fzk.gridka.de:8021/fzk/Frontier --sql="SELECT TABLE_NAME FROM ALL_TABLES"


if you get many lines similar to those below

COMP200_F0027_TAGS_SEQ
COMP200_F0037_IOVS_SEQ
COMP200_F0020_IOVS_SEQ


your cache is working.

Wednesday, 19 August 2009

Manchester update

I fixed the site BDII problem, i.e. the site static information 'disappeared'. It didn't actually disappear: it was just declared under mds-vo-name=resource instead of mds-vo-name=UKI-NORTHGRID-MAN-HEP, and therefore GSTAT couldn't find it. This was due to a conflict between RGMA and the site BDII. The RGMA BDII (which didn't exist in very old versions) needs to be declared in BDII_REGIONS in YAIM. I knew this, but completely forgot that I had already fixed it when I reinstalled the machine a few months ago, so I spent a delightful afternoon parsing ldif files and ldap output, hacked the ldif, sort of fixed it and then asked for a proper solution. So... here we go, I'm writing it down this time so I can google for myself. On the positive side, I have now upgraded both the site and top BDIIs and the resource BDII on the CEs to the latest version. So we now have shiny new attributes like Spec2006 & Co.
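For my own future googling, the relevant bit of site-info.def looks something like this (a sketch: the hostname is a placeholder and the exact set of regions depends on which services your site runs):

BDII_REGIONS="CE SE MON RGMA"
BDII_RGMA_URL="ldap://mon01.example.org:2170/mds-vo-name=resource,o=grid"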

I also upgraded the CEs, trying to fix the random instability problem which afflicts us. However, I upgraded online without reinstalling everything, and it makes me a bit nervous to think that some files that needed changing might not have been edited because they already existed. So I will completely reinstall the CEs, starting with ce01 today.

Tuesday, 5 May 2009

Howto publish Users DNs accounting records

To publish the User DN records in the accounting you should add to your site-info.def the following line

APEL_PUBLISH_USER_DN="yes"

and reconfigure your MON box. This will change the publisher configuration file

/opt/glite/etc/glite-apel-publisher/publisher-config-yaim.xml

replacing this line

<JoinProcessor publishGlobalUserName="no">

with this one

<JoinProcessor publishGlobalUserName="yes">

This will affect only new records.

If you want to republish everything you need to replace in the same file

<Republish>missing</Republish>

with this line, using the appropriate dates

<Republish recordStart="2006-02-01" recordEnd="2006-04-25">gap</Republish>

and publish a chunk of data at a time. The documentation suggests one month at a time to avoid running out of memory. When you have finished, put back the line

<Republish>missing</Republish>

Wednesday, 29 April 2009

NFS bug in older SL5 kernel

As mentioned previously ( http://northgrid-tech.blogspot.com/2009/03/replaced-nfs-servers.html ) we have recently upgraded our NFS servers and they now run on SL5. Shortly after going into production all LHCb jobs stalled at Manchester and we were blacklisted by the VO.

We were advised that it may be a lockd error, and asked to use the following python code to diagnose this:

-------------------------------------------------------------------
import fcntl
# try to take an exclusive, non-blocking lock on a test file; with a broken lockd
# this call hangs or raises an IOError instead of returning immediately
fp = open("lock-test.txt", "a")
fcntl.lockf(fp.fileno(), fcntl.LOCK_EX|fcntl.LOCK_NB)
-------------------------------------------------------------------


The code did not give any errors and we therefore discounted this as the problem. Wind the clock on a fortnight (including a week's holiday over Easter) and we had still not found the problem, so I tried the above code again and, bingo, lockd was the problem. A quick search of the SL mailing list pointed me to this kernel bug:
https://bugzilla.redhat.com/show_bug.cgi?id=459083

A quick update of the kernel and reboot and the problem was fixed.

Friday, 3 April 2009

Fixed MPI installation

A few months ago we installed MPI using the glite packages and YAIM.

http://northgrid-tech.blogspot.com/2008/11/mpi-enabled.html

We never really tested it until now, though. We have found a few problems with YAIM:

YAIM creates an mpirun script that assumes ./ is in the path, so the job was landing on the WN but mpirun couldn't find the user script/executable. I corrected it by prepending `pwd`/ to the script arguments at the end of the script, so that it runs `pwd`/$@ instead of $@. I added this using the yaim post functionality.

The if-else statement that is used to build MPIEXEC_PATH is written in a contorted way and needs to be corrected. For example:

1) MPI_MPIEXEC_PATH is used in the if, but YAIM doesn't write it in any system file that sets the env variable, like grid-env.sh where the other MPI_* variables are set.

2) In the else statement there is a hardcoded path, which is actually obtained by splitting the mpiexec executable that MPI_MPICH_MPIEXEC points to from its directory.

3) YAIM doesn't rewrite mpirun once it has been written, so the hardcoded path can't be changed by reconfiguring the node without manually deleting mpirun first. This makes it difficult to update or correct mistakes.

4) The existence of MPIEXEC_PATH is not checked, and it should be.

Anyway, eventually we managed to run MPI jobs, and we reported what we had done to the new TMB MPI working group because another site was experiencing the same problems. Hopefully they will correct these problems. Special thanks go to Chris Glasman, who hunted down the initial problem with the path and patiently tested the changes we applied.

Wednesday, 25 March 2009

New Storage and atlas space tokens

We have finally installed all the units. They are ~84TB of usable space. 42TB are dedicated to atlas space tokens, the other 42TB are shared for now but will be moved into atlas space tokens when we see more usage.

We have also finally enabled all the space tokens requested by atlas. They are waiting to be inserted in Tiers of ATLAS (ToA), but below I report what we publish in the BDII.

ldapsearch -x -H ldap://site-bdii.tier2.hep.manchester.ac.uk:2170 -b o=grid '(GlueSALocalID=atlas*)' GlueSAStateAvailableSpace GlueSAStateUsedSpace| grep Glu
dn: GlueSALocalID=atlas,GlueSEUniqueID=dcache01.tier2.hep.manchester.ac.uk,mds
GlueSAStateAvailableSpace: 33411318000
GlueSAStateUsedSpace: 21533521683
dn: GlueSALocalID=atlas,GlueSEUniqueID=dcache02.tier2.hep.manchester.ac.uk,mds
GlueSAStateAvailableSpace: 48171274000
GlueSAStateUsedSpace: 4168774302
dn: GlueSALocalID=atlas:ATLASGROUPDISK:online,GlueSEUniqueID=bohr3223.tier2.he
GlueSAStateAvailableSpace: 1610612610
GlueSAStateUsedSpace: 125
dn: GlueSALocalID=atlas:ATLASPRODDISK:online,GlueSEUniqueID=bohr3223.tier2.hep
GlueSAStateAvailableSpace: 2154863252
GlueSAStateUsedSpace: 44160003
dn: GlueSALocalID=atlas:ATLASSCRATCHDISK:online,GlueSEUniqueID=bohr3223.tier2.
GlueSAStateAvailableSpace: 3298534820
GlueSAStateUsedSpace: 62
dn: GlueSALocalID=atlas:ATLASLOCALGROUPDISK:online,GlueSEUniqueID=bohr3223.tie
GlueSAStateAvailableSpace: 3298534883
GlueSAStateUsedSpace: 0
dn: GlueSALocalID=atlas:ATLASDATADISK:online,GlueSEUniqueID=bohr3223.tier2.hep
GlueSAStateAvailableSpace: 8052760941
GlueSAStateUsedSpace: 302738
dn: GlueSALocalID=atlas,GlueSEUniqueID=bohr3223.tier2.hep.manchester.ac.uk,mds
GlueSAStateAvailableSpace: 28580000000
GlueSAStateUsedSpace: 709076288
dn: GlueSALocalID=atlas:ATLASMCDISK:online,GlueSEUniqueID=bohr3223.tier2.hep.m
GlueSAStateAvailableSpace: 3298534758
GlueSAStateUsedSpace: 125
dn: GlueSALocalID=atlas:ATLASPRODDISK:online,GlueSEUniqueID=bohr3226.tier2.hep
GlueSAStateAvailableSpace: 2199023130
GlueSAStateUsedSpace: 125
dn: GlueSALocalID=atlas:ATLASLOCALGROUPDISK:online,GlueSEUniqueID=bohr3226.tie
GlueSAStateAvailableSpace: 3298534883
GlueSAStateUsedSpace: 0
dn: GlueSALocalID=atlas:ATLASGROUPDISK:online,GlueSEUniqueID=bohr3226.tier2.he
GlueSAStateAvailableSpace: 1610612610
GlueSAStateUsedSpace: 125
dn: GlueSALocalID=atlas:ATLASMCDISK:online,GlueSEUniqueID=bohr3226.tier2.hep.m
GlueSAStateAvailableSpace: 3298534758
GlueSAStateUsedSpace: 125
dn: GlueSALocalID=atlas:ATLASDATADISK:online,GlueSEUniqueID=bohr3226.tier2.hep
GlueSAStateAvailableSpace: 8053063554
GlueSAStateUsedSpace: 125
dn: GlueSALocalID=atlas:ATLASSCRATCHDISK:online,GlueSEUniqueID=bohr3226.tier2.
GlueSAStateAvailableSpace: 3298534820
GlueSAStateUsedSpace: 62
dn: GlueSALocalID=atlas,GlueSEUniqueID=bohr3226.tier2.hep.manchester.ac.uk,mds
GlueSAStateAvailableSpace: 35730390000
GlueSAStateUsedSpace: 1172758

Tuesday, 24 March 2009

Replaced NFS servers

The NFS servers have been replaced in Manchester with two more powerful machines and two 1TB raided SATA disks. This should hopefully put a stop to the space problems we have suffered in the past few months with both atlas and lhcb, and should also allow us to keep a few more releases than before.

We also have nice nagios graphs to monitor the space now, as well as cfengine alerts.

http://tinyurl.com/d5n7eo

Thursday, 5 March 2009

Machine room update

After some sweet-talking we managed to get two extra air-con units installed in our old machine room. This room houses our 2005 cluster and our more recent CPU and storage purchased last year. The extra cooling was noticeable and allowed us to switch on a couple of racks which were otherwise offline.


In other news, the new data centre is coming along nicely and will be ready for handover three or four months from now. If you're ever racing past Lancaster on the M6 you'll get a good view of the Borg mothership on the hill; the sleek black cladding is going up now...

Friday, 13 February 2009

This week's DPM troubles at Lancaster.

We've had an interesting time this week in Lancaster: a tale of Gremlins, Greedy Daemons and Magical Faeries who come in the night and fix your DPM problems.

On Tuesday evening, when we'd all gone home for the night, the DPM srmv1 daemon (and to a lesser extent the srmv2.2 and dpm daemons) started gobbling up system resources, sending our headnode into a swapping frenzy. There are known memory leak problems in the DPM code, and we've been victims of them before, but in those instances we've always been saved by a swift restart of the affected services and the worst that happened was a sluggish DPM. This time the DPM services completely froze up, and around 7 pm we started failing tests.

So coming into this disaster on Wednesday morning we leaped into action. Restarting the services fixed the load on the headnode, but the DPM still wouldn't work. Checking the logs showed that all requests were being queued, apparently forever. The trail led to some error messages in mysqld.log:

090211 12:05:37 [ERROR] /usr/libexec/mysqld: Lock wait timeout exceeded;
try restarting transaction
090211 12:05:37 [ERROR] /usr/libexec/mysqld: Sort aborted

The oracle Google indicated that these kinds of errors are indicative of a mysql server left in a bad state after suddenly losing connection to a client without accounting for it. Various restarts, reboots and threats were used, but nothing would get the DPM working and we had to go into downtime.

Rather than dive blindly into the bowels of the DPM backend mysql, we got in contact with the DPM developers on the DPM support list. They were really quick to respond, and after receiving 40MB of (zipped!) log files from us they set to work developing a strategy to fix us. It appears that our mysql had grown much larger than it should have, "bloating" with historical data, which contributed to it getting into a bad state and made the task of repairing the database harder - partly because we simply couldn't restore from backups, as these too would be "bloated".

After a while of bashing our heads, scouring logs and waiting for news from the DPM chaps, we decided to make use of the downtime and upgrade the RAM on our headnode to 4GB (from 2), a task we had been saving for the scheduled downtime when we finally upgrade to the Holy Grail that is DPM 1.7.X. So we slapped in the RAM, brought the machine up cleanly, and left it.

A bit over an hour after it came up from the upgrade, the headnode started working again. As if by magic. Nothing notable in the logs, it just started working again. The theory is that the added RAM allowed the mysql to chug through a backlog of requests and start working again. But that's just speculation. The DPM chaps are still puzzling over what happened, and our databases are still bloated, but the crisis is over (for now).

So there are 2 morals to this tale:
1) I wouldn't advise running a busy DPM headnode with less than 4GB of RAM; it leads to unpredictable behaviour.
2) If you get stuck in an Unscheduled Downtime you might as well make use of it to do any work, you never know when something magical might happen!

Monday, 9 February 2009

Jobmanager pbsqueue cache locked

Spent last week tracking down a problem where jobs were finishing in the batch system but the jobmanager wasn't recognising this. This meant that jobs never 'completed', which had two major impacts: 1. Steve's test jobs all failed through timeouts, and 2. Atlas production stopped because it looked like the pilots never completed and no further pilots were sent.

Some serious detective work was undertaken by Maarten and Andrey, and it turned out the pbs queue cache wasn't being updated due to a stale lock file in the ~/.lcgjm/ directory. The lock files can be found with this on the CE:

find /home/*/.lcgjm/pbsqueue.cache.proc.{hold.localhost.*,locked} -mtime +7 -ls
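Once you have checked what that find matches, the cure is presumably just to remove the stale files; a variant of the same command can do it (GNU find's -delete, use with care):

find /home/*/.lcgjm/pbsqueue.cache.proc.{hold.localhost.*,locked} -mtime +7 -delete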

We had 6 users affected (alas, our important ones!), all with lock files dated Dec 22. Apparently the lcgpbs script Helper.pm would produce these whenever hostname returned 'localhost'. Yes, on December 22 we had maintenance work with DHCP unavailable, and for some brief period the CE hostname was 'localhost'. Note this is lcg-CE under glite-3.1. Happy days are here again!