Thursday, 9 September 2010
The first 3 rows in each rack are the computing nodes, the machines at the bottom are the storage units. The storage also has become unbelievably compact and cheap. When we bought the DELL cluster 500TB was an enormity and extremely expensive if organised in proper data servers and this is why we tried to use the WNs disks. The new storage is 540TB of usable storage, fits in 9 4U machines and is considered commodity computing nowadays. Well... almost. ;)
Saturday, 4 September 2010
mrtg uses snmp. So to enable the monitoring you need your squid instance compiled with --enable-snmp. CERN binaries are already compiled with that option the default squid coming with SL5 OS might not, your site centralised squid service might not. You don't need snmpd or snmptracd (net-snmp rpm) running to make it work.
Once you are sure the binary is compiled with the right options and that port 3401 is not blocked by any firewall you need to add these lines to squid.conf
acl SNMPHOSTS src 188.8.131.52/24 localhost
acl SNMPMON snmp_community public
snmp_access allow SNMPMON SNMPHOSTS
snmp_access deny all
again if you are using the CERN rpms and the default frontier configuration you might not need to do that as there are already ACL lines for the monitoring.
Reload the configuration
service squid reload
snmpwalk -v2c -Cc -c public localhost:3401 .184.108.40.206.4.1.3495.1.1
you should get something similar to this:
SNMPv2-SMI::enterprises.34220.127.116.11.0 = INTEGER: 206648
SNMPv2-SMI::enterprises.3418.104.22.168.0 = INTEGER: 500136
SNMPv2-SMI::enterprises.3422.214.171.124.0 = Timeticks: (23672459) 2 days, 17:45:24.59
snmpwalk is part of net-snmp-utils rpm.
It takes a while for the monitoring to catch up. Don't expect an immediate response.
Additional information on squid/snmp can be found here http://wiki.squid-cache.org/Features/Snmp
NOTE: If you are also upgrading pay attention to the fact that in the latest CERN rpms the init script fn-local-squid.sh might try to regenerate your squid.conf.
Tuesday, 31 August 2010
The trend in August was probably helped by moving all the space to the DATADISK space token and attracting more interesting data. LOCALGROUP is also heavily used in Manchester.
In the past 4 days we also have applied the XFS file system tuning suggested by John that solves the load on the data servers experienced since upgrading to SL5. The tweak has increased notably the data throughput and reduced the load on the data servers practically to zero allowing us to increase the number of concurrent jobs. This has allowed a bigger job throughput and has had a clear improvement on the job efficiency isolating as most inefficient the very short ones (<10 mins CPU time) and even then the improvement is also notable as it is possible to see from the plots below.
Before applying the tweak
After applying the tweak
This also means we can keep on using XFS for the data servers which has currently more flexibility as far as partition sizes are concerned.
Sites (including Liverpool) running DPM on pool nodes running SL5 with XFS file systems have been experiencing very high (up to multiple 100s Load Average and close to 100% CPU IO WAIT) load when a number of analysis jobs were accessing data simultaneously with rfcp.The exact same hardware and file systems under SL4 had shown no excessive load, and the SL5 systems had shown no problems under system stress testing/burn-in. Also, the problem was occurring from a relatively small number of parallel transfers (about 5 or more on Liverpool's systems were enough to show an increased load compared to SL4).Some admins have found that using ext4 at least alleviates the problem although apparently it still occurs under enough load. Migrating production servers with TBs of live data from one FS to another isn't hard but would be a drawn out process for many sites.The fundamental problem for either FS appears to be IOPS overload on the arrays rather than sheer throughput, although why this is occurring so much under SL5 and not under SL4 is still a bit of a mystery. There may be changes in controller drivers, XFS, kernel block access, DPM access patterns or default parameters.When faced with an IOPS overload (that's resulting well below the theoretical throughput of the array) one solution is to make each IO operation access more bits from the storage device so that you need to make fewer but larger read requests.This leads to the actual fix (we have been doing this by default on our 3ware systems but we just assumed the Areca defaults were already optimal).
blockdev --setra 16384 /dev/$RAIDDEVICEThis sets the block device read ahead to (16384/2)kB (8MB). We have previously (on 3ware controllers) had to do this to get the full throughput from the controller. The default on our Areca 1280MLs is 128 (64kB read ahead). So when lots of parallel transfers are occurring our arrays have been thrashing spindles pulling off small 64kB chunks from each different file. These files are usually many hundreds or thousands of MB where reading MBs at a time would be much more efficient.The mystery for us is more why the SL4 systems *don't* overload rather than why SL5 does, as the SL4 systems use the exact same default values.Here is a ganglia plot of our pool nodes under about as much load as we can put on them at the moment. Note that previously our SL5 nodes would have LAs in the 10s or 100s under this load or less.http://hep.ph.liv.ac.uk/~jbland/xfs-fix.htmlAny time the systems go above 1LA now is when they're also having data written at a high rate. On that note we also hadn't configured our Arecas to have their block max sector size aligned with the RAID chunk size withecho "64" > /sys/block/$RAIDDEVICE/queue/max_sectors_kbalthough we don't think this had any bearing on the overloading and might not be necessary.
We expect the tweak to also work for systems running ext4 as the underlying hardware access would still be a bottle neck, just at a different level of access.Note that this 'fix' doesn't fix the even more fundamental problem as pointed out by others that DPM doesn't rate limit connections to pool nodes. All this fix does is (hopefully) push the current limit where overload occurs above the point that our WNs can pull data.There is also a concern that using a big read ahead may affect small random (RFIO) access although the sites can tune this parameter very quickly to get optimum access. 8MB is slightly arbitrary but 64kB is certainly too small for any sensible access I can envisage to LHC data. Most access is via full file copy (rfcp) reads at the moment.
Wednesday, 18 August 2010
I performed the tests on /scratch partition writing the log in /. I did it twice one mounting and unmounting the fs at each test so to delete any trace of information from the buffer cache and one leaving the fs mounted between tests. Tests were automatically repeated for sizes from 64kB to 4GB and record length between 4kB - 16384kB. Iozone automatically doubles the previous sizes at each test (4GB is the smallest multiples smaller than the 5GB file size limit I set).
From the numbers ext4 performs much better in writing while reading is basically the same if not slightly worst for smaller files. There is a big drop in performance for both file systems for the 4GB size.
What however I find confusing is that I did the tests again setting the max size of the file at 100M and doing only write tests and ext3 takes less time despite (22 secs vs 44s in this case) despite the numbers saying that writing is almost 40% faster there is something that slows the tests down (deleting?). Speed of tests become similar for sizes >500MB they both decrease steadily until they finally drop at 4GB for any record length in both file systems.
Below some results mostly with the buffer cache because not having it affects mostly ext3 for small sizes of file and rec length as shown in the first graph.
EXT3: write (NO buffer cache)
EXT3: write (buffer cache)
EXT4: write (buffer cache)
EXT3: read (buffer cache)
EXT4: read (buffer cache)
Wednesday, 4 August 2010
They opened one for Manchester too - actually we were slightly flooded with tickets we must have some decent data on the storage.
The problem turned out to be that on the 18/6/2010 the biomed VOMS server CA DN has changed. If you find these messages (if you google for them you get only source code entries) on your DPM srmv2.2 log files:
08/03 11:14:30 4863,0 srmv2.2: SRM02 - soap_serve error : [::ffff:126.96.36.199] (kingkong.creatis.insa-lyon.fr) : CGSI-gSOAP running on bohr3226.tier2.hep.manchester.ac.uk reports Error retrieveing the VOMS credentials
than you know you must update the configuration on your system replacing
Note: don't forget YAIM too if you don't want to override. I updated the YAIM configuration on the GridPP wiki
Tuesday, 27 July 2010
*) Request a certificate for the machine if you don't have one already.
*) Kickstart a machine vanilla SL5, two raid1 disks.
*) Install mysql-server-5.0.77-3.el5 (it's in the SL5 repository)
*) Remove /var/lib/mysql and recreated it empty (you can skip this but I messed around with it earlier and needed a clean dir).
*) Start mysqld
service mysqld start
It will tell you at this point to create the root password.
/usr/bin/mysqladmin -u root password 'pwd-in-site-info.def''
/usr/bin/mysqladmin -u root -h <machine-fqdn> password 'pwd-in-site-info.def'
*) Install the certificate (we have it directly in cfengine).
*) Setup the yum repositories if your configuration tool doesn't do it already
*) Install glite-APEL
yum install glite-APEL
*) Run yaim: it sets up the database and most of all the ACL, if you have more than one CE to publish you need to run it for each CE changing the name of the CE in site-info.def or if you are skilled with SQL you need to set the permissions for each CE to have write access.
/opt/glite/yaim/bin/yaim -s /opt/glite/yaim/etc/site-info.def -c -n glite-APEL
*) BUG: APEL still uses JAVA. Anytime it is run it creates a JAVA key store with all the CAs and host certificate added to it. It might happen that on your machine you get the OS JAVA version and the one you install (normally 1.6). The tool used to create the keystore file is called by a script without setting the path so if you have both versions of the command it is likely that the OS one is called because it resides in /usr/bin. Useless to say the OS version is older and doesn't have all the options used in the APEL script. There are a number of ways to fix this I modified the script to insert absolute path, you can change the link target in /usr/bin or you can add a modified path to the apel cron job. The culprit script is this:
and belongs to
The problem is known and apparently a fix is in certification. My ticket is here
*) Register the machine in GOCDB making sure you tick glite-APEL and not APEL to mark it as a service.
*) BUG: UK host certificates have an email attribute. This email has a different format in the output of different clients. When you register the machine put the host DN as it is. Then open a GGUS ticket for APEL so they can change it internally. This is also known and followed in this savannah bug but at the moment they have to change it manually. Below the savannah bug.
*) Dump the DB on on the old MON box with mysqldump. I thought I could tar it up but it didn't like it so I used this instead.
mysqldump -C -Q -u root -p accounting | gzip -c > accounting.sql.gz
*) Copy to and reload on the new machine
zcat accounting.sql.gz | mysql -u root -p accounting
*) Run APEL manually and see how it goes (command is in the cron job).
If you are happy with it go on with the last two steps, otherwise you have found an additional problem I haven't found.
*) Disable the publisher on the old machine, i.e. remove the cron job.
*) Modify parser-config-yaim.xml for all the CEs so they point to the new machine. The line to modify is
SWITCHING OFF RGMA
When I was happy with the new APEL machine I turned off RGMA and removed it from the services published by the BDII and the GOCDB. This caused the GlueSite object to disappear from our site BDII. You need to have the site BDII in the list of services published before you remove RGMA.
Tuesday, 29 June 2010
Lancaster had a scheduled downtime yesterday to relocate half our storage and CPU to the new High End Computing data centre. The move went very smoothly and both storage and CPU are back online running ATLAS (and other) jobs. The new HEC facility is a significant investment from Lancaster University with the multi million pound building housing central IT facilities, co-location, and HEC data centres.
Lancaster's GridPP operations have a large stake in HEC with Roger Jones (GridPP/ATLAS) taking directorship of the facility. Our future hardware procurement will be located in this room which has a 35-rack capacity using water-cooled Rittel racks. Below are some photographs of the room as it looks today. Ten racks are in place with two being occupied by GridPP hardware and the remaining eight to be populated in July with £800k worth of compute and storage resource.
Tuesday, 20 April 2010
Our main cluster at Liverpool is homogeneous at present, but that will change shortly. This document is part of our preparation for supporting a heterogeneous cluster. It will help us to find the scaling values and HEPSPEC06 power settings for correct publishing and accounting in a heterogeneous cluster that uses TORQUE/Maui without sub-clustering. I haven't fully tested the work, but I think it's correct and I hope it's useful for sysadmins at other sites.
There are two types of time limit operating on our cluster; a wall clock limit and a CPU time limit. The wall clock limit is the duration in real time that a job can last before it gets killed by the system. The CPU time limit is the amount of time that the job can run on a CPU before it gets killed. Example: Say you have a single CPU system, with two slots on it. When busy, it would be running two jobs. Each job would get about half the CPU. Therefore, it would make good sense to give it a wall clock limit of around twice the CPU limit, because the CPU is divided between two tasks. In systems where you have one job slot per CPU, wall and CPU limits are around the same value.
In practise, a job may not make efficient use of a CPU – it may wait for input or output to occur. A factor is sometimes applied to the wall clock limit to try to account for CPU inefficiencies. E.g. on our single CPU system, with two job slots on it, we might choose a wall clock limit of twice the CPU limit, then add some margin to the wall clock limit to account for CPU inefficiency.
From now on, I assume that multi-core nodes have a full complement of slots, and are not over-subscribed, i.e. a system with N cores has N slots.
A scaling factor is a value that makes systems comparable. Say you have systemA, which executes 10 billion instructions per second. Let’s say you have a time limit of 10 hours. We assume one job slot per CPU. Now let us say that, after a year, we add a new system to our cluster, say systemB, which runs 20 billion instructions per second.
What should we do about the time limits to account for the different CPU speeds? We could have separate limits for each system, or we could use a scaling factor. The scaling factor could be equal to 1 on systemA and 2 on systemB. When deciding whether to kill a job, we take the time limit, and divide it by the scaling factor. This would be 10/1 for system A, and 10/2 for systemB. Thus, if the job has been running on systemB for more than 5 hours, it gets killed. We normalised the clock limits and made them comparable.
This can be used to expand an existing cluster with faster nodes, without requiring different clock limits for each node type. You have one limit, with different scaling factors.
Another use for the scaling factor is for accounting
The scaling factor is applied on the nodes themselves. The head node, that dispatches jobs to the worker nodes, doesn’t know about the normalised time limits. It regards all the nodes to have the same time limits. The same applies to accounting. The accounting system doesn’t know the real length of time spent by a particular system, even though 2 hours on systemB is worth as much work as 4 hours on systemA. It is transparent.
Example: The worker nodes also normalise the accounting figures. Let’s assume a job exists that takes four hours of CPU on systemA. The calculation for the accounting would be: usage = time * scaling factor, yielding 4 * 1 = 4 hours of CPU time (I’ll tell about the units used for CPU usage shortly.) The same job would take 2 hours on systemB. The calculation for the accounting would be: usage = time * scaling factor, yielding 2 * 2 = 4 hours, i.e. though the systemB machine is twice as fast, the figures for accounting still show that the same amount of work is done to complete the job.
The CPUs at a site
The configuration at our site contains these two definitions:
They describe the number of CPU chips in our cluster and the total number of logical CPUs. The intent here is to model the fact that, very often, each CPU chip has multiple logical “computing units” within it (although not on our main cluster, yet). This may be implemented as multiple-cores or hyperthreading or other things. But the idea is that we have silicon chips with other units inside that do data processing. Anyway, in our case, we have 776 logical CPUs. And we have the same number of physical CPUs because we have 1 core per CPU. We are in the process of moving our cluster to a mixed system, at which time the values for these variables will need to be worked out anew.
Actual values for scaling factors
Now that the difference between physical and logical CPUs is apparent, I'll show ways to work out actual scaling values. The simplest example consists of a homogeneous site without scaling, where some new nodes of a different power need to be added.
Example: Let's imagine a notional cluster of 10 systems with 2 physical CPUs with two cores in each (call this typeC) to which we wish to add 5 systems with 4 physical CPUs with 1 core in each (call these typeD). To work out the scaling factor to use in the new nodes, to make them equivalent to the existing ones, we would use some benchmarking program, say HEPSPEC06, to obtain a value of the power for each system type. We then divide these values by the number of logical CPUs in each system, yielding a “per core” HEPSPEC06 value for each type of system. We can then work out the scaling factor for the new system:
The resulting value is then used in the /var/spool/pbs/mom_priv/config file of the new worker nodes, e.g.
These values are taken from an earlier cluster set-up at our site that used scaling. As a more complex example, it would be possible to define some notional reference strength to a round number and scale all CPUs to that value, using a similar procedure as above, but including all the nodes in the cluster, i.e. all nodes would have scaling values. The reference strength would be abstract.
The power at a site
The following information is used in conjunction with the number of CPUs at a site, to calculate the full power.
This has got parts. The first, Cores=, is the number of cores in each physical CPU in our system. In our case, it’s exactly 1. But if you have, say, 10 systems with 2 physical CPUs with 2 cores in each, and 5 systems with 4 physical CPUs with 1 core in each, the values would be as follows:
CE_PHYSCPU=(10 x 2) + (5 x 4) = 40
CE_LOGCPU=(10 x 2 x 2) + (5 x 4×1) = 60
Therefore, at this site, Cores = CE_LOGCPU/CE_PHYSCPU = 60/40 = 1.5
The 2nd part is Benchmark=. In our case, this is the HEP-SPEC06 value from one of our worker nodes. The HEP-SPEC06 value is an estimate of the CPU strength that comes out of a benchmarking program. In our case, it works out at 5.32, and it is an estimate of the power of one logical CPU. This was easy to work out in a cluster with only one type of hardware. But if you have the notional cluster described above (10 systems with 2 physical CPUs with 2 cores in each, and 5 systems with 4 physical CPUs with 1 core in each) you’d have to compute it more generically, like this:
Using the benchmarking program, find the value for the total HEP-SPEC06 for each type of system in your cluster. Call these the SYSTEM-HEP-SPEC06 for each system type (alternatively, if the benchmark script allows it, compute the PER-CORE-HEP-SPEC06 value for each type of system, and compute the SYSTEM-HEP-SPEC06 value for that system type by multiplying the PER-CORE-HEP-SPEC06 value by the number of cores in that system).
For each type of system, multiply the SYSTEM-HEP-SPEC06 value by the number of systems of that type that you have, yielding SYSTEM-TYPE-HEP-SPEC06 value – the total power of all of the systems of that type that you have in your cluster.
Add all the different SYSTEM-TYPE-HEP-SPEC06 values together, giving the TOTAL-HEP-SPEC06 value for your entire cluster.
Divide the TOTAL-HEP-SPEC06 by the CE_LOGCPU value, giving an average strength of a single core. This is the value that goes in the Benchmark= variable, mentioned above. Rational: this value could be multiplied by the CE_LOGCPU to give the full strength, independently of the types of node.
Stephen Burke explains
Stephen explained how you can calculate your full power, like this:
The installed capacity accounting will calculate your total CPU capacity as LogicalCPUs*Benchmark, so you should make sure that both of those values are right, i.e. LogicalCPUs should be the total number of cores in the sub cluster, and Benchmark should be the *average* HEPSPEC for those cores. (And hence Cores*Benchmark is the average power per physical CPU, although the accounting doesn’t use that directly.) That should be the real power of your system, regardless of any scaling in the batch system.
This means that, if we have the logical CPUs figure right, and the benchmark figure right, then the installed capacity will be published correctly. Basically, Cores * PhysicalCPUs must equal LogicalCPUs, and Cores * Benchmark gives the power per physical CPU.
Configuration variables for power publishing and accounting
I described above how the full strength of the cluster can be calculated. We also want to make sure we can calculate the right amount of CPU used by any particular job, via the logs and the scaled times. The relevant configuration variables in site-info.def are:
On our site, they are both the same (1330). I will discuss that in a moment. But first, read what Stephen Burke said about the matter:
If you want the accounting to be correct you .. have to .. scale the times within the batch system to a common reference. If you don’t … what you are publishing is right, but the accounting for some jobs will be under-reported. Any other scheme would result in some jobs being over-reported, i.e. you would be claiming that they got more CPU power than they really did, which is unfair to the users/VOs who submit those jobs.
In this extract, he is talking about making the accounting right in a heterogeneous cluster. We don’t have one yet, but we’ll cover the issues. He also wrote this:
You can separately publish the physical power in SI00 and the scaled power used in the batch system in ScalingReferenceSI00.
Those are the variables used to transmit the physical power and the scaled power.
Getting the power publishing right without sub-clustering
First, I’ll discuss CE_SI00 (SI00). This is used to publish the physical computing power. I’ll show how to work out the value for our site.
Note: Someone has decreed that (for the purposes of accounting) 1 x HEPSPEC06 is equal to 250 x SI00 (SpecInt2k). SpecInt2k is another benchmark program, so I call this accounting unit a bogoSI00, because it is distinct from the real, measured SpecInt2k and it is directly related to the real, HEPSPEC06 value.
I want to convert the average, benchmarked HEPSPEC06 strength of one logical CPU (5.32) into bogoSI00s by multiplying it by 250, giving 1330 on our cluster. As shown above, this value is a physical measure of the strength of one logical CPU, and it can be used to get a physical measure of the strength of our entire cluster by multiplying it by the total number of logical CPUs. The power publishing will be correct when I update the CE_SI00 value in site-info.def, and run YAIM etc.
Getting the accounting right without sub-clustering
Next, I’ll discuss getting the accounting right, which involves the CE_CAPABILITY=CPUScalingReferenceSI00 variable. This value will be used by the APEL accounting system to work out how much CPU has been provided to a given job. APEL will take the duration of the job, and multiply it by this figure to yield the actual CPU provided. As I made clear above, some worker nodes use a scaling factor to compensate for differences between worker nodes, i.e. a node may report adjusted CPU times, such that all the nodes are comparable. By scaling the times, the logs are adjusted to show the extra work that has been done. If the head node queries the worker node to see if a job has exceeded its wall clock limit, the scaled times are used, to make things consistent. This activity is unnoticeable in the head node and the accounting system.
There are several possible ways to decide the CPUScalingReferenceSI00 value, and you must choose one of them. I’ll go through them one at a time.
Case 1: First, imagine a homogeneous cluster, where all the nodes are the same and no scaling on the worker nodes takes place at all. In this case, the CPUScalingReferenceSI00 is the same as the value of one logical CPU, measured in HEPSPEC06 and expressed as bogoSI00, i.e. 1330 in our case. The accounting is unscaled, all the logical CPUs/slots/cores are the same, so the reference value is the same as a physical one.
Example: 1 hour of CPU (scaled by 1.0) multiplied by 1300 = 1300 bogoSI00_hours delivered.
Case 2: Next, image the same cluster, with one difference – you use scaling on the worker nodes to give the node a notional reference strength (e.g. to get round numbers). I might want all my nodes to have their strength normalised to 1000 bogoSI00s (4 x HEPSPEC06). I would use a scaling factor on the worker nodes of 1.33. The time reported for a job would be real_job_duration * 1.33. Thus, I could then set the CPUScalingReferenceSI00=1000 to still get accurate accounting figures.
Example: 1 hour of CPU (scaled by 1.33) multiplied by 1000 = 1300 bogoSI00_hours delivered, i.e. no change from case 1.
Case 3: Next, imagine the cluster in case 1 ( homogeneous cluster, no scaling), where I then add some new, faster nodes, making a heterogeneous cluster. This happened at Liverpool, before we split the clusters up. We dealt with this by using a scaling factor on the new worker nodes, so that they appeared equivalent in terms of CPU delivery to the existing nodes. Thus, in that case, no change was required to the CPUScalingReferenceSI00 value – it remained at 1330.
Example: 1 hour of CPU (scaled locally by node-dependent value to make each node equivalent to 1330 bogoSI00) multiplied by 1300 = 1300 bogoSI00_hours delivered. No change from case 1. The scaling is done transparently on the new nodes.
Case 4: Another line of attack in a heterogeneous cluster is to use a different scaling factor on all the different node types, to normalise the different nodes to the same notional reference strength. As the examples above show, whatever reference value is selected, the same number of bogoSI00_hours is delivered in all cases, if the scaling values are applied appropriately on the worker nodes to make them equivalent to the reference strength choosen.
APEL changes to make this possible
Formerly, APEL looked only at the SI00 value. If you scaled in a heterogeneous cluster, you could choose to have your strength correct or your accounting correct but not both.
So APEL has been changed to alter the way the data is parsed out. This allows us to pass the scaling reference value in a variable called CPUScalingReferenceSI00. You may need to update your MON box to make use of this feature.
Written by Steve with help from Rob and John, Liverpool
19 April 2010
Friday, 19 March 2010
So we have cabled the 4 data servers in the configuration they and their peers will have eventually. The last test
was showing some progress.
We run it for 12 hours and we had 99% overall efficiency. In particular if compared to test
the other metrics look slightly better. The most noticeable thing, rather than the plain mean values, is the histogram shape of cpu/wall clock time and events/wallclock. They are much healthier with a bell shape instead of a U one. (i.e. especially in the cpu/wall clock we have a more predictable behaviour. In this test the tail of jobs towards zero is drastically reduced). This is only one test and we are still affected by a bad distribution of data in DPM as they are still mostly concentrated on 2 servers over 4. There are also other things we can tweak to optimize access. The next steps to do with the same test (for comparison) are:
1) Spread the data more evenly on the data servers if we can se04 was hammered for a good while and had load 80-100 for few hours according to nagios.
2) Increase the number of jobs that can run at the same time
3) Look at the distribution of jobs on the WN.This might be useful to know how to do it when we will have 8 cores rather than two.
4) Look at the job distribution in time.