Tuesday, 20 April 2010

Scaling, capacity publishing and accounting

Introduction

Our main cluster at Liverpool is homogeneous at present, but that will change shortly. This document is part of our preparation for supporting a heterogeneous cluster. It will help us to find the scaling values and HEPSPEC06 power settings for correct publishing and accounting in a heterogeneous cluster that uses TORQUE/Maui without sub-clustering. I haven't fully tested the work, but I think it's correct and I hope it's useful for sysadmins at other sites.

Clock limits

There are two types of time limit operating on our cluster: a wall clock limit and a CPU time limit. The wall clock limit is the amount of real (elapsed) time a job can last before it gets killed by the system. The CPU time limit is the amount of time the job can spend running on a CPU before it gets killed. Example: say you have a single-CPU system with two job slots on it. When busy, it would be running two jobs, and each job would get about half the CPU. It would therefore make good sense to give it a wall clock limit of around twice the CPU limit, because the CPU is divided between two tasks. On systems with one job slot per CPU, the wall and CPU limits are around the same value.

Notes:

  1. In practice, a job may not make efficient use of a CPU – it may have to wait for input or output. A factor is sometimes applied to the wall clock limit to try to account for CPU inefficiencies. E.g. on our single-CPU system with two job slots, we might choose a wall clock limit of twice the CPU limit, then add some margin to the wall clock limit to account for CPU inefficiency (see the sketch after these notes).

  2. From now on, I assume that multi-core nodes have a full complement of slots, and are not over-subscribed, i.e. a system with N cores has N slots.
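To make the arithmetic in this section concrete, here is a minimal sketch of that relationship. The numbers are purely illustrative, not values from our configuration.

# Wall clock limit derived from the CPU limit, slots per CPU and an inefficiency margin.
cpu_limit_hours = 10        # CPU time limit for a job
slots_per_cpu = 2           # two job slots sharing one CPU
inefficiency_margin = 1.1   # extra allowance for jobs that wait on input or output (note 1)

wall_clock_limit_hours = cpu_limit_hours * slots_per_cpu * inefficiency_margin
print(round(wall_clock_limit_hours, 1))   # 22.0 real hours before the job would be killed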

Scaling factors

A scaling factor is a value that makes systems comparable. Say you have systemA, which executes 10 billion instructions per second. Let’s say you have a time limit of 10 hours. We assume one job slot per CPU. Now let us say that, after a year, we add a new system to our cluster, say systemB, which runs 20 billion instructions per second.

What should we do about the time limits to account for the different CPU speeds? We could have separate limits for each system, or we could use a scaling factor. The scaling factor could be equal to 1 on systemA and 2 on systemB. When deciding whether to kill a job, we take the time limit and divide it by the scaling factor. This would be 10/1 for systemA and 10/2 for systemB. Thus, if the job has been running on systemB for more than 5 hours, it gets killed. We have normalised the clock limits and made them comparable.

This can be used to expand an existing cluster with faster nodes, without requiring different clock limits for each node type. You have one limit, with different scaling factors.
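The effect of dividing one shared limit by a per-node scaling factor can be sketched like this (a toy illustration of the systemA/systemB example, not part of any real configuration):

# One shared time limit, normalised per node by its scaling factor.
time_limit_hours = 10
scaling = {"systemA": 1.0, "systemB": 2.0}

for node, factor in scaling.items():
    print(node, "jobs are killed after", time_limit_hours / factor, "real hours")
# systemA jobs are killed after 10.0 real hours
# systemB jobs are killed after 5.0 real hours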

Another use for the scaling factor is accounting

The scaling factor is applied on the nodes themselves. The head node, which dispatches jobs to the worker nodes, doesn't know about the normalised time limits; it regards all the nodes as having the same time limits. The same applies to accounting. The accounting system doesn't know the real length of time a job spent on a particular node, even though 2 hours on systemB is worth as much work as 4 hours on systemA. It is all transparent.

Example: the worker nodes also normalise the accounting figures. Let's assume a job that takes four hours of CPU on systemA. The calculation for the accounting would be: usage = time * scaling factor, yielding 4 * 1 = 4 hours of CPU time (I'll say more about the units used for CPU usage shortly). The same job would take 2 hours on systemB. The calculation for the accounting would be: usage = time * scaling factor, yielding 2 * 2 = 4 hours, i.e. although the systemB machine is twice as fast, the accounting figures still show that the same amount of work is done to complete the job.
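A minimal sketch of that calculation, with an illustrative function name:

# Accounting normalisation: usage = time * scaling factor.
def accounted_usage(real_cpu_hours, scaling_factor):
    return real_cpu_hours * scaling_factor

print(accounted_usage(4, 1))   # systemA: 4 hours of accounted CPU
print(accounted_usage(2, 2))   # systemB: same job, still 4 hours of accounted CPU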

The CPUs at a site

The configuration at our site contains these two definitions:

CE_PHYSCPU=776

CE_LOGCPU=776

They describe the number of CPU chips in our cluster and the total number of logical CPUs. The intent is to model the fact that, very often, each CPU chip has multiple logical "computing units" within it (although not on our main cluster, yet). These may be implemented as multiple cores, hyperthreading or something similar, but the idea is that we have silicon chips with units inside them that do data processing. In our case, we have 776 logical CPUs, and we have the same number of physical CPUs because we have 1 core per CPU. We are in the process of moving our cluster to a mixed system, at which point the values for these variables will need to be worked out anew.

Actual values for scaling factors

Now that the difference between physical and logical CPUs is apparent, I'll show ways to work out actual scaling values. The simplest example consists of a homogeneous site without scaling, where some new nodes of a different power need to be added.

Example: let's imagine a notional cluster of 10 systems, each with 2 physical CPUs of 2 cores each (call this typeC), to which we wish to add 5 systems, each with 4 physical CPUs of 1 core each (call these typeD). To work out the scaling factor to use on the new nodes, to make them equivalent to the existing ones, we would use a benchmarking program, say HEPSPEC06, to obtain a value for the power of each system type. We then divide these values by the number of logical CPUs in each system, yielding a "per core" HEPSPEC06 value for each type of system. We can then work out the scaling factor for the new systems:

scaling_factor = type_d_per_core_hs06 / type_c_per_core_hs06

The resulting value is then used in the /var/spool/pbs/mom_priv/config file of the new worker nodes, e.g.

$cpumult 1.86

$wallmult 1.86

These values are taken from an earlier cluster set-up at our site that used scaling. As a more complex example, it would be possible to set a notional reference strength at a round number and scale all CPUs to that value, using a procedure similar to the one above but including all the nodes in the cluster, i.e. all nodes would have scaling values. The reference strength would be abstract.
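Here is a minimal sketch of that per-core calculation for the typeC/typeD example. The HEP-SPEC06 numbers are invented for illustration only; they happen to give a factor close to the 1.86 shown above.

# Per-core scaling factor for the new (typeD) nodes, with invented benchmark numbers.
type_c_hs06 = 28.0   # whole-system HEP-SPEC06 for one typeC box (2 CPUs x 2 cores)
type_d_hs06 = 52.0   # whole-system HEP-SPEC06 for one typeD box (4 CPUs x 1 core)

type_c_per_core_hs06 = type_c_hs06 / 4   # 4 logical CPUs per typeC system
type_d_per_core_hs06 = type_d_hs06 / 4   # 4 logical CPUs per typeD system

scaling_factor = type_d_per_core_hs06 / type_c_per_core_hs06
print(round(scaling_factor, 2))   # the value to use for $cpumult and $wallmult on the new nodes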

The power at a site

The following information is used in conjunction with the number of CPUs at a site, to calculate the full power.

CE_OTHERDESCR=Cores=1,Benchmark=5.32-HEP-SPEC06

This has two parts. The first, Cores=, is the number of cores in each physical CPU in our system. In our case, it's exactly 1. But if you have, say, 10 systems with 2 physical CPUs of 2 cores each, and 5 systems with 4 physical CPUs of 1 core each, the values would be as follows:

CE_PHYSCPU=(10 x 2) + (5 x 4) = 40

CE_LOGCPU=(10 x 2 x 2) + (5 x 4 x 1) = 60

Therefore, at this site, Cores = CE_LOGCPU/CE_PHYSCPU = 60/40 = 1.5

The second part is Benchmark=. In our case, this is the HEP-SPEC06 value from one of our worker nodes. The HEP-SPEC06 value is an estimate of the CPU strength that comes out of a benchmarking program; for us it works out at 5.32, and it is an estimate of the power of one logical CPU. This was easy to work out in a cluster with only one type of hardware. But if you have the notional cluster described above (10 systems with 2 physical CPUs of 2 cores each, and 5 systems with 4 physical CPUs of 1 core each) you'd have to compute it more generically, like this:

Using the benchmarking program, find the total HEP-SPEC06 value for each type of system in your cluster. Call this the SYSTEM-HEP-SPEC06 value for that system type (alternatively, if the benchmark script allows it, compute the PER-CORE-HEP-SPEC06 value for each type of system and multiply it by the number of cores in that system to get the SYSTEM-HEP-SPEC06 value).

For each type of system, multiply the SYSTEM-HEP-SPEC06 value by the number of systems of that type that you have, yielding SYSTEM-TYPE-HEP-SPEC06 value – the total power of all of the systems of that type that you have in your cluster.

Add all the different SYSTEM-TYPE-HEP-SPEC06 values together, giving the TOTAL-HEP-SPEC06 value for your entire cluster.

Divide the TOTAL-HEP-SPEC06 by the CE_LOGCPU value, giving the average strength of a single core. This is the value that goes in the Benchmark= variable, mentioned above. Rationale: this value can be multiplied by CE_LOGCPU to give the full strength, independently of the types of node.
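Putting the whole procedure together for the notional mixed cluster, here is a minimal sketch. The SYSTEM-HEP-SPEC06 figures are invented purely for illustration.

# Sketch of the Cores= and Benchmark= calculation for the notional mixed cluster.
cluster = [
    # (number of systems, physical CPUs per system, cores per CPU, SYSTEM-HEP-SPEC06)
    (10, 2, 2, 28.0),   # typeC
    (5,  4, 1, 52.0),   # typeD
]

phys_cpus  = sum(n * cpus for n, cpus, cores, hs06 in cluster)          # CE_PHYSCPU = 40
log_cpus   = sum(n * cpus * cores for n, cpus, cores, hs06 in cluster)  # CE_LOGCPU  = 60
total_hs06 = sum(n * hs06 for n, cpus, cores, hs06 in cluster)          # TOTAL-HEP-SPEC06

benchmark = total_hs06 / log_cpus   # average HEP-SPEC06 of one logical CPU -> Benchmark=
cores     = log_cpus / phys_cpus    # -> Cores= (1.5 here)
print(phys_cpus, log_cpus, cores, round(benchmark, 2))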

Stephen Burke explains

Stephen explained how you can calculate your full power, like this:

The installed capacity accounting will calculate your total CPU capacity as LogicalCPUs*Benchmark, so you should make sure that both of those values are right, i.e. LogicalCPUs should be the total number of cores in the sub cluster, and Benchmark should be the *average* HEPSPEC for those cores. (And hence Cores*Benchmark is the average power per physical CPU, although the accounting doesn’t use that directly.) That should be the real power of your system, regardless of any scaling in the batch system.

This means that, if we have the logical CPUs figure right, and the benchmark figure right, then the installed capacity will be published correctly. Basically, Cores * PhysicalCPUs must equal LogicalCPUs, and Cores * Benchmark gives the power per physical CPU.
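As a quick sanity check with our own published figures, that calculation is simply LogicalCPUs times Benchmark:

# Installed capacity = LogicalCPUs * Benchmark, using our published figures.
log_cpus = 776          # CE_LOGCPU
benchmark_hs06 = 5.32   # Benchmark= value in CE_OTHERDESCR
print(round(log_cpus * benchmark_hs06, 2))   # about 4128 HEP-SPEC06 for the whole cluster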

Configuration variables for power publishing and accounting

I described above how the full strength of the cluster can be calculated. We also want to make sure we can calculate the right amount of CPU used by any particular job, via the logs and the scaled times. The relevant configuration variables in site-info.def are:

CE_SI00=1330

CE_CAPABILITY=CPUScalingReferenceSI00=1330

On our site, they are both the same (1330). I will discuss that in a moment. But first, read what Stephen Burke said about the matter:

If you want the accounting to be correct you .. have to .. scale the times within the batch system to a common reference. If you don’t … what you are publishing is right, but the accounting for some jobs will be under-reported. Any other scheme would result in some jobs being over-reported, i.e. you would be claiming that they got more CPU power than they really did, which is unfair to the users/VOs who submit those jobs.

In this extract, he is talking about making the accounting right in a heterogeneous cluster. We don’t have one yet, but we’ll cover the issues. He also wrote this:

You can separately publish the physical power in SI00 and the scaled power used in the batch system in ScalingReferenceSI00.

Those are the variables used to transmit the physical power and the scaled power.


Getting the power publishing right without sub-clustering

First, I’ll discuss CE_SI00 (SI00). This is used to publish the physical computing power. I’ll show how to work out the value for our site.

Note: someone has decreed that (for the purposes of accounting) 1 x HEPSPEC06 is equal to 250 x SI00 (SpecInt2k). SpecInt2k is another benchmark program, so I call this accounting unit a bogoSI00, because it is distinct from the real, measured SpecInt2k value and directly related to the real HEPSPEC06 value.

I want to convert the average, benchmarked HEPSPEC06 strength of one logical CPU (5.32) into bogoSI00s by multiplying it by 250, giving 1330 on our cluster. As shown above, this value is a physical measure of the strength of one logical CPU, and it can be used to get a physical measure of the strength of our entire cluster by multiplying it by the total number of logical CPUs. The power publishing will be correct when I update the CE_SI00 value in site-info.def, and run YAIM etc.
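The conversion itself is a one-liner, using our own figures:

# CE_SI00 from the average HEP-SPEC06 of one logical CPU.
hs06_per_logical_cpu = 5.32
ce_si00 = hs06_per_logical_cpu * 250   # by convention, 1 HEP-SPEC06 == 250 bogoSI00
print(round(ce_si00))                  # 1330, the value for CE_SI00 in site-info.def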

Getting the accounting right without sub-clustering

Next, I'll discuss getting the accounting right, which involves the CE_CAPABILITY=CPUScalingReferenceSI00 variable. This value is used by the APEL accounting system to work out how much CPU has been provided to a given job: APEL takes the duration of the job and multiplies it by this figure to yield the actual CPU provided. As explained above, some worker nodes use a scaling factor to compensate for differences between node types, i.e. a node may report adjusted CPU times so that all the nodes are comparable. By scaling the times, the logs are adjusted to show the extra work that has been done. If the head node queries a worker node to see whether a job has exceeded its wall clock limit, the scaled times are used, which keeps things consistent. The whole activity is transparent to the head node and the accounting system.

There are several possible ways to decide the CPUScalingReferenceSI00 value, and you must choose one of them. I'll go through them one at a time; a short sketch after the list pulls the arithmetic together.

  • Case 1: First, imagine a homogeneous cluster, where all the nodes are the same and no scaling takes place on the worker nodes at all. In this case, the CPUScalingReferenceSI00 is the same as the value of one logical CPU, measured in HEPSPEC06 and expressed as bogoSI00, i.e. 1330 in our case. The accounting is unscaled and all the logical CPUs/slots/cores are the same, so the reference value is the same as the physical one.

Example: 1 hour of CPU (scaled by 1.0) multiplied by 1330 = 1330 bogoSI00_hours delivered.

  • Case 2: Next, imagine the same cluster, with one difference – you use scaling on the worker nodes to give the nodes a notional reference strength (e.g. to get round numbers). I might want all my nodes to have their strength normalised to 1000 bogoSI00s (4 x HEPSPEC06). I would use a scaling factor of 1.33 on the worker nodes, so the time reported for a job would be real_job_duration * 1.33. I could then set CPUScalingReferenceSI00=1000 and still get accurate accounting figures.

Example: 1 hour of CPU (scaled by 1.33) multiplied by 1000 = 1330 bogoSI00_hours delivered, i.e. no change from case 1.

  • Case 3: Next, imagine the cluster in case 1 (a homogeneous cluster, no scaling), to which I then add some new, faster nodes, making a heterogeneous cluster. This happened at Liverpool, before we split the clusters up. We dealt with it by using a scaling factor on the new worker nodes, so that they appeared equivalent, in terms of CPU delivery, to the existing nodes. Thus, in that case, no change was required to the CPUScalingReferenceSI00 value – it remained at 1330.

    Example: 1 hour of CPU (scaled locally by a node-dependent value to make each node equivalent to 1330 bogoSI00) multiplied by 1330 = 1330 bogoSI00_hours delivered. No change from case 1. The scaling is done transparently on the new nodes.

  • Case 4: Another line of attack in a heterogeneous cluster is to use a different scaling factor on each node type, normalising the different nodes to the same notional reference strength. As the examples above show, whatever reference value is selected, the same number of bogoSI00_hours is delivered in every case, provided the scaling values applied on the worker nodes make them equivalent to the chosen reference strength.
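Here is the sketch promised above, pulling cases 1 to 3 together. The scaling factors and reference values are the ones from the examples; the case 3 line assumes a node twice as fast as the reference, purely for illustration.

# Sketch: accounted work = scaled job time * CPUScalingReferenceSI00.
def bogo_si00_hours(real_cpu_hours, node_scaling_factor, reference_si00):
    return real_cpu_hours * node_scaling_factor * reference_si00

print(bogo_si00_hours(1.0, 1.00, 1330))   # case 1: no scaling, reference 1330
print(bogo_si00_hours(1.0, 1.33, 1000))   # case 2: node scaled to a notional 1000 reference
print(bogo_si00_hours(0.5, 2.00, 1330))   # case 3: a twice-as-fast node, scaled back to 1330
# All three print 1330.0 - the same work is accounted whichever reference is chosen.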

APEL changes to make this possible

Formerly, APEL looked only at the SI00 value. If you scaled in a heterogeneous cluster, you could choose to have your strength correct or your accounting correct but not both.

So APEL has been changed to alter the way the data is parsed. This allows us to pass the scaling reference value in a variable called CPUScalingReferenceSI00. You may need to update your MON box to make use of this feature.

Written by Steve with help from Rob and John, Liverpool

19 April 2010