I've replaced the standard defrag daemon released with Condor with a simpler version that contains a proportional-integral (PI) controller.
I hoped this would give us better control over multicore slots. Preliminary results with only the proportional part of the controller show that it fails to keep accurate control over the provision of slots. It is subject to hunting, due to the long time lag between the onset of draining and the eventual change in the controlled variable (which is 'running mcore jobs'). The rate of provision was unexpectedly stable at first, considering the simplicity of the algorithm employed, but degraded over time as the controlled variable became more erratic.
The graph below shows the very preliminary picture, with a temporary period of stable control shown by the green line on the right of the plot. The setpoint is 250.
I have also now included an integral component in the controller, and I'm in the process of tuning its reset rate. I hope to show the results of this test soon.
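For illustration, the control law in question can be sketched in a few lines of Python (a hypothetical toy, not the code running against Condor; the gains, and reading the output as "number of drains to start", are my assumptions):

```python
class PIController:
    """Minimal proportional-integral loop: the output suggests how many
    machines to start draining, given the current running-mcore-jobs count."""

    def __init__(self, kp, ki, setpoint):
        self.kp = kp              # proportional gain
        self.ki = ki              # integral gain (related to the reset rate)
        self.setpoint = setpoint  # target number of running mcore jobs
        self.integral = 0.0       # accumulated error

    def drains_to_start(self, running_mcore_jobs):
        error = self.setpoint - running_mcore_jobs
        self.integral += error
        output = self.kp * error + self.ki * self.integral
        return max(0, int(output))  # can't start a negative number of drains

# At the setpoint (250 running jobs) the controller asks for no new drains;
# below it, the P term (and, over time, the I term) push the drain rate up.
controller = PIController(kp=0.5, ki=0.05, setpoint=250)
```

The integral term is what removes the steady-state offset a pure P controller leaves, at the price of the tuning and hunting issues mentioned above when the plant has long time lags.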
Northgrid-tech
Tuesday 17 February 2015
Monday 17 November 2014
Condor Workernode Health Script
This is a script that makes some checks on the worker node and "turns it off" if it fails any of them. To implement this, I made use of a Condor feature: startd_cron jobs. I put this in my /etc/condor_config.local file on my worker nodes.
ENABLE_PERSISTENT_CONFIG = TRUE
PERSISTENT_CONFIG_DIR = /etc/condor/ral
STARTD_ATTRS = $(STARTD_ATTRS) StartJobs, RalNodeOnline
STARTD.SETTABLE_ATTRS_ADMINISTRATOR = StartJobs
StartJobs = False
RalNodeOnline = False

I use the prefix "Ral" here because I inherited some of this material from Andrew Lahiff at RAL! Basically, it's just to de-conflict names. I should have used "Liv" right from the start, but I'm not changing it now. Anyway, the first section says to keep a persistent record of configuration settings; it adds new configuration settings called "StartJobs" and "RalNodeOnline"; it sets them initially to False; and it makes the START configuration setting dependent upon them both being set. Note: the START setting is very important because the node won't start jobs unless it is True. I also need the following. It tells the system (startd) to run a cron script every five minutes.
STARTD_CRON_JOBLIST=TESTNODE
STARTD_CRON_TESTNODE_EXECUTABLE=/usr/libexec/condor/scripts/testnodeWrapper.sh
STARTD_CRON_TESTNODE_PERIOD=300s

# Make sure values get over
STARTD_CRON_AUTOPUBLISH = If_Changed

The testnodeWrapper.sh script looks like this:
#!/bin/bash

MESSAGE=OK
/usr/libexec/condor/scripts/testnode.sh > /dev/null 2>&1
STATUS=$?
if [ $STATUS != 0 ]; then
  MESSAGE=`grep ^[A-Z0-9_][A-Z0-9_]*=$STATUS\$ /usr/libexec/condor/scripts/testnode.sh | head -n 1 | sed -e "s/=.*//"`
  if [[ -z "$MESSAGE" ]]; then
    MESSAGE=ERROR
  fi
fi

if [[ $MESSAGE =~ ^OK$ ]] ; then
  echo "RalNodeOnline = True"
else
  echo "RalNodeOnline = False"
fi
echo "RalNodeOnlineMessage = $MESSAGE"

echo `date`, message $MESSAGE >> /tmp/testnode.status
exit 0

This just wraps an existing script which I reuse from our TORQUE/MAUI cluster. The existing script just returns a non-zero code if any error happens. To add a bit of extra information, I also look up the meaning of the code. The important thing to notice is that it echoes out a line that sets RalNodeOnline to True or False. This is then used in the setting of START. Note: on TORQUE/MAUI, the script ran as "root"; here it runs as "condor". I had to use sudo for some of the sections which (e.g.) check disks, because condor could not get smartctl settings etc. Right, so I think that's it. When a node fails the test, START goes to False and the node won't run more jobs. Oh, there's another thing to say. I use two settings to control START. As well as RalNodeOnline, I have the StartJobs setting. I can control this independently, so I can turn a node offline whether or not it has an error. This is useful for stopping the node to (say) rebuild it. It's done on the server, like this.
condor_config_val -verbose -name r21-n01 -startd -set "StartJobs = false"
condor_reconfig r21-n01
condor_reconfig -daemon startd r21-n01
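Incidentally, the grep/sed pipeline in testnodeWrapper.sh above, which maps a numeric exit status back to a symbolic name by scanning the test script for NAME=code lines, can be sketched in Python (a hypothetical helper, shown only to make the idea explicit):

```python
import re

def status_name(script_text, status):
    """Return the symbolic name for an exit code, found by scanning the
    test script's text for assignment lines of the form NAME=<code>."""
    for line in script_text.splitlines():
        m = re.match(r'^([A-Z0-9_]+)=%d$' % status, line)
        if m:
            return m.group(1)
    return "ERROR"  # same fallback the wrapper uses when no name matches
```

So if testnode.sh contains a line like SMART_FAIL=7 and exits with status 7, the wrapper reports SMART_FAIL rather than a bare number.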
Tuesday 14 October 2014
Tired of a full /var?
This is how I prevent /var from getting full on any of our servers. I wrote these two scripts, spacemonc.py and spacemond.py.
spacemonc.py is a client, and it is installed on each grid system and worker node as a cronjob:
# crontab -l | grep spacemonc.py
50 18 * * * /root/bin/spacemonc.py

Because it's going to be an (almost) single-threaded server, I use puppet to make it run at a random time on each system. (I say "almost" because the server actually uses method-level locking to hold each thread in a sleep state, so it's really a queueing server: it won't drop simultaneous incoming connections, but it's unwise to allow too many of them to occur at once.)
cron { "spacemonc":
  #ensure  => absent,
  command => "/root/bin/spacemonc.py",
  user    => root,
  hour    => fqdn_rand(24),
  minute  => fqdn_rand(60),
}

And it's pretty small:
#!/usr/bin/python
import xmlrpclib
import os
import subprocess
from socket import gethostname

proc = subprocess.Popen(["df | perl -p00e 's/\n\s//g' | grep -v ^cvmfs | grep -v hepraid[0-9][0-9]*_[0-9]"],
                        stdout=subprocess.PIPE, shell=True)
(dfReport, err) = proc.communicate()

s = xmlrpclib.ServerProxy('http://SOMESERVEROROTHER.COM.ph.liv.ac.uk:8000')
status = s.post_report(gethostname(), dfReport)
if (status != 1):
    print("Client failed")

The strange piece of perl in the middle is to stop a bad habit df has of breaking lines that have long fields (I hate that; ldapsearch and qstat also do it.) I don't want to know about cvmfs partitions, nor raid storage mounts.
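If you would rather keep that logic in Python than shell out to perl and grep, the unwrapping and filtering can be sketched like this (a hypothetical equivalent; it matches the one-liner's newline-plus-whitespace join, not every df variant):

```python
import re

def clean_df(report):
    """Rejoin df lines that were wrapped after a long device field
    (a newline followed by whitespace), then drop cvmfs partitions
    and raid-storage (hepraidNN_N) mounts."""
    unwrapped = re.sub(r'\n\s', '', report)
    keep = [l for l in unwrapped.split('\n')
            if not l.startswith('cvmfs')
            and not re.search(r'hepraid\d+_\d', l)]
    return '\n'.join(keep)
```

Like the perl, this removes the newline and a single whitespace character, so the continuation line's remaining indentation still separates the fields.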
spacemond.py is installed as a service; you'll have to pinch a /etc/init.d script to start and stop it properly (or do it from the command line to start with.) And the code for spacemond.py is pretty small, too:
#!/usr/local/bin/python2.4
import sys
from SimpleXMLRPCServer import SimpleXMLRPCServer
from SimpleXMLRPCServer import SimpleXMLRPCRequestHandler
import time
import smtplib
import logging

if (len(sys.argv) == 2):
    limit = int(sys.argv[1])
else:
    limit = 90

# Maybe put logging in some time
logging.basicConfig(level=logging.DEBUG,
                    format='%(asctime)s %(levelname)s %(message)s',
                    filename="/var/log/spacemon/log",
                    filemode='a')

# Email details
smtpserver = 'hep.ph.liv.ac.uk'
recipients = ['sjones@hep.ph.liv.ac.uk','sjones@hep.ph.liv.ac.uk']
sender = 'root@SOMESERVEROROTHER.COM.ph.liv.ac.uk'
msgheader = "From: root@SOMESERVEROROTHER.COM.ph.liv.ac.uk\r\nTo: YOURNAME@hep.ph.liv.ac.uk\r\nSubject: spacemon report\r\n\r\n"

# Test the server started
session = smtplib.SMTP(smtpserver)
smtpresult = session.sendmail(sender, recipients, msgheader + "spacemond server started\n")
session.quit()

# Restrict to a particular path.
class RequestHandler(SimpleXMLRPCRequestHandler):
    rpc_paths = ('/RPC2',)

# Create server
server = SimpleXMLRPCServer(("SOMESERVEROROTHER.COM", 8000), requestHandler=RequestHandler)
server.logRequests = 0
server.register_introspection_functions()

# Class with a method to process incoming reports
class SpaceMon:
    def post_report(self, hostname, report):
        full_messages = []
        full_messages[:] = []   # Always empty it
        lines = report.split('\n')
        for l in lines[1:]:
            fields = l.split()
            if (len(fields) >= 5):
                fs = fields[0]
                pc = fields[4][:-1]
                ipc = int(pc)
                if (ipc >= limit):
                    full_messages.append("File system " + fs + " on " + hostname +
                                         " is getting full at " + pc + " percent.\n")
        if (len(full_messages) > 0):
            session = smtplib.SMTP(smtpserver)
            smtpresult = session.sendmail(sender, recipients, msgheader + ("").join(full_messages))
            session.quit()
            logging.info(("").join(full_messages))
        else:
            logging.info("Happy state for " + hostname)
        return 1

# Register and serve
server.register_instance(SpaceMon())
server.serve_forever()

And now I get an email if any of my OS partitions is getting too full. It's surprising how small server software can be when you use a framework like XMLRPC. In the old days, I would have needed 200 lines of parsing code and case statements. Goodbye to all that.
Thursday 3 July 2014
APEL EMI-3 upgrade
Here are some notes from Manchester's upgrade to EMI-3 APEL. The new APEL is much simpler, as it is a bunch of python scripts with a couple of key=value configuration files, rather than java programs with XML files. It doesn't have YAIM to configure it, but since it is much easier to install and configure that doesn't really matter any more. As an added bonus, I found that it's also much faster when it publishes and doesn't require any tedious tuning of how many records at a time to publish.
So Manchester's starting point for the upgrade was:
- EMI-2 APEL node
- EMI-2 APEL parsers on EMI-3 cream CEs
- We have 1 batch system per CE so I haven't tried a configuration in which there is only 1 batch system and multiple CEs
- In a few months we may move to ARC-CE, so configuration was done mostly manually

The upgrade steps were:
- Install a new EMI-3 APEL node
- Configure it
- Upgrade the CEs parsers to EMI-3 and point them the new node
- Disable the old EMI-2 APEL node and backup its DB
- Run the parsers and fill the new APEL node DB
- Publish all records for the previous month from the new APEL machine
Install a new EMI-3 APEL node
Installed a vanilla VM with
- EMI-3 repositories
- Mysql DB
- Host certificates
- ca-policy-egi-core
- yum install --nogpg emi-release
- yum install apel-ssm apel-client apel-lib
Configure EMI-3 APEL node
I followed the instructions on the official EMI-3 APEL server guide.
There are no tips here; I've only changed the obvious fields like site_name and password, plus a few others such as the top BDII (because we have a local one) and the location of the host certificate (because we have a different name).
I didn't install the publisher cron job at this stage because the machine was not yet ready to publish.
Upgrade the CEs parsers to EMI-3 and point them the new node
The CEs, as I said, are already on EMI-3; only the APEL parsers were still EMI-2, so I disabled the EMI-2 cron job and installed the new parser:
- rm /etc/cron.d/glite-apel-pbs-parser
- yum install apel-parser
NOTE: the parser configuration file is, to me, a bit confusing regarding the batch system name. It states:
# Batch system hostname. This does not need to be a definitive hostname,
# but it should uniquely identify the batch system.
# Example: pbs.gridpp.rl.ac.uk
lrms_server =
It seems you can use any name, but you are of course better off using your batch system server name. We have one batch system for each CE, so the configuration file on each CE contains that CE's batch server; in the database this will identify the records coming from each CE. I'm not sure what happens with one batch system and several CEs: taken literally, one should put only the batch system name, but then there is no distinction between CEs.
Disable the old EMI-2 APEL node and backup its DB
Just removed the old cron job; the machine is still running but isn't doing anything while waiting to be decommissioned.
Run the parsers and fill the new APEL node DB
You will need to publish the entire month prior to when you are installing. For us that meant publishing all the June records, but since I didn't want to republish everything we had in the log files, I moved the batch system and blah log files from before mid-May to a backup subdirectory and parsed only the log files from late May and June. The May days were needed because some jobs that finished in early June had started in May, and one wants the complete record. The first jobs to finish in June in Manchester had started on the 25th of May, so you may want to go back a bit with the parsing.
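The date arithmetic behind "go back a bit" can be made concrete with a small sketch (a hypothetical helper, not part of APEL): to publish a whole month, parse logs back at least as far as the start date of the longest-running job that finished in that month.

```python
from datetime import date, timedelta

def earliest_log_date(month_start, max_job_runtime_days):
    """Earliest log date to parse so that every job finishing in the
    target month has its start (and hence its complete record) included."""
    return month_start - timedelta(days=max_job_runtime_days)

# Jobs finishing in early June that started on 25th May imply runtimes of
# about a week; allowing a fortnight of slack, parsing from mid-May is safe.
```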
Publish all records for the previous month from the new APEL machine
Finally, on the new machine, now filled with the June records plus some from May, I've done a bit of DB clean-up as suggested by the APEL team. If you don't do this step, the APEL team will do it centrally before stitching the old EMI-2 records to the new ones:
- Delete from JobRecords where EndTime<"2014-06-01";
- Delete from SuperSummaries where Month="5";
Wednesday 7 May 2014
Planning for SHA-2
Timeline
The voms servers at CERN will be transferred to new hosts that use the newer SHA-2 certificate standard. The changes are described in this post:
CERN VOMS service will move to new hosts
The picture below lays out the timeline for the change.
Timeline for CERN VOMS server changes
New VOMS Server Hosts
The VOs associated with these changes are alice, atlas, cms, lhcb and ops. Sites supporting any of those will have to make a plan to update.
The new hosts have been set up already and entered against the related VOs in the ops portal. The table below summarises the current set up (ignoring vo.racf.bnl.gov) as advertised in the operations portal (as of 7th May 2014).
VO | Vomses Port | Old Server | IsAdmin? | New Server | IsAdmin? |
---|---|---|---|---|---|
atlas | 15001 | lcg-voms.cern.ch | No | lcg-voms2.cern.ch | Yes |
atlas | 15001 | voms.cern.ch | Yes | voms2.cern.ch | Yes |
alice | 15000 | lcg-voms.cern.ch | No | lcg-voms2.cern.ch | Yes |
alice | 15000 | voms.cern.ch | Yes | voms2.cern.ch | Yes |
cms | 15002 | lcg-voms.cern.ch | No | lcg-voms2.cern.ch | Yes |
cms | 15002 | voms.cern.ch | Yes | voms2.cern.ch | Yes |
lhcb | 15003 | lcg-voms.cern.ch | No | lcg-voms2.cern.ch | Yes |
lhcb | 15003 | voms.cern.ch | Yes | voms2.cern.ch | Yes |
ops | 15009 | lcg-voms.cern.ch | No | lcg-voms2.cern.ch | Yes |
ops | 15009 | voms.cern.ch | Yes | voms2.cern.ch | Yes |
Notes: The IsAdmin flag tells whether the server can be used to download the information used to create the DN grid-map file. The port numbers are unaffected by the change.
VOMS Server RPMS
As described in the announcement (see link at the top), a set of rpms has been created, one per WLCG-related VO:
- wlcg-voms-alice
- wlcg-voms-atlas
- wlcg-voms-cms
- wlcg-voms-lhcb
- wlcg-voms-ops
The rpms are hosted in the WLCG yum repository. To install, e.g.
$ cd /etc/yum.repos.d/
$ wget http://linuxsoft.cern.ch/wlcg/wlcg-sl6.repo
Local Measures at Liverpool
At Liverpool, the configuration of the following servers will need to be changed:
- Argus
- Cream CE
- DPM SE
- WN
...
- UI (eventually)
There will be a gap of some weeks (see the picture) between the deadline for sites to update their services which consume certificates (e.g. Argus, Cream CE, DPM SE, and WN etc.) and the deadline for sites to update their UIs. This is to prevent the use of new-style certificates that cannot be interpreted.
So, to effect this change, Liverpool will apply the RPMS on our consuming service nodes in early May. As soon as the all-sites deadline has passed (2nd June) Liverpool will update its UIs in a similar manner.
If all goes well, Liverpool will remove reference to the old servers after the final deadline, 1st July. The plan in this case is to effect the change using the traditional yaim/site-info.def/vo.d method as these changes will need to be permanently maintained.
Effects on Approved VOs, VomsSnooper etc.
For tracking purposes, the GridPP Approved VOs document will attempt to remain synchronised with the operations portal, but the VomsSnooper process is asynchronous, so there may be discrepancies around the deadlines. Sites are advised to watch out for these race conditions. Note: while the servers are being changed (i.e. from now until 2nd June for certificate-consuming services, and from 2nd June to 1st July for producing services, e.g. UIs) there can be no canonical form of the VOMS records, because different sites have their own implementation schedules and may use different settings temporarily, as described above.
Monday 28 April 2014
Snakey - a mindless way to reboot the cluster
Introduction
I'm fed up with all the book-keeping when I need to reboot or rebuild our cluster.
First I need to set a subset of nodes offline. Then I have to monitor them until some are drained. Then, as soon as any is drained, I have to reboot it by hand, wait for it to build, test it and finally put it back online. Then I choose another set (maybe a rack) and go through the same thing over and over until the cluster is done.
So, to cut all that out, I've written a pair of perl scripts, called snakey.pl and post_snakey.pl. I run them at the same time, each in its own terminal, and they do all that work for me, so I can do other things, like write blog posts. Start snakey.pl first.
Note: all this assumes the use of the test nodes suite written by Rob Fay, at Liverpool.
Part 1 – Snakey
This perl script, called snakey.pl, reads a large list, and puts a selection offline with testnodes. It drains them, and reboots them once drained. For each one that gets booted, another from the list is offlined. In this way, it "snakes" through the selected part of the cluster. Our standard buildtools+puppet+yaim system takes care of the provisioning.
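The "snaking" idea can be sketched compactly in Python (a rendition for clarity only; the real snakey.pl does this with testnodes-exemptions.txt, pbsnodes and a 600-second sleep, and the node operations here are stubs):

```python
from collections import deque

def snake(nodes, slice_size, offline, is_drained, reboot):
    """Keep at most slice_size nodes draining at once; as each one
    drains, reboot it and offline the next node from the list."""
    todo = deque(nodes)
    active = [todo.popleft() for _ in range(min(slice_size, len(todo)))]
    for n in active:
        offline(n)
    while active:
        # The real script polls pbsnodes every 600 seconds here.
        drained = next((n for n in active if is_drained(n)), None)
        if drained is None:
            continue
        active.remove(drained)
        reboot(drained)
        if todo:
            nxt = todo.popleft()
            offline(nxt)       # keep the window full: one out, one in
            active.append(nxt)
```

The window of draining nodes stays at a constant size, so the cluster never loses more than slice_size nodes of capacity at a time.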
Part 2 – Post Snakey
Another script, post_snakey.pl, tells if the nodes have been rebooted by snakey, and if they pass the testnodes test script. Any that do are put back on, so they come online. The scripts have some safety locks to stop havoc breaking out; they usually just stop if anything weird is seen.
Part 3 – Source Code
You've seen all the nice blurb, so here's the source code. I've had to fix it up because HTML knackers the "<", ">" and "&" chars - I hope I haven't broken it.
Note: not the cleanest code I've ever written, but it gets the job done.
Good luck!
----- snakey.pl ----------------------
#!/usr/bin/perl
use strict;
use Fcntl ':flock';
use Getopt::Long;
sub initParams();
my %parameter;
initParams();
my @nodesToDo;
open(NODES,"$parameter{'NODES'}") or die("Cannot open file of nodes to reboot, $!\n");
while(<NODES>) {
chomp($_);
push(@nodesToDo,$_);
}
close(NODES);
checkOk(@nodesToDo);
my @selection = selectSome($parameter{'SLICE'});
foreach my $n(@selection) {
print "Putting $n offline\n";
putOffline($n);
}
while( $#selection > -1) {
my $drainedNode = '';
while($drainedNode eq '') {
sleep( 600 );
$drainedNode = checkIfOneHasDrained(@selection);
}
@selection = remove($drainedNode,@selection);
print("Rebooting $drainedNode\n");
my $status = rebootNode($drainedNode);
print("status -- $status\n");
my @nextOne = selectSome(1);
if ($#nextOne == 0) {
my $nextOne = $nextOne[0];
print "Putting $nextOne offline\n";
putOffline($nextOne);
push(@selection,$nextOne);
}
}
#-----------------------------------------
sub putOffline() {
my $node = shift();
open(TN,"/root/scripts/testnodes-exemptions.txt") or die("Could not open testnodes.exemptions.txt, $!\n");
while(<TN>) {
my $l = $_;
chomp($l);
$l =~ s/#.*//;
$l =~ s/\s*//g;
if ($node =~ /^$l$/) {
print ("Node $node is already in testnodes-exemptions.txt\n");
return;
}
}
close(TN);
open(TN,">>/root/scripts/testnodes-exemptions.txt") or die("Could not open testnodes.exemptions.txt, $!\n");
flock(TN, LOCK_EX) or die "Could not lock /root/scripts/testnodes-exemptions.txt, $!";
print (TN "$node # snakey.pl put this offline " . time() . "\n");
close(TN) or die "Could not write /root/scripts/testnodes-exemptions.txt, $!";
}
#-----------------------------------------
sub remove() {
my $drained = shift();
my @poolOfNodes = @_;
my @newSelection = ();
foreach my $n (@poolOfNodes) {
if ($n !~ /$drained/) {
push(@newSelection,$n);
}
}
die("None removed\n") unless($#newSelection == ($#poolOfNodes -1));
return @newSelection;
}
#-----------------------------------------
sub checkIfOneHasDrained(@) {
my @nodesToCheck = @_;
foreach my $n (@nodesToCheck) {
my $hadReport = 0;
my $state = "";
my $jobCount = 0;
open(PBSNODES,"pbsnodes $n|");
while(<PBSNODES>) {
my $l = $_;
chomp($l);
if ($l =~ /state = (.*)/) {
$state = $1;
$hadReport = 1;
}
if (/jobs = (.*)/) {
my $jobs = $1;
my @jobs = split(/,/,$jobs);
$jobCount = $#jobs + 1;
}
}
close(PBSNODES);
print("Result of check on $n: hadReport - $hadReport, state - $state, jobCount - $jobCount\n");
if (($hadReport) && ($state eq 'offline') && ($jobCount ==0)) {
return $n;
}
}
return "";
}
#-----------------------------------------
sub selectSome($) {
my $max = shift;
my @some = ();
for (my $ii = 0; $ii < $max; $ii++) {
if (defined($nodesToDo[0])) {
push(@some,shift(@nodesToDo));
}
}
return @some;
}
#-----------------------------------------
sub checkOk(){
my @nodes = @_;
foreach my $n (@nodes) {
my $actualNode = 0;
my $state = "";
open(PBSNODES,"pbsnodes $n|") or die("Could not run pbsnodes, $!\n");
while(<PBSNODES>) {
if (/state = (.*)/) {
$state = $1;
$actualNode = 1;
}
}
close(PBSNODES);
if (! $actualNode) {
die("Node $n was not an actual one!\n");
}
if ($state =~ /offline/) {
die ("Node $n was already offline!\n");
}
}
return @nodes;
}
#-----------------------------------------
sub initParams() {
GetOptions ('h|help' => \$parameter{'HELP'},
'n:s' => \$parameter{'NODES'} ,
's:i' => \$parameter{'SLICE'} ,
);
if (defined($parameter{'HELP'})) {
print <<TEXT;
Abstract: A tool to drain and boot a bunch of nodes
-h --help Prints this help page
-n nodes File of nodes to boot
-s slice Size of slice to offline at once
TEXT
exit(0);
}
if (!defined($parameter{'SLICE'})) {
$parameter{'SLICE'} = 5;
}
if (!defined($parameter{'NODES'})) {
die("Please give a file of nodes to reboot\n");
}
if (! -s $parameter{'NODES'} ) {
die("Please give a real file of nodes to reboot\n");
}
}
#-----------------------------------------
sub rebootNode($) {
my $nodeToBoot = shift();
my $nodeToCheck = $nodeToBoot;
my $pbsnodesWorked = 0;
my $hasJobs = 0;
open(PBSNODES,"pbsnodes $nodeToCheck|");
while(<PBSNODES>) {
if (/state =/) {
$pbsnodesWorked = 1;
}
if (/^\s*jobs = /) {
$hasJobs = 1;
}
}
close(PBSNODES);
if (! $pbsnodesWorked) { return 0; }
if ( $hasJobs ) { return 0; }
open(REBOOT,"ssh -o StrictHostKeyChecking=no -o BatchMode=yes -o ConnectTimeout=10 $nodeToBoot reboot|");
while(<REBOOT>) {
print;
}
return 1;
}
----- post-snakey.pl ----------------------
use strict;
use Fcntl ':flock';
use Getopt::Long;
sub initParams();
my %parameter;
initParams();
my @nodesToDo;
open(NODES,"$parameter{'NODES'}") or die("Cannot open file of nodes to reboot, $!\n");
while(
chomp($_);
push(@nodesToDo,$_);
}
close(NODES);
checkOk(@nodesToDo);
my @selection = selectSome($parameter{'SLICE'});
foreach my $n(@selection) {
print "Putting $n offline\n";
putOffline($n);
}
while( $#selection > -1) {
my $drainedNode = '';
while($drainedNode eq '') {
sleep( 600 );
$drainedNode = checkIfOneHasDrained(@selection);
}
@selection = remove($drainedNode,@selection);
print("Rebooting $drainedNode\n");
my $status = rebootNode($drainedNode);
print("status -- $status\n");
my @nextOne = selectSome(1);
if ($#nextOne == 0) {
my $nextOne = $nextOne[0];
print "Putting $nextOne offline\n";
putOffline($nextOne);
push(@selection,$nextOne);
}
}
#-----------------------------------------
sub putOffline() {
my $node = shift();
open(TN,"/root/scripts/testnodes-exemptions.txt") or die("Could not open testnodes.exemptions.txt, $!\n");
while(
my $l = $_;
chomp($l);
$l =~ s/#.*//;
$l =~ s/\s*//g;
if ($node =~ /^$l$/) {
print ("Node $node is already in testnodes-exemptions.txt\n");
return;
}
}
close(TN);
open(TN,">>/root/scripts/testnodes-exemptions.txt") or die("Could not open testnodes.exemptions.txt, $!\n");
flock(TN, LOCK_EX) or die "Could not lock /root/scripts/testnodes-exemptions.txt, $!";
print (TN "$node # snakey.pl put this offline " . time() . "\n");
close(TN) or die "Could not write /root/scripts/testnodes-exemptions.txt, $!";
}
#-----------------------------------------
sub remove() {
my $drained = shift();
my @poolOfNodes = @_;
my @newSelection = ();
foreach my $n (@poolOfNodes) {
if ($n !~ /$drained/) {
push(@newSelection,$n);
}
}
die("None removed\n") unless($#newSelection == ($#poolOfNodes -1));
return @newSelection;
}
#-----------------------------------------
sub checkIfOneHasDrained(@) {
my @nodesToCheck = @_;
foreach my $n (@nodesToCheck) {
my $hadReport = 0;
my $state = "";
my $jobCount = 0;
open(PBSNODES,"pbsnodes $n|");
while(
my $l = $_;
chomp($l);
if ($l =~ /state = (.*)/) {
$state = $1;
$hadReport = 1;
}
if (/jobs = (.*)/) {
my $jobs = $1;
my @jobs = split(/,/,$jobs);
$jobCount = $#jobs + 1;
}
}
close(PBSNODES);
print("Result of check on $n: hadReport - $hadReport, state - $state, jobCount - $jobCount\n");
if (($hadReport) && ($state eq 'offline') && ($jobCount ==0)) {
return $n;
}
}
return "";
}
#-----------------------------------------
sub selectSome($) {
my $max = shift;
my @some = ();
for (my $ii = 0; $ii < $max; $ii++) {
if (defined($nodesToDo[0])) {
push(@some,shift(@nodesToDo));
}
}
return @some;
}
#-----------------------------------------
sub checkOk(){
my @nodes = @_;
foreach my $n (@nodes) {
my $actualNode = 0;
my $state = "";
open(PBSNODES,"pbsnodes $n|") or die("Could not run pbsnodes, $!\n");
while(
if (/state = (.*)/) {
$state = $1;
$actualNode = 1;
}
}
close(PBSNODES);
if (! $actualNode) {
die("Node $n was not an actual one!\n");
}
if ($state =~ /offline/) {
die ("Node $n was already offline!\n");
}
}
return @nodes;
}
#-----------------------------------------
sub initParams() {
GetOptions ('h|help' => \$parameter{'HELP'},
'n:s' => \$parameter{'NODES'} ,
's:i' => \$parameter{'SLICE'} ,
);
if (defined($parameter{'HELP'})) {
print <
Abstract: A tool to drain and boot a bunch of nodes
-h --help Prints this help page
-n nodes File of nodes to boot
-s slice Size of slice to offline at once
TEXT
exit(0);
}
if (!defined($parameter{'SLICE'})) {
$parameter{'SLICE'} = 5;
}
if (!defined($parameter{'NODES'})) {
die("Please give a file of nodes to reboot\n");
}
if (! -s $parameter{'NODES'} ) {
die("Please give a real file of nodes to reboot\n");
}
}
#-----------------------------------------
sub rebootNode($) {
my $nodeToBoot = shift();
my $nodeToCheck = $nodeToBoot;
my $pbsnodesWorked = 0;
my $hasJobs = 0;
open(PBSNODES,"pbsnodes $nodeToCheck|");
while(
if (/state =/) {
$pbsnodesWorked = 1;
}
if (/^\s*jobs = /) {
$hasJobs = 1;
}
}
close(PBSNODES);
if (! $pbsnodesWorked) { return 0; }
if ( $hasJobs ) { return 0; }
open(REBOOT,"ssh -o StrictHostKeyChecking=no -o BatchMode=yes -o ConnectTimeout=10 $nodeToBoot reboot|");
while(<REBOOT>) {
print;
}
return 1;
}
----- post-snakey.pl ----------------------
#!/usr/bin/perl
use strict;
use Fcntl ':flock';
use Getopt::Long;
my %offlineTimes;
while ( 1 ) {
%offlineTimes = getOfflineTimes();
my @a=keys(%offlineTimes);
my $count = $#a;
if ($count == -1 ) {
print("No work to do\n");
exit(0);
}
foreach my $n (keys(%offlineTimes)) {
my $uptime = -1;
open(B,"ssh -o ConnectTimeout=2 -o BatchMode=yes $n cat /proc/uptime 2>&1|");
while(<B>) {
if (/([0-9\.]+)\s+[0-9\.]+/) {
$uptime = $1;
}
}
close(B);
if ($uptime == -1) {
print("Refusing to remove $n because it may not have been rebooted\n");
}
else {
my $offlineTime = $offlineTimes{$n};
my $timeNow = time();
if ($timeNow - $uptime <= $offlineTime ) {
print("Refusing to remove $n. ");
printf("Last reboot - %6.3f days ago. ", $uptime / 24 / 60 /60);
printf("Offlined - %6.3f days ago.\n", ($timeNow - $offlineTime) / 24 / 60 /60);
}
else {
print("$n has been rebooted\n");
open(B,"ssh -o ConnectTimeout=2 -o BatchMode=yes $n ./testnode.sh|");
while(<B>) { }
close(B);
my $status = $? >> 8;
if ($status == 0) {
print("$n passes testnode.sh; will remove from exemptions\n");
removeFromExemptions($n);
}
else {
print("$n is not passing testnode.sh - $status\n");
}
}
}
}
sleep 567;
}
#-----------------------------------------
sub getOfflineTimes() {
my %offlineTimes = ();
open(TN,"</root/scripts/testnodes-exemptions.txt") or die("Could not open testnodes-exemptions.txt, $!\n");
while(<TN>) {
if (/(\S+)\s+\# snakey.pl put this offline (\d+)/) {
$offlineTimes{$1} = $2;
}
}
close(TN);
return %offlineTimes;
}
#-----------------------------------------
sub removeFromExemptions($) {
my $node = shift();
open(TN,"</root/scripts/testnodes-exemptions.txt") or die("Could not open testnodes-exemptions.txt, $!\n");
my @lines = <TN>;
close( TN );
open(TN,">/root/scripts/testnodes-exemptions.txt") or die("Could not open testnodes-exemptions.txt, $!\n");
flock(TN, LOCK_EX) or die "Could not lock /root/scripts/testnodes-exemptions.txt, $!";
foreach my $line ( @lines ) {
print TN $line unless ( $line =~ m/$node/ );
}
close(TN) or die "Could not write /root/scripts/testnodes-exemptions.txt, $!";
}
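Two pieces of post-snakey.pl's logic are worth spelling out: the exemptions file is scanned for lines of the form "node # snakey.pl put this offline EPOCH", and a node only counts as rebooted if its boot time (roughly now minus uptime) is later than the moment it was offlined. A Python sketch of both (illustrative only; the function names are mine):

```python
import re

# Matches "node001   # snakey.pl put this offline 1400000000"
OFFLINE_RE = re.compile(r"(\S+)\s+# snakey\.pl put this offline (\d+)")

def parse_offline_times(exemptions_text):
    """Extract node -> offline-epoch pairs, as getOfflineTimes does."""
    times = {}
    for line in exemptions_text.splitlines():
        m = OFFLINE_RE.search(line)
        if m:
            times[m.group(1)] = int(m.group(2))
    return times

def rebooted_since_offlining(now, uptime_seconds, offline_epoch):
    # The boot happened at roughly now - uptime; only count the node
    # as rebooted if that is later than the moment it was offlined.
    return (now - uptime_seconds) > offline_epoch
```

Note the guard in the main loop: if /proc/uptime could not be read at all, the node is left alone rather than assumed rebooted, for the same fail-safe reason as before.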
Tuesday 15 April 2014
Kernel Problems at Liverpool
Introduction
Liverpool recently updated its cluster to SL6. In doing so, a problem occurred whereby the kernel would experience lockups during normal operations. The signs are unresponsiveness, drop-outs in Ganglia and (later) many "task...blocked for 120 seconds" msgs in /var/log/m.. and dmesg.
Description
Kernels in the range 2.6.32-431* exhibited a type of deadlock when run on certain hardware with a BIOS dated after 8th March 2010. This problem occurred on Supermicro hardware, main boards:
- X8DTT-H
- X9DRT
Notes:
1) No hardware with BIOS dated 8th March 2010 or before showed this defect, even on the same board type.
2) The oldest kernel of the 2.6.32-358 range is solid. This is corroborated by operational experience with the 358 range.
3) All current kernels in the 2.6.32-431 range exhibited the problem on our newest hardware, and a few nodes of the older hardware that had had unusual BIOS updates.
Testing
The lock-ups are hard to reproduce, but after a great deal of trial and error, a ~90% effective predictor was found. The procedure is to:
- Build the system completely new in the usual way and
- When yaim gets to "config_user", use a script (stress.sh) to run 36 threads of gzip and one of iozone.
On a susceptible node, this is reasonably certain to make it lock up after a minute. The signs are unresponsiveness and (later) "task...blocked for 120 seconds" msgs in /var/log/m.. and dmesg.
I observed that if the procedure is not followed "exactly", it is unreliable as a predictor. In particular, if you stop Yaim and try again, the predictor is useless.
To test that, I isolated the config_users script from Yaim, and ran it separately along with the stress.sh script. Result: useless - no lock-ups were seen.
Note: This result was rather unexpected because the isolated config_users.sh script works in the same way as the original.
Unsuccessful Theories
A great many theories were tested and rejected or not pursued further (APIC problems, disk problems, BIOS differences, various kernels, examination of kernel logs, much googling etc.). Eventually, a seemingly successful theory was stumbled upon, which I describe below.
The Successful Theory
All our nodes had unusual vm settings:
# grep dirty /etc/sysctl.conf
vm.dirty_background_ratio = 100
vm.dirty_expire_centisecs = 1800000
vm.dirty_ratio = 100
These custom settings facilitate the storage of ATLAS "short files" in RAM. Basically, they force files to remain off disk for a long time, allowing very fast access.
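To put the numbers in context (my arithmetic, not from the original post): dirty_ratio = 100 allows dirty pages to occupy up to 100% of memory before writers are forced to flush, and dirty_expire_centisecs is measured in hundredths of a second, so 1800000 works out to five hours before a dirty page even becomes eligible for writeback:

```python
# vm.dirty_expire_centisecs is in hundredths of a second
expire_centisecs = 1800000
expire_hours = expire_centisecs / 100 / 3600  # centisecs -> secs -> hours
# expire_hours is 5.0: dirty pages can sit in RAM for five hours
```

In other words, a short-lived job's files may never touch the disk at all, which is the whole point of the tuning.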
The modification had been tested almost exhaustively for several years on earlier kernels - but perhaps some change (or latent bug?) in the kernel had invalidated them somehow.
We came up with the idea that the issue originates in the memory operations that occur prior to Yaim/config_users. This would explain why anything but the exact activity created by the procedure might well not trigger the defect. We thought this could tally with the idea of the ATLAS "short file" modifications in sysctl.conf. The theory is that these mods set up the problem during the memory/read/write operations (i.e. the asynchronous OS loading and flushing of the page cache).
To test this, I used the predictor on susceptible nodes, but without applying the ATLAS "short file" patch. Default vm settings were adopted instead.
Result
Very satisfying at last - absolutely no sign of the defect. As the ATLAS "short file" patch is not very beneficial given the current data traffic, we have decided to go back to default "vm.dirty" settings and monitor the situation carefully.