Monday, 28 April 2014

Snakey - a mindless way to reboot the cluster

Introduction

I'm fed up with all the book-keeping when I need to reboot or rebuild our cluster.

First I need to set a subset of nodes offline. Then I have to monitor them until some are drained. Then, as soon as any is drained, I have to reboot it by hand, then wait for it to build, then test it and finally put it back online. Then I choose another set (maybe a rack) and go through the same thing over and over until the cluster is done.

So, to cut all that out, I've written a pair of Perl scripts, called snakey.pl and post_snakey.pl. I run each of them (at the same time) in its own terminal and they do all that work for me, so I can do other things, like writing blog posts. Start snakey.pl first.

Note: all this assumes the use of the testnodes suite written by Rob Fay at Liverpool.

Part 1 – Snakey

This Perl script, snakey.pl, reads a large list of nodes and puts a selection of them offline with testnodes. It drains them, and reboots each one once it has drained. For each one that gets rebooted, another node from the list is offlined. In this way, it "snakes" through the selected part of the cluster. Our standard buildtools+puppet+yaim system takes care of the provisioning.
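A typical invocation looks something like this (the file name is just an example; the file should contain one node name per line, which is what the read loop at the top of the script expects):

./snakey.pl -n nodes-to-reboot.txt -s 4

The -s option sets how many nodes are held offline and draining at any one time; it defaults to 5 if you leave it out.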

Part 2 – Post Snakey

Another script, post_snakey.pl, checks whether the nodes have been rebooted by snakey, and whether they pass the testnodes test script. Any that do are put back on, so they come online. The scripts have some safety checks to stop havoc breaking out; they usually just stop if anything weird is seen.
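The two scripts communicate through /root/scripts/testnodes-exemptions.txt. When snakey.pl offlines a node it appends a line like the one below (the node name and timestamp here are made up), and post_snakey.pl later parses the node name and offline time back out of it:

r26-n14 # snakey.pl put this offline 1398675600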

Part 3 – Source Code

You've seen all the nice blurb, so here's the source code. I've had to fix it up because HTML knackers the "<", ">" and "&" chars - I hope I haven't broken it.

Note: not the cleanest code I've ever written, but it gets the job done.

Good luck!


----- snakey.pl ----------------------
#!/usr/bin/perl

use strict;
use Fcntl ':flock';
use Getopt::Long;

sub initParams();

my %parameter;

initParams();

my @nodesToDo;

open(NODES,"$parameter{'NODES'}") or die("Cannot open file of nodes to reboot, $!\n");
while(<NODES>) {
  chomp($_);
  push(@nodesToDo,$_);
}
close(NODES);

checkOk(@nodesToDo);

my @selection = selectSome($parameter{'SLICE'});
foreach my $n(@selection) {
  print "Putting $n offline\n";
  putOffline($n);
}

while( $#selection > -1) {

  my $drainedNode = '';
  while($drainedNode eq '') {
    sleep( 600 );
    $drainedNode = checkIfOneHasDrained(@selection);
  }
 
  @selection = remove($drainedNode,@selection);

  print("Rebooting $drainedNode\n");
  my $status = rebootNode($drainedNode);
  print("status -- $status\n");

  my @nextOne = selectSome(1);
  if ($#nextOne == 0) {
    my $nextOne = $nextOne[0];
    print "Putting $nextOne offline\n";
    putOffline($nextOne);
    push(@selection,$nextOne);
  }
}
#-----------------------------------------
sub putOffline() {
  my $node = shift();
  open(TN,"/root/scripts/testnodes-exemptions.txt") or die("Could not open testnodes.exemptions.txt, $!\n");
  while() {
    my $l = $_;
    chomp($l);
    $l =~ s/#.*//;
    $l =~ s/\s*//g;
    if ($node =~ /^$l$/) {
      print ("Node $node is already in testnodes-exemptions.txt\n");
      return;
    }
  }
  close(TN);
  open(TN,">>/root/scripts/testnodes-exemptions.txt") or die("Could not open testnodes.exemptions.txt, $!\n");
  flock(TN, LOCK_EX) or die "Could not lock /root/scripts/testnodes-exemptions.txt, $!";
  print (TN "$node # snakey.pl put this offline " . time() . "\n");
  close(TN) or die "Could not write /root/scripts/testnodes-exemptions.txt, $!";
}
#-----------------------------------------
sub remove() {
  my $drained = shift();
  my @poolOfNodes = @_;

  my @newSelection = ();
  foreach my $n (@poolOfNodes) {
    if ($n !~ /$drained/) {
      push(@newSelection,$n);
    }
  }
  die("None removed\n") unless($#newSelection == ($#poolOfNodes -1));
  return @newSelection;
}

#-----------------------------------------
sub checkIfOneHasDrained(@) {
  my @nodesToCheck = @_;
  foreach my $n (@nodesToCheck) {
    my $hadReport = 0;
    my $state = "";
    my $jobCount = 0;

    open(PBSNODES,"pbsnodes $n|");
    while(<PBSNODES>) {
      my $l = $_;
      chomp($l);
      if ($l =~ /state = (.*)/) {
        $state = $1;
        $hadReport = 1;
      }
      if (/jobs = (.*)/) {
        my $jobs = $1;
        my @jobs = split(/,/,$jobs);
        $jobCount = $#jobs + 1;
      }
    }
    close(PBSNODES);
   
    print("Result of check on $n: hadReport - $hadReport, state - $state, jobCount - $jobCount\n");
    if (($hadReport) && ($state eq 'offline') && ($jobCount ==0)) {
      return $n;
    }
  }
  return "";
}

#-----------------------------------------
sub selectSome($) {
  my $max = shift;
  my @some = ();
  for (my $ii = 0; $ii < $max; $ii++) {
    if (defined($nodesToDo[0])) {
      push(@some,shift(@nodesToDo));
    }
  }
  return @some;
}

#-----------------------------------------
sub checkOk(){
  my @nodes = @_;
 
  foreach my $n (@nodes) {
    my $actualNode = 0;
    my $state      = "";
    open(PBSNODES,"pbsnodes $n|") or die("Could not run pbsnodes, $!\n");
    while(<PBSNODES>) {
      if (/state = (.*)/) {
        $state = $1;
        $actualNode = 1;
      }
    }
    close(PBSNODES);
    if (! $actualNode) {
      die("Node $n was not an actual one!\n");
    }
    if ($state =~ /offline/) {
      die ("Node $n was already offline!\n");
    }
  }
  return @nodes;
}

#-----------------------------------------
sub initParams() {

  GetOptions ('h|help'       =>   \$parameter{'HELP'},
              'n:s'          =>   \$parameter{'NODES'} ,
              's:i'          =>   \$parameter{'SLICE'} ,
              );

  if (defined($parameter{'HELP'})) {
    print <<TEXT;
Abstract: A tool to drain and boot a bunch of nodes

  -h  --help                  Prints this help page
  -n                 nodes    File of nodes to boot
  -s                 slice    Size of slice to offline at once

TEXT
    exit(0);
  }

  if (!defined($parameter{'SLICE'})) {
    $parameter{'SLICE'} = 5;
  }

  if (!defined($parameter{'NODES'})) {
    die("Please give a file of nodes to reboot\n");
  }

  if (! -s  $parameter{'NODES'} ) {
    die("Please give a real file of nodes to reboot\n");
  }
}
#-----------------------------------------
sub rebootNode($) {
  my $nodeToBoot = shift();
  my $nodeToCheck = $nodeToBoot;
  my $pbsnodesWorked = 0;
  my $hasJobs        = 0;
  open(PBSNODES,"pbsnodes $nodeToCheck|");
  while(<PBSNODES>) {
    if (/state =/) {
      $pbsnodesWorked = 1;
    }
    if (/^\s*jobs = /) {
      $hasJobs = 1;
    }
  }
  close(PBSNODES);
  if (! $pbsnodesWorked) { return 0; }
  if (  $hasJobs       ) { return 0; }

  open(REBOOT,"ssh -o StrictHostKeyChecking=no -o BatchMode=yes -o ConnectTimeout=10 $nodeToBoot reboot|");
  while(<REBOOT>) {
    print;
  }
  return 1;
}

----- post_snakey.pl ----------------------

#!/usr/bin/perl

use strict;
use Fcntl ':flock';
use Getopt::Long;

my %offlineTimes;

while ( 1 ) {
  %offlineTimes = getOfflineTimes();
  my @a=keys(%offlineTimes);
  my $count = $#a;

  if ($count == -1 ) {
    print("No work to do\n");
    exit(0);
  }
 
  foreach my $n (keys(%offlineTimes)) {
 
    my $uptime = -1;
    open(B,"ssh -o ConnectTimeout=2 -o BatchMode=yes $n cat /proc/uptime 2>&1|");
    while(<B>) {
      if (/([0-9\.]+)\s+[0-9\.]+/) {
        $uptime = $1;
      }
    }
    close(B);
    if ($uptime == -1) {
      print("Refusing to remove $n because it may not have been rebooted\n");
    }
    else {
      my $offlineTime = $offlineTimes{$n};
      my $timeNow = time();
      if ($timeNow - $uptime <= $offlineTime ) {
        print("Refusing to remove $n. ");
        printf("Last reboot - %6.3f  days ago. ", $uptime / 24 / 60 /60);
        printf("Offlined    - %6.3f  days ago.\n", ($timeNow - $offlineTime)  / 24 / 60 /60);
      }
      else {
        print("$n has been rebooted\n");
        open(B,"ssh -o ConnectTimeout=2 -o BatchMode=yes $n ./testnode.sh|");
        while(<B>) { }   # drain the output; we only need the exit status
        close(B);
        my $status = $? >> 8;
        if ($status == 0) {
          print("$n passes testnode.sh; will remove from exemptions\n");
          removeFromExemptions($n);
        }
        else {
          print("$n is not passing testnode.sh - $status\n");
        }
      }
    }
  }
  sleep 567;
}

#-----------------------------------------
sub getOfflineTimes() {
  my %offlineTimes = ();
  open(TN,"
  while() {
    if (/(\S+)\s+\# snakey.pl put this offline (\d+)/) {
      $offlineTimes{$1} = $2;
    }
  }
  close(TN);
  return %offlineTimes;
}

#-----------------------------------------
sub removeFromExemptions($) {

  my $node = shift();

  open(TN,"
  my @lines = ;
  close( TN );
  open(TN,">/root/scripts/testnodes-exemptions.txt") or die("Could not open testnodes.exemptions.txt, $!\n");
  flock(TN, LOCK_EX) or die "Could not lock /root/scripts/testnodes-exemptions.txt, $!";
  foreach my $line ( @lines ) {
    print TN $line unless ( $line =~ m/$node/ );
  }
  close(TN) or die "Could not write /root/scripts/testnodes-exemptions.txt, $!";
}

Tuesday, 15 April 2014

Kernel Problems at Liverpool

Introduction

Liverpool recently updated its cluster to SL6. In doing so, a problem appeared whereby the kernel would experience lockups during normal operation. The signs are unresponsiveness, drop-outs in Ganglia and (later) many "task... blocked for 120 seconds" messages in /var/log/m.. and dmesg.


Description

Kernels in the range 2.6.32-431* exhibited a type of deadlock when run on certain hardware with BIOS dated after 8th March 2010.

This problem occurred on Supermicro hardware, on these main boards:
  • X8DTT-H
  • X9DRT

Notes:

1) No hardware with BIOS dated 8th March 2010 or before showed this defect, even on the same board type.

2) The oldest kernel of the 2.6.32-358 range is solid. This is corroborated by operational experience with the 358 range.

3) All current kernels in the 2.6.32-431 range exhibited the problem on our newest hardware, and on a few nodes of the older hardware that had had unusual BIOS updates.

Testing

The lock-ups are hard to reproduce, but after a great deal of trial and error, a roughly 90% effective predictor was found.

The procedure is to:

  • Build the system completely new in the usual way and 
  • When Yaim gets to "config_users", use a script (stress.sh) to run 36 threads of gzip and one of iozone (a rough sketch of this load is given below). 

On a susceptible node, this is reasonably certain to make it lock up after a minute. The signs are unresponsiveness and (later) "task... blocked for 120 seconds" messages in /var/log/m.. and dmesg.
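The original stress.sh isn't reproduced here, but the load it applies is simple enough to sketch. The fragment below is my own rough Perl reconstruction (the exact gzip and iozone command lines are assumptions, not the original script): it forks 36 gzip workers and runs one iozone pass alongside them.

#!/usr/bin/perl
# Rough reconstruction of the stress.sh load: 36 gzip workers plus one iozone run.
use strict;

my @kids;
for my $ii (1 .. 36) {
  my $pid = fork();
  die("fork failed: $!\n") unless defined($pid);
  if ($pid == 0) {
    # Child: compress zeroes forever to keep one core busy.
    exec("gzip -c < /dev/zero > /dev/null") or die("exec failed: $!\n");
  }
  push(@kids, $pid);
}

# One iozone pass in the foreground to work the disk and the page cache.
system("iozone -a");

# Tidy up the gzip workers once iozone has finished.
kill('TERM', @kids);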

I observed that if the procedure is not followed exactly, it is unreliable as a predictor. In particular, if you stop Yaim and try again, the predictor is useless.

To test that, I isolated the config_users script from Yaim, and ran it separately along with the stress.sh script. Result: useless - no lock-ups were seen.

Note: This result was rather unexpected because the isolated config_users.sh script works in the same way as the original.

Unsuccessful Theories

A great many theories were tested and rejected or not pursued further (APIC problems, disk problems, BIOS differences, various kernels, examination of kernel logs, much googling, etc.). Eventually a seemingly successful theory was stumbled upon, which I describe below.

The Successful Theory

All our nodes had unusual vm settings:

# grep dirty /etc/sysctl.conf
vm.dirty_background_ratio = 100
vm.dirty_expire_centisecs = 1800000
vm.dirty_ratio = 100


These custom settings facilitate the storage of ATLAS "short files" in RAM. Basically, they let written files stay in the page cache, off disk, for a long time, allowing very fast access.

These modifications had been tested almost exhaustively for several years on earlier kernels - but perhaps some change (or latent bug?) in the newer kernel had invalidated them somehow.

We came up with the idea that the issue originates in the memory operations that occur prior to Yaim/config_users. This would explain why anything but the exact activity created by the procedure might well not trigger the defect. We thought this could tally with the ATLAS "short file" modifications in sysctl.conf. The theory is that these mods set up the problem during the memory read/write operations (i.e. the asynchronous OS loading and flushing of the page cache).

To test this, I used the predictor on susceptible nodes, but without applying the ATLAS "short file" patch. Default vm settings were adopted instead.
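For completeness, "default vm settings" here means removing the three overrides above so the kernel falls back to its stock values. From memory, on SL6 those stock values are approximately as follows (worth confirming with sysctl on a freshly built node):

# sysctl vm.dirty_background_ratio vm.dirty_ratio vm.dirty_expire_centisecs
vm.dirty_background_ratio = 10
vm.dirty_ratio = 20
vm.dirty_expire_centisecs = 3000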

Result

Very satisfying at last - absolutely no sign of the defect. As the ATLAS "short file" patch is not very beneficial given the current data traffic, we have decided to go back to the default "vm.dirty" settings and monitor the situation carefully.