Monday, 28 April 2014

Snakey - a mindless way to reboot the cluster

Introduction

I'm fed up with all the book-keeping when I need to reboot or rebuild our cluster.

First I need to set a subset of nodes offline. Then I have to monitor them until some are drained. Then, as soon as any is drained, I have to reboot it by hand, then wait for it to build, then test it and finally put it back online, Then I choose another set (maybe a rack) and go through the same thing over and over until the cluster is done.

So, to cut all that, I've written a pair of perl scripts, called snakey.pl and post_snakey.pl. I run each (at the same time) in a terminal and they do all that work for me, so I can do other things, like Blog Posts. Start snakey.pl first.

Note: all this assumes the use of the test nodes suite written by Rob Fay, at Liverpool.

Part 1 – Snakey

This perl script, called snakey.pl, reads a large list, and puts a selection offline with testnodes. It drains them, and reboots them once drained. For each one that gets booted, another from the list is offlined. In this way, it "snakes" through the selected part of the cluster. Our standard buildtools+puppet+yaim system takes care of the provisioning.

Part 2 – Post Snakey

Another script, post_snakey.pl, tells if the nodes have been rebooted by snakey, and if they pass the testnodes test script. Any that do are put back on , so they come online. The scripts have some safety locks to stop havoc breaking out. They usually just stop if anything weird is seen.

Part 3 – Source Code

You've seen all the nice blurb, so here's the source code. I've had to fix it up because HTML knackers the "<", ">" and "&" chars - I hope I haven't broken it.

Note: not the cleanest code I've ever written, but it gets the job done.

Good luck!


----- snakey.pl ----------------------
#!/usr/bin/perl

use strict;
use Fcntl ':flock';
use Getopt::Long;

sub initParams();

my %parameter;

initParams();

my @nodesToDo;

open(NODES,"$parameter{'NODES'}") or die("Cannot open file of nodes to reboot, $!\n");
while() {
  chomp($_);
  push(@nodesToDo,$_);
}
close(NODES);

checkOk(@nodesToDo);

my @selection = selectSome($parameter{'SLICE'});
foreach my $n(@selection) {
  print "Putting $n offline\n";
  putOffline($n);
}

while( $#selection > -1) {

  my $drainedNode = '';
  while($drainedNode eq '') {
    sleep( 600 );
    $drainedNode = checkIfOneHasDrained(@selection);
  }
 
  @selection = remove($drainedNode,@selection);

  print("Rebooting $drainedNode\n");
  my $status = rebootNode($drainedNode);
  print("status -- $status\n");

  my @nextOne = selectSome(1);
  if ($#nextOne == 0) {
    my $nextOne = $nextOne[0];
    print "Putting $nextOne offline\n";
    putOffline($nextOne);
    push(@selection,$nextOne);
  }
}
#-----------------------------------------
sub putOffline() {
  my $node = shift();
  open(TN,"/root/scripts/testnodes-exemptions.txt") or die("Could not open testnodes.exemptions.txt, $!\n");
  while() {
    my $l = $_;
    chomp($l);
    $l =~ s/#.*//;
    $l =~ s/\s*//g;
    if ($node =~ /^$l$/) {
      print ("Node $node is already in testnodes-exemptions.txt\n");
      return;
    }
  }
  close(TN);
  open(TN,">>/root/scripts/testnodes-exemptions.txt") or die("Could not open testnodes.exemptions.txt, $!\n");
  flock(TN, LOCK_EX) or die "Could not lock /root/scripts/testnodes-exemptions.txt, $!";
  print (TN "$node # snakey.pl put this offline " . time() . "\n");
  close(TN) or die "Could not write /root/scripts/testnodes-exemptions.txt, $!";
}
#-----------------------------------------
sub remove() {
  my $drained = shift();
  my @poolOfNodes = @_;

  my @newSelection = ();
  foreach my $n (@poolOfNodes) {
    if ($n !~ /$drained/) {
      push(@newSelection,$n);
    }
  }
  die("None removed\n") unless($#newSelection == ($#poolOfNodes -1));
  return @newSelection;
}

#-----------------------------------------
sub checkIfOneHasDrained(@) {
  my @nodesToCheck = @_;
  foreach my $n (@nodesToCheck) {
    my $hadReport = 0;
    my $state = "";
    my $jobCount = 0;

    open(PBSNODES,"pbsnodes $n|");
    while() {
      my $l = $_;
      chomp($l);
      if ($l =~ /state = (.*)/) {
        $state = $1;
        $hadReport = 1;
      }
      if (/jobs = (.*)/) {
        my $jobs = $1;
        my @jobs = split(/,/,$jobs);
        $jobCount = $#jobs + 1;
      }
    }
    close(PBSNODES);
   
    print("Result of check on $n: hadReport - $hadReport, state - $state, jobCount - $jobCount\n");
    if (($hadReport) && ($state eq 'offline') && ($jobCount ==0)) {
      return $n;
    }
  }
  return "";
}

#-----------------------------------------
sub selectSome($) {
  my $max = shift;
  my @some = ();
  for (my $ii = 0; $ii < $max; $ii++) {
    if (defined($nodesToDo[0])) {
      push(@some,shift(@nodesToDo));
    }
  }
  return @some;
}

#-----------------------------------------
sub checkOk(){
  my @nodes = @_;
 
  foreach my $n (@nodes) {
    my $actualNode = 0;
    my $state      = "";
    open(PBSNODES,"pbsnodes $n|") or die("Could not run pbsnodes, $!\n");
    while() {
      if (/state = (.*)/) {
        $state = $1;
        $actualNode = 1;
      }
    }
    close(PBSNODES);
    if (! $actualNode) {
      die("Node $n was not an actual one!\n");
    }
    if ($state =~ /offline/) {
      die ("Node $n was already offline!\n");
    }
  }
  return @nodes;
}

#-----------------------------------------
sub initParams() {

  GetOptions ('h|help'       =>   \$parameter{'HELP'},
              'n:s'          =>   \$parameter{'NODES'} ,
              's:i'          =>   \$parameter{'SLICE'} ,
              );

  if (defined($parameter{'HELP'})) {
    print <
Abstract: A tool to drain and boot a bunch of nodes

  -h  --help                  Prints this help page
  -n                 nodes    File of nodes to boot
  -s                 slice    Size of slice to offline at once

TEXT
    exit(0);
  }

  if (!defined($parameter{'SLICE'})) {
    $parameter{'SLICE'} = 5;
  }

  if (!defined($parameter{'NODES'})) {
    die("Please give a file of nodes to reboot\n");
  }

  if (! -s  $parameter{'NODES'} ) {
    die("Please give a real file of nodes to reboot\n");
  }
}
#-----------------------------------------
sub rebootNode($) {
  my $nodeToBoot = shift();
  my $nodeToCheck = $nodeToBoot;
  my $pbsnodesWorked = 0;
  my $hasJobs        = 0;
  open(PBSNODES,"pbsnodes $nodeToCheck|");
  while()  {
    if (/state =/) {
      $pbsnodesWorked = 1;
    }
    if (/^\s*jobs = /) {
      $hasJobs = 1;
    }
  }
  close(PBSNODES);
  if (! $pbsnodesWorked) { return 0; }
  if (  $hasJobs       ) { return 0; }

  open(REBOOT,"ssh -o StrictHostKeyChecking=no -o BatchMode=yes -o ConnectTimeout=10 $nodeToBoot reboot|");
  while() {
    print;
  }
  return 1;
}

----- post-snakey.pl ----------------------

#!/usr/bin/perl

use strict;
use Fcntl ':flock';
use Getopt::Long;

my %offlineTimes;

while ( 1 ) {
  %offlineTimes = getOfflineTimes();
  my @a=keys(%offlineTimes);
  my $count = $#a;

  if ($count == -1 ) {
    print("No work to do\n");
    exit(0);
  }
 
  foreach my $n (keys(%offlineTimes)) {
 
    my $uptime = -1;
    open(B,"ssh -o ConnectTimeout=2 -o BatchMode=yes $n cat /proc/uptime 2>&1|");
    while() {
      if (/([0-9\.]+)\s+[0-9\.]+/) {
        $uptime = $1;
      }
    }
    close(B);
    if ($uptime == -1) {
      print("Refusing to remove $n because it may not have been rebooted\n");
    }
    else {
      my $offlineTime = $offlineTimes{$n};
      my $timeNow = time();
      if ($timeNow - $uptime <= $offlineTime ) {
        print("Refusing to remove $n. ");
        printf("Last reboot - %6.3f  days ago. ", $uptime / 24 / 60 /60);
        printf("Offlined    - %6.3f  days ago.\n", ($timeNow - $offlineTime)  / 24 / 60 /60);
      }
      else {
        print("$n has been rebooted\n");
        open(B,"ssh -o ConnectTimeout=2 -o BatchMode=yes $n ./testnode.sh|");
        while() { }
        close(B);
        my $status = $? >> 8;
        if ($status == 0) {
          print("$n passes testnode.sh; will remove from exemptions\n");
          removeFromExemptions($n);
        }
        else {
          print("$n is not passing testnode.sh - $status\n");
        }
      }
    }
  }
  sleep 567;
}

#-----------------------------------------
sub getOfflineTimes() {
  my %offlineTimes = ();
  open(TN,"
  while() {
    if (/(\S+)\s+\# snakey.pl put this offline (\d+)/) {
      $offlineTimes{$1} = $2;
    }
  }
  close(TN);
  return %offlineTimes;
}

#-----------------------------------------
sub removeFromExemptions($) {

  my $node = shift();

  open(TN,"
  my @lines = ;
  close( TN );
  open(TN,">/root/scripts/testnodes-exemptions.txt") or die("Could not open testnodes.exemptions.txt, $!\n");
  flock(TN, LOCK_EX) or die "Could not lock /root/scripts/testnodes-exemptions.txt, $!";
  foreach my $line ( @lines ) {
    print TN $line unless ( $line =~ m/$node/ );
  }
  close(TN) or die "Could not write /root/scripts/testnodes-exemptions.txt, $!";
}

No comments: