Friday 28 November 2008

Nagios Checker

Nagios Checker is a Firefox plugin that can replace Nagios email alerts with colourful, blinking and possibly noisy icons at the bottom of your Firefox window.

The icon expands into a list of the problematic hosts when the cursor hovers over it and, with the right permissions, clicking on a host takes you to its Nagios page.

https://addons.mozilla.org/en-US/firefox/addon/3607

To configure it to read the NorthGrid Nagios:

* Go to settings (right-click on the Nagios icon in Firefox)

* Click Add new

* In the General tab:
** Name: whatever you like
** Web URL: https://niels004.tier2.hep.manchester.ac.uk/nagios
** Tick the nagios older than 2.0 box
** User name: your DN
** Status script URL: https://niels004.tier2.hep.manchester.ac.uk/nagios/cgi-bin/status.cgi
** Click Ok

If you want to see only your site's machines:

* Go to the filters tab and
** Tick the 'Hosts matching regular expressions' box
** Insert your domain name in the text box (see the example below)
** Tick the reverse expression box.
** Click ok
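
For example, for the Manchester machines the expression would simply be the site domain (any regular expression that matches your host names will do; the domain below is my guess at the right one, so adjust it to your own site):

tier2\.hep\.manchester\.ac\.uk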

The rest you can adjust as you please; I removed the sounds and set a 3600-second refresh interval.
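
If the plugin shows nothing, it can be worth checking by hand that the status script URL answers at all. A hedged example with curl, assuming the server authenticates you with your grid certificate (the certificate paths, CA path and CGI parameters below are assumptions, not part of the Nagios Checker setup):

# Fetch the status page directly to check the URL and your credentials
curl --capath /etc/grid-security/certificates \
     --cert ~/.globus/usercert.pem --key ~/.globus/userkey.pem \
     "https://niels004.tier2.hep.manchester.ac.uk/nagios/cgi-bin/status.cgi?hostgroup=all&style=hostdetail"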

Drawbacks if you add other Nagios end-points:

* Settings are applied to all of them
* If the host names in your other Nagios instances have a different domain name (or no domain at all), they don't get filtered.

Perhaps another method might be needed. Investigating.

Manchester and LHCb

Manchester is now officially part of LHCb and all their CPU hours will have full weight!! Yuppieee!! :)

Thursday 27 November 2008

Black hole detection

Finding black hole nodes is always a pain... However, the PBS accounting records can be of help. A simple script that counts the number of jobs each node swallows makes some difference:

http://www.sysadmin.hep.ac.uk/svn/fabric-management/torque/jobs/black-holes-finder.sh

I post it just in case other people are interested.
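
For anyone who just wants the gist, the script boils down to something like the sketch below (this is a simplified rewrite, not the script itself; the accounting directory is the usual torque default and may differ on your server):

#!/bin/bash
# Count how many jobs each node finished today according to the torque
# accounting records; a node eating far more jobs than its neighbours is
# probably a black hole.
ACCTDIR=/var/spool/pbs/server_priv/accounting   # adjust to your torque install
ACCTFILE=$(date +%Y%m%d)
echo "Using accounting file $ACCTFILE"
grep ';E;' "$ACCTDIR/$ACCTFILE" \
  | sed -n 's/.*exec_host=\([^/]*\).*/\1/p' \
  | sort | uniq -c \
  | awk '{print $2": "$1}' \
  | sort -t: -k2 -n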

An example of the output:

# black-holes-finder.sh

Using accounting file 20081127
[...]
bohr5029: 1330
bohr5030: 1803


Clearly the two nodes above have a problem.

MPI enabled

I enabled MPI in Manchester using YAIM and the recipe from Stephen Childs found at the links below:

http://www.grid.ie/mpi/wiki/YaimConfig

http://www.grid.ie/mpi/wiki/SiteConfig

Caveats:

1) The documentation will probably move to the official YAIM pages.

2) The location of the GIP files is now under /opt/glite, not /opt/lcg.

3) The scripts DO interfere with the current setup on the WNs if run on their own, so you need to reconfigure the whole node (I made the mistake of running only MPI_WN). On the CE, instead, it's enough to run MPI_CE.

4) The MPI_SUBMIT_FILTER variable in site-info.def is not documented (yet). It enables the section of the scripts that rewrites the torque submit filter so that the correct number of CPUs is allocated (see the sketch after this list).

5) YAIM doesn't publish MPICH (yet?), so I had to add the following lines

GlueHostApplicationSoftwareRunTimeEnvironment: MPICH
GlueHostApplicationSoftwareRunTimeEnvironment: MPICH-1.2.7

to /opt/glite/etc/gip/ldif/static-file-Cluster.ldif manually.
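
For reference, the MPI block in site-info.def ends up looking roughly like the sketch below. Apart from MPI_SUBMIT_FILTER, which is the variable mentioned in caveat 4, the variable names are taken from the wiki pages linked above and the values are only examples, so check everything against the current documentation before copying it:

# MPI section of site-info.def (sketch; values are examples only)
MPI_MPICH_ENABLE="yes"
MPI_MPICH_PATH="/opt/mpich-1.2.7p1/"
MPI_MPICH_VERSION="1.2.7"
MPI_SHARED_HOME="no"
MPI_SSH_HOST_BASED_AUTH="yes"
# Not yet documented (caveat 4): rewrites the torque submit filter so the
# right number of CPUs gets allocated to MPI jobs
MPI_SUBMIT_FILTER="yes"

Then reconfigure the whole WN and the CE with YAIM as noted in caveat 3 (the exact node type list depends on your installation).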

Tuesday 25 November 2008

Regional nagios update

I reinstalled the regional Nagios with Nagios 3 and it works now.

https://niels004.tier2.hep.manchester.ac.uk/nagios

As suggested by Steve, I'm also trying the Nagios Checker plugin

https://addons.mozilla.org/en-US/firefox/addon/3607

instead of email notifications, but I still have to configure things properly. At the moment Firefox makes some noise every ~30 seconds, and there is also a visual alert in the bottom right corner of the Firefox window with the number of services in a critical state, which expands to show the services when the cursor points at it. Really nice. :)

Thursday 20 November 2008

WMS talk

I gave a talk about the WMS for the benefit of the Manchester users. It might be of interest to other people.

The talk can be found here:

WMS Overview

Monday 10 November 2008

DPM "File Not Found"- but it's right there!

Lancaster's been having a bad run with ATLAS jobs over the last few weeks. We've been failing jobs with error messages like:
"/dpm/lancs.ac.uk/home/atlas/atlasproddisk/panda/panda.60928294-7fd4-4007-af5e-4cdc3c8934d3_dis18552080/HITS.024853._00326.pool.root.2 : No such file or directory (error 2 on fal-pygrid-35.lancs.ac.uk)"

However, when we break out the great DPM tools and track the file down to disk, it's right where it should be, with the correct permissions, size and age, not even attempting to hide. The log files show nothing directly exciting, although there is a lot of deleting going on. Ganglia comes up with something a little more interesting on the DPM head node: heavy use of swap memory and erratic, high CPU use. A restart of the dpm and dpns services seems to have calmed things somewhat this morning, but memory usage is climbing pretty fast:

http://fal-pygrid-17.lancs.ac.uk:8123/ganglia/?r=day&sg=&c=LCG-ServiceNodes&h=fal-pygrid-30.lancs.ac.uk

The next step is for me to go shopping for RAM: the head node is a sturdy box but only has 2GB, and an upgrade to 4 should give us more breathing space. But the real badness is that all this swapping should merely decrease performance, not lead to the situation we have, where the DPM databases, under high load, seem to return false negatives to queries about files, telling us they don't exist when they're right there on disk where they should be.
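
For completeness, the checks described above amount to something like the following (a hedged sketch: the file name is a placeholder, and the command and service names are the usual DPM defaults, which may differ on other installs):

# Is the file still registered in the DPM namespace?
dpns-ls -l /dpm/lancs.ac.uk/home/atlas/atlasproddisk/panda/<problem file>
# How bad are memory and swap on the head node?
free -m
# Restart the DPM and name server daemons if they look wedged
service dpm restart
service dpnsdaemon restart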

Friday 7 November 2008

Regional nagios

I installed a regional Nagios yesterday. It turns out to be actually quite easy, and the Nagios group was quite helpful. I followed the tutorial given by Steve at EGEE08:

https://twiki.cern.ch/twiki/bin/view/EGEE/GridMonitoringNcgYaimTutorial

I updated it as I went along instead of writing a parallel document.

Below is the URL of the test installation. It might get reinstalled a few times in the next few days to test other features.

https://niels004.tier2.hep.manchester.ac.uk/nagios