Tuesday, 24 March 2009

Replaced NFS servers

The NFS servers in Manchester have been replaced with two more powerful machines, each with two 1TB RAIDed SATA disks. This should put a stop to the space problems we have suffered in the past few months with both atlas and lhcb, and should also allow us to keep a few more releases than before.

We also have nice nagios graphs to monitor the space now, as well as cfengine alerts.

http://tinyurl.com/d5n7eo

Thursday, 5 March 2009

Machine room update

After some sweet-talking we managed to get two extra air-con units installed in our old machine room. This room houses our 2005 cluster and our more recent CPU and storage purchased last year. The extra cooling was noticeable and allowed us to switch on a couple of racks which were otherwise offline.


In other news, the new data centre is coming along nicely and will be ready for handover in 3-4 months from now. If you're ever racing past Lancaster on the M6 you'll get a good view of the Borg mothership on the hill; the sleek black cladding is going up now...

Friday, 13 February 2009

This week's DPM troubles at Lancaster.

We've had an interesting time this week in Lancaster: a tale of Gremlins, Greedy Daemons and Magical Faeries who come in the night and fix your DPM problems.

On Tuesday evening, after we'd all gone home for the night, the DPM srmv1 daemon (and to a lesser extent the srmv2.2 and dpm daemons) started gobbling up system resources, sending our headnode into a swapping frenzy. There are known memory leak problems in the DPM code, and we've been victims of them before, but in those instances we've always been saved by a swift restart of the affected services and the worst that happened was a sluggish DPM. This time the DPM services completely froze up, and around 7 pm we started failing tests.

So coming into this disaster on Wednesday morning we leaped into action. Restarting the services fixed the load on the headnode, but the DPM still wouldn't work. Checking the logs showed that all requests were being queued, apparently forever. The trail led to some error messages in mysqld.log:

090211 12:05:37 [ERROR] /usr/libexec/mysqld: Lock wait timeout exceeded;
try restarting transaction
090211 12:05:37 [ERROR] /usr/libexec/mysqld: Sort aborted

The oracle Google indicated that these kinds of errors are typical of a mysql server left in a bad state after suddenly losing connection to a client without accounting for it. Various restarts, reboots and threats were tried, but nothing would get the DPM working and we had to go into downtime.

Rather than dive blindly into the bowels of the DPM mysql backend we got in contact with the DPM developers on the DPM support list. They were really quick to respond, and after receiving 40MB of (zipped!) log files from us they set to work developing a strategy to fix us. It appears that our mysql database had grown much larger than it should have, "bloating" with historical data, which contributed to it getting into a bad state and made the task of repairing the database harder, partly because we couldn't simply restore from backups as these too would be "bloated".

After a while of bashing our heads, scouring logs and waiting for news from the DPM chaps, we decided to make use of the downtime and upgrade the RAM on our headnode from 2 GB to 4, a task we had been saving for the scheduled downtime when we finally upgrade to the Holy Grail that is DPM 1.7.X. So we slapped in the RAM, brought the machine up cleanly, and left it.

A bit over an hour after it came up from the upgrade the headnode started working again. As if by magic. Nothing notable in the logs; it just started working again. The theory is that the added RAM allowed mysql to chug through a backlog of requests and start working again, but that's just speculation. The DPM chaps are still puzzling over what happened, and our databases are still bloated, but the crisis is over (for now).

So there are 2 morals to this tale:
1) I wouldn't advise running a busy DPM headnode with less than 4GB of RAM; it leads to unpredictable behaviour.
2) If you get stuck in an unscheduled downtime you might as well make use of it to do any pending work; you never know when something magical might happen!

Monday, 9 February 2009

Jobmanager pbsqueue cache locked

Spent last week tracking down a problem where jobs were finishing in the batch system but the jobmanager wasn't recognising this. This meant that jobs never 'completed', which had two major impacts: 1. Steve's test jobs all failed through timeouts, and 2. Atlas production stopped because it looked like the pilots never completed, so no further pilots were sent.

Some serious detective work was undertaken by Maarten and Andrey, and it turned out the pbscache wasn't being updated due to a stale lock file in the ~/.lcgjm/ directory. The lock files can be found with this on the CE:

find /home/*/.lcgjm/pbsqueue.cache.proc.{hold.localhost.*,locked} -mtime +7 -ls

We had 6 users affected (alas, our important ones!), all with lock files dated Dec 22. Apparently the lcgpbs script Helper.pm would produce these whenever hostname returned 'localhost'. Sure enough, on December 22 we had maintenance work with DHCP unavailable, and for a brief period the CE hostname was 'localhost'. Note this is the lcg-CE under glite-3.1. Happy days are here again!
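If you hit the same issue, the cleanup can be sketched in a couple of finds. This demo runs against a throwaway directory so it's safe to try anywhere; on the real CE, BASE would be /home, and you'd use the dry-run listing before deleting anything:

```shell
# Demo cleanup of stale pbsqueue cache lock files older than 7 days.
# BASE is a throwaway directory here; on a real CE it would be /home.
BASE=$(mktemp -d)
mkdir -p "$BASE/user1/.lcgjm"
LOCK="$BASE/user1/.lcgjm/pbsqueue.cache.proc.locked"
# Create a lock file and backdate it so it looks well over 7 days old
touch -t 200901010000 "$LOCK"

# Dry run: list stale locks first (mirrors the find command above)
find "$BASE"/*/.lcgjm/ -name 'pbsqueue.cache.proc.*' -mtime +7 -ls
# Then remove them
find "$BASE"/*/.lcgjm/ -name 'pbsqueue.cache.proc.*' -mtime +7 -delete
echo "remaining stale locks: $(find "$BASE" -name 'pbsqueue.cache.proc.*' | wc -l)"
```

The `-mtime +7` threshold matches the find above; a lock from an hours-old hiccup is left alone, while anything that has sat there for a week is clearly stale.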

Wednesday, 17 December 2008

Manchester General Changes

* Enabled pilot users for Atlas and Lhcb. Currently Lhcb is running a lot of jobs, and although most are production many are from their generic users. Atlas instead seems to have almost disappeared.

* Enabled NGS VO and passed the first tests. Currently in the conformance test week.

* Enabled one shared queue and completely phased out the VO queues. This required a transition period for some VOs to give them time to clear the jobs from the old queue and/or to reconfigure their tools. It has greatly simplified the maintenance.

* Installed a top-level BDII and reconfigured the nodes to query the local top level BDII instead of the RAL one. This was actually quite easy and we should have done it earlier.

* Cleaned up old parts of cfengine that were causing the servers to be overloaded, not serve the nodes correctly, and fire off thousands of emails a day. Mostly this was due to an overlap in the way cfexecd was run, both as a cron job and as a daemon. We also increased the TimeOut and SplayTime values and explicitly introduced the schedule parameter in cfagent.conf. Since then cfengine hasn't had any more problems.

* Increased usage of YAIM local/post functions to apply local overrides or minor corrections to the yaim defaults. Compared to inserting the changes via cfengine this method has the benefit of being integrated and predictable: when we run yaim the changes are applied immediately and don't get overridden.

* New storage: our room is full, and when the clusters are loaded we hit the power/cooling limit and risk drawing power from other rooms. Because of this the CC people don't want us to switch on new equipment without switching off some old kit, to maintain the balance in power consumption. So eventually we bought 96 TB of raw space to get us going. The kit arrived yesterday and needs to be installed in the rack we have, and power measurements need to be taken to avoid switching off more nodes than necessary. Luckily it will not be many anyway (taking the nominal value on the back of the new machines it would be 8 nodes, but with better power measurements it could be as few as 4) because the new machines consume much less than the DELL nodes, which are now 4 years old. However, buying new CPUs/storage cannot be done without switching off a significant fraction of the current CPUs before switching on the new kit, and requires working in tight cooperation with the CCS people, which has now been agreed after a meeting I had last week with them and their management.
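For reference, the cfengine 2 scheduling knobs mentioned above live in the control section of cfagent.conf. A minimal illustrative fragment (the values here are examples, not our production settings):

```
# cfagent.conf (cfengine 2) -- illustrative values only
control:
   SplayTime = ( 15 )                 # spread agent runs over up to 15 minutes
   schedule  = ( Min00_05 Min30_35 )  # cfexecd runs the agent twice an hour
```

With cfexecd running only as a daemon (not also from cron), the schedule and splay between them stop the herd of nodes hitting the server at the same instant.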

Tuesday, 16 December 2008

Phasing out VO queues

I've started to phase out VO queues and to create shared VO queues. The plan is eventually to have 4 queues called, with a leap of imagination, short, medium, long and test, with the following characteristics:

test: 3h/4h; all local VOs
short: 6h/12h; ops,lhcbsgm,atlasgm
medium: 12h/24h; all VOs and roles but those in the short queue, and production
long: 24h/48h; all VOs and roles but those that can access the short queue

Installing the queues, adding the group ACLs and publishing them is not difficult: YAIM (glite-yaim-core-4.0.4-1, glite-yaim-lcg-ce-4.0.4-2 or higher) can do it for you. Otherwise it can be done by hand, which is still easy but harder to maintain (the risk of overriding is always high, and the files need to be maintained in cfengine or cvs or similar).

The problem for me is that this scheme works only if the users select the correct ACLs and a queue with the right length for their jobs in their JDL. If they don't, the queue chosen by the WMS is random, with a high probability of jobs failing because they end up in a queue that is too short or in a queue that doesn't have the right ACLs. So I'm not sure it's really a good idea, even though it is much easier to maintain and allows slightly more sophisticated setups.

Anyway, if you do it with YAIM all you have to do is add the queue to

QUEUES="my-new-queue other-queues"

add the right VO/FQAN to the new queue's _GROUP_ENABLE variable (remember to convert . and - into _):

MY_NEW_QUEUE_GROUP_ENABLE="atlas /atlas/ROLE=pilot other-vos-or-fqans"

The syntax of GROUP_ENABLE has to be the same as the one you have used in groups.conf (see previous post http://northgrid-tech.blogspot.com/2008/12/groupsconf-syntax.html).

And finally add to site-info.def

FQANVOVIEWS=yes

to enable publishing of the ACL in the GIP.

Rerun YAIM on the CE as normal.

To check everything is ok on the CE

qmgr -c 'p q my-new-queue'

ldapsearch -x -H ldap://MY-CE.MY-DOMAIN:2170 -b GlueCEUniqueID=MY-CE.MY-DOMAIN:2119/jobmanager-lcgpbs-my-new-queue,Mds-Vo-name=resource,o=grid

Among other things, if correctly configured, it should list the GlueCEAccessControlBaseRules for each VO and FQAN you have listed in _GROUP_ENABLE.
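A quick way to pick out just the ACL lines from the ldapsearch output is a grep; shown here against a canned sample (the attribute values are illustrative, not from a real CE) so the pipeline can be tried standalone:

```shell
# Canned sample of ldapsearch output (values illustrative)
cat > ldif.sample <<'EOF'
GlueCEUniqueID: my-ce.example:2119/jobmanager-lcgpbs-my-new-queue
GlueCEAccessControlBaseRule: VO:atlas
GlueCEAccessControlBaseRule: VOMS:/atlas/ROLE=pilot
GlueCEStateStatus: Production
EOF

# Keep only the access-control rules
RULES=$(grep '^GlueCEAccessControlBaseRule:' ldif.sample)
echo "$RULES"
```

On a real CE you would pipe the ldapsearch command above straight into the same grep.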

If a GlueCEAccessControlBaseRule: DENY:FQAN field appears, that's the ACL for the VOViews, not the access to the queue.

Thanks to Steve and Maria for pointing me to the right combination of YAIM packages and for confirming the randomness of WMS matchmaking.

Monday, 15 December 2008

groups.conf syntax

Elena asked about it a few days ago on TB-SUPPORT. Today I investigated a bit further; the result, for glite-yaim-core versions >4.0.4-1, is:

* Even if it still works, the syntax with VO= and GROUP= is obsolete. The new syntax is much simpler as it directly uses the FQANs as reported on the VO cards (if those are maintained).

* The syntax in /opt/glite/yaim/examples/groups.conf.example is correct, and the files in that directory are kept up to date with the correct syntax, although the examples might not be valid as-is.

* Further information can be found either in

/opt/glite/examples/groups.conf.README

or

https://twiki.cern.ch/twiki/bin/view/LCG/YaimGuide400#Group_configuration_in_YAIM

which is worth reviewing periodically for changes.

Monday, 1 December 2008

RFIO tuning for Atlas analysis jobs

A little info about the RFIO settings we've tested at Liverpool.

Atlas analysis jobs running on a site using DPM use POSIX access through the RFIO interface. ROOT (since v5.16 IIRC) has support for RFIO access and uses the buffered access mode READBUF. This allocates a static buffer for files read via RFIO on the client. By default this buffer is 128kB.

Initial tests with this default buffer size showed a low cpu efficiency and a high rate of bandwidth usage, far more than the size of the files being accessed. The buffer size can be altered by creating a file on the client called /etc/shift.conf containing

RFIO IOBUFSIZE XXX

where XXX is the size in bytes. Altering this setting gave the following results

Buffer (MB), CPU (%), Data transferred (GB)
0.125, 60.0, 16.5
1.000, 23.0, 65.5
10.00, 13.5, 174.0
64.00, 62.1, 11.5
128.0, 74.7, 7.5

This was on a test data set with file sizes of ~1.5GB and using athena 14.2.10.

Using buffer sizes of 64MB+ gives gains in efficiency and required bandwidth. A 128MB buffer is a significant chunk of a worker node's RAM, but as the files are not being cached in the linux file cache the ram usage is likely similar to accessing the file from local disk, and the gains are large.
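Setting the 64MB buffer is a one-liner. The sketch below writes to a local demo file so it can be run anywhere; on a real worker node the target would be /etc/shift.conf:

```shell
# 64MB RFIO buffer; written to a demo file here rather than /etc/shift.conf
CONF=./shift.conf.demo
echo "RFIO IOBUFSIZE $((64 * 1024 * 1024))" > "$CONF"
cat "$CONF"
# prints: RFIO IOBUFSIZE 67108864
```

On a batch farm you'd push this file out to every worker node with whatever fabric tool you use (cfengine in our case).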

For comparison the same test was run from a copy of the files on local disk. This gave a cpu efficiency of ~50% but the event rate was ~8 times slower than when using RFIO.

My conclusions are that RFIO buffering is significantly more efficient than standard linux file caching. The default buffer size is insufficient, and increasing it by small amounts actually reduces efficiency. Increasing the buffer to 64-128MB gives big gains without impacting available RAM too much.

My guess is that only a big buffer gives gains because of the random access pattern of the analysis job. Reading in a small chunk, e.g. 1MB, may buffer a whole event, but the next event is unlikely to be in that buffered 1MB, so another 1MB has to be read for the next event. Similarly for 10MB: the amount read each time is 10x as much, but with a less than 10x increase in the probability of the next event being in the buffer. When the buffer reaches 64MB the probability of an event being in the buffered area is high enough to offset the extra data being read in.

Another possibility is that the buffering only buffers the first x MB of the file, so a bigger buffer means more of the file is in RAM and there's a higher probability of the event being in the buffer. Neither of these hypotheses has been investigated further yet.

Large block reads are also more efficient when reading in the data than lots of small random reads. The efficiency effectively becomes 100% if the buffer size is >= the dataset file size; the first reads pull in all of the file, and all reads from then on are from local RAM.

This makes no difference to the impact on the head node for e.g. SURL/TURL requests; it only affects the efficiency of the analysis job accessing the data from the pool nodes and the required bandwidth (our local tests simply used the rfio:///dpm/... path directly). If there are enough jobs there will still be bottlenecks on the network, either at the switch or the pool node. We have given all our pool nodes at least 3Gb/s connectivity to the LAN backbone.

The buffer size setting will give different efficiency gains for different file sizes (i.e. the smaller the file size, the better the efficiency); e.g. the first atlas analysis test had smaller file sizes than our tests and showed much higher efficiencies. The impact of the IOBUFSIZE setting on other VOs' analysis jobs that use RFIO hasn't been tested.

Friday, 28 November 2008

Nagios Checker

Nagios checker is a firefox plugin that can replace nagios email alerts with colourful, blinking and possibly noisy icons at the bottom of your firefox window.

The icon expands into a list of problematic hosts when the cursor hovers over it, and with the right permissions, clicking on a host takes you to that host's nagios page.

https://addons.mozilla.org/en-US/firefox/addon/3607

To configure it to read NorthGrid nagios:

* Go to settings (right-click on the nagios icon on firefox)

* Click Add new

* In the General tab:
** Name: whatever you like
** WEB URL: https://niels004.tier2.hep.manchester.ac.uk/nagios
** Tick the nagios older than 2.0 box
** User name: your DN
** Status script URL: https://niels004.tier2.hep.manchester.ac.uk/nagios/cgi-bin/status.cgi
** Click Ok

If you want only your site machines

* Go to the filters tab and
** Tick the 'Hosts matching regular expressions' box
** Insert your domain name in the text box
** Tick the reverse expression box.
** Click ok

For the rest you can adjust it as you please; I removed the noises and set a 3600 sec refresh interval.

Drawbacks if you add other nagios end-points:

* Settings are applied to all of them
* If the host names in your other nagios have a different domain name (or don't have one at all) they don't get filtered.

Perhaps another method might be needed. Investigating.

Manchester and Lhcb

Manchester is now officially part of Lhcb and all their CPU hours will have full weight!! Yuppieee!! :)

Thursday, 27 November 2008

Black holes detection

Finding black holes is always a pain... However, the pbs accounting records can help. A simple script that counts the number of jobs a node swallows makes some difference:

http://www.sysadmin.hep.ac.uk/svn/fabric-management/torque/jobs/black-holes-finder.sh

I post it just in case other people are interested.

An example of the output:

# black-holes-finder.sh

Using accounting file 20081127
[...]
bohr5029: 1330
bohr5030: 1803


Clearly the two nodes above have a problem.
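The idea behind the script can be sketched in a few lines of shell. The accounting records below are simplified stand-ins, not the exact torque format the real black-holes-finder.sh parses, but the counting logic is the same: a black-hole node "completes" an implausible number of (usually very short) jobs and floats to the top of the list.

```shell
# Count completed ('E' record) jobs per execution host in a torque-style
# accounting file. Sample records are simplified stand-ins.
ACCT=$(mktemp)
cat > "$ACCT" <<'EOF'
11/27/2008 10:00:01;E;1001.ce.example;user=u1 exec_host=bohr5029/0 resources_used.walltime=00:00:03
11/27/2008 10:00:05;E;1002.ce.example;user=u1 exec_host=bohr5029/1 resources_used.walltime=00:00:02
11/27/2008 10:30:00;E;1003.ce.example;user=u2 exec_host=bohr5001/0 resources_used.walltime=04:10:00
EOF

grep ';E;' "$ACCT" \
  | grep -o 'exec_host=[^/ ]*' \
  | cut -d= -f2 \
  | sort | uniq -c | sort -rn \
  | awk '{printf "%s: %s\n", $2, $1}'
```

On a real CE you would point this at the day's file under the torque accounting directory, as the real script does.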

MPI enabled

Enabled MPI in Manchester using YAIM and the recipe from Stephen Childs found at the links below:

http://www.grid.ie/mpi/wiki/YaimConfig

http://www.grid.ie/mpi/wiki/SiteConfig

Caveats:

1) The documentation will probably move to the official YAIM pages

2) The location of the gip files is now under /opt/glite not /opt/lcg

3) The scripts DO interfere with the current setup on the WNs if run on their own, so you need to reconfigure the whole node (I made the mistake of running only MPI_WN). On the CE instead it's enough to run MPI_CE.

4) The MPI_SUBMIT_FILTER variable in site-info.def is not documented (yet). It enables the section of the scripts that rewrites the torque submit_filter script, which allocates the correct number of CPUs.

5) Yaim doesn't publish MPICH (yet?), so I had to add the following lines

GlueHostApplicationSoftwareRunTimeEnvironment: MPICH
GlueHostApplicationSoftwareRunTimeEnvironment: MPICH-1.2.7

to /opt/glite/etc/gip/ldif/static-file-Cluster.ldif manually.

Tuesday, 25 November 2008

Regional nagios update

I reinstalled the regional nagios with Nagios3 and it works now.

https://niels004.tier2.hep.manchester.ac.uk/nagios

As suggested by Steve I'm also trying the nagios checker plugin

https://addons.mozilla.org/en-US/firefox/addon/3607

instead of the email notifications, but I still have to configure things properly. At the moment firefox makes some noise every ~30 seconds, and there is also a visual alert in the bottom right corner of the firefox window with the number of services in a critical state, which expands to show the services when the cursor points at it. Really nice. :)

Thursday, 20 November 2008

WMS talk

I gave a talk about the WMS for the benefit of the Manchester users. It might be of interest to other people.

The talk can be found here:

WMS Overview

Monday, 10 November 2008

DPM "File Not Found"- but it's right there!

Lancaster's been having a bad run with atlas jobs the last few weeks. We've been failing jobs with error messages like:
"/dpm/lancs.ac.uk/home/atlas/atlasproddisk/panda/panda.60928294-7fd4-4007-af5e-4cdc3c8934d3_dis18552080/HITS.024853._00326.pool.root.2 : No such file or directory (error 2 on fal-pygrid-35.lancs.ac.uk)"

However, when we break out the great dpm tools and track this file down to disk, it's right where it should be, with correct permissions, size and age, not even attempting to hide. The log files show nothing directly exciting, although there are a lot of deletes going on. Ganglia comes up with something a little more interesting on the dpm head node: heavy use of swap and erratic, high CPU use. A restart of the dpm and dpns services seems to have calmed things somewhat this morning, but memory usage is climbing pretty fast:

http://fal-pygrid-17.lancs.ac.uk:8123/ganglia/?r=day&sg=&c=LCG-ServiceNodes&h=fal-pygrid-30.lancs.ac.uk

The next step is for me to go shopping for RAM; the headnode is a sturdy box but only has 2GB, and an upgrade to 4 should give us more breathing space. But the real badness is that all this swapping should decrease performance, not lead to the situation we have, where the dpm databases, under high load, seem to return false negatives to queries about files, telling us they don't exist when they're right there on disk where they should be.

Friday, 7 November 2008

Regional nagios

I installed a regional nagios yesterday. It turned out to be actually quite easy, and the nagios group was quite helpful. I followed the tutorial given by Steve at EGEE08

https://twiki.cern.ch/twiki/bin/view/EGEE/GridMonitoringNcgYaimTutorial

I updated the tutorial as I went along instead of writing a parallel document.

Below is the URL of the test installation. It might get reinstalled a few times in the next few days to test other features.

https://niels004.tier2.hep.manchester.ac.uk/nagios

Wednesday, 27 August 2008

DPM in Manchester

Manchester now has a fully working DPM with 6 TB. There are 2 space tokens, ATLASPRODDISK and ATLASDATADISK. The service has been added to the GOC and the Information System, and the space tokens are published. The errors have been corrected and the system has been passing the SAM tests continuously since yesterday.

I added some information to the wiki

https://www.gridpp.ac.uk/wiki/Manchester_DPM#Atlas_Space_tokens
https://www.gridpp.ac.uk/wiki/Manchester_DPM#Errors_along_the_path

Monday, 21 July 2008

DPM space tokens

Below are the first tests of setting up space tokens on the DPM testbed:

https://www.gridpp.ac.uk/wiki/Manchester_DPM

Tuesday, 3 June 2008

Slaughtering ATLAS jobs?

These heavy ion jobs have generated lots of discussion in various forums. They are ATLAS heavy ion simulations (Pb-Pb) which are being killed in two ways: (1) by the site's batch system if the queue walltime limit is reached; (2) by the atlas pilot because the log file modification time hasn't changed in 24 hrs.

Either way, sites shouldn't worry if they see these; ATLAS production is aware. They're only single-event jobs, and you might see massive memory usage too, > 2G/core. :-)

According to Steve, the new WN should allow jobs to gracefully handle batch kills with a suitable delay between SIGTERM and SIGKILL.
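The graceful handling relies on the job catching SIGTERM before the follow-up SIGKILL arrives. A minimal sketch of a job wrapper doing that (the cleanup action here is just an echo, and we send the signal to ourselves to demonstrate the trap firing):

```shell
# Minimal job wrapper: catch SIGTERM from the batch system and clean up
# before the SIGKILL that follows.
CLEANED=0
cleanup() {
  echo "caught SIGTERM, flushing output and cleaning up"
  CLEANED=1
}
trap cleanup TERM

# Simulate the batch system sending SIGTERM to this job
kill -TERM $$
sleep 1            # in a real wrapper this is where the payload would run
echo "cleaned=$CLEANED"
```

SIGKILL cannot be trapped, which is why the delay between the two signals matters: it's the only window the job gets.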

Friday, 23 May 2008

buggy glite-yaim-core

glite-yaim-core versions >4.0.4-1 no longer recognise VO_$vo_VOMSES even if it is set correctly in the vo.d dir. I'm still wondering how the testing is performed; primary functionality, like completion without self-evident errors, seems to be overlooked. Anyway, looking on the bright side... the bug will be fixed in yaim-core 4.0.5-something. In the meantime I had to downgrade

glite-yaim-core to 4.0.3-13
glite-yaim-lcg-ce to 4.0.2-1
lcg-CE to 3.1.5-0

which was the previously working combination.

CE problems with lcas: an update

lcas/lcmaps have the debug level set to 5 by default. Apparently it can be changed by setting appropriate env variables for the globus-lcas-lcmaps interface; the variables are actually foreseen in yaim. The errors can be generated by a mistyped DN when a non-VOMS proxy is used. This is a very easy way to generate a DoS attack.

After downgrading yaim/lcg-CE I've reconfigured the CE and it seems to be working now. I haven't seen any of the debug messages so far.

Thursday, 22 May 2008

globus-gatekeeper weirdness

globus-gatekeeper has started to spit out level 4 lcas/lcmaps messages out of nowhere at 4 o'clock in the morning. The log file reaches a few GB in a few hours and fills /var/log, breaking the CE. I contacted the nikhef people for help but haven't received an answer yet. The documentation is not helpful.

Monday, 19 May 2008

and again

We have upgraded to 1.8. At least we can look at the space manager while waiting for a solution to the cap on the number of jobs and to the replica manager not replicating.

Thursday, 15 May 2008

Still dcache

With Vladimir's help Sergey managed to start the replica manager by changing a java option in replica.batch. That is as far as it goes, because it still doesn't work, i.e. it doesn't produce replicas. We have just given Vladimir access to the testbed.

It seems Chris Brew has been having the same 'cannot run more than 200 jobs' problem since he upgraded. He sent an email to the dcache user-forum. This makes me think that even if the replica manager might help, it will not cure the problem.

Tuesday, 13 May 2008

Manchester dcache troubles (2)

The fnal developers are looking at the replica manager issue. The error lines found in the admindomain log also appear at fnal and don't seem to be a problem there. The search continues...

In the meantime we have doubled the memory of all dcache head nodes.

Monday, 12 May 2008

The data centre in the sky

Sent adverts out this week, trying to shift our ancient DZero farm which was decommissioned a few years ago. It had a glorious past, with many production records set whilst generating DZero's RunII simulation data. Its dual 700MHz PentiumIII CPUs with 1G RAM can't cope with much these days, and it's certainly not worth the manpower keeping them online. Here is the advert if you're interested.

In other news, our MON box system disk spat its dummy over the weekend. This was one of the three gridpp machines; not bad going after 4 years.

Tuesday, 6 May 2008

Manchester SL4 dcache troubles

Since the upgrade to SL4, Manchester has been experiencing problems with dcache.

1) pnfs doesn't seem to take a load beyond 200 atlas jobs (it times out). Alessandra has been unable to replicate the problem production is seeing: even starting 200 clients at the same time on the same file production is using, all she could see was the transfer time increasing from 2 seconds to ~190 seconds, but no timeout. On Saturday, looking at the dashboard, she found 99.8% of ~1790 jobs successfully completed in the last 24 hours, which also seems to contradict the 200-jobs-at-a-time statistic and needs to be explained.

2) The replica manager doesn't work anymore, i.e. it doesn't even start, so no resilience is active. The error is a java InitSQL that, according to the dcache developers, should be caused by the lack of a parameter. We sent them the requested configuration files and they couldn't find anything wrong with them. We have given Greig access to dcache and he couldn't see anything wrong either. A developer suggested moving to a newer version of dcache to solve the problem, which we had already tried, but the new version has a new problem: from the errors it seems that the schema has changed, but we didn't get a reply about this. In this instance the replica manager starts but cannot insert data into the database. The replica manager obviously helps to cut transfer times in half, because there is more than one pool node serving the files (I tested this on the SL3 dcache: 60 concurrent clients take at most 35 sec each instead of 70; if the number of clients increases the effect is smaller but still in the range of 30%). In any case we are talking about a handful of seconds, not the timeout range that production sees.

3) Finally, even if all these problems were solved, Space Manager isn't compatible with Resilience, so pools with space tokens will not have the benefit of duplicates. Alessandra asked 2 months ago what the policy was in case she had to choose. It was agreed that for these initial tests it wasn't a problem.

4) Another problem, specific to Atlas, is that although Manchester has 2 dcache instances they have insisted on using only 1 for quite some time. This has obviously affected production heavily. After a discussion at CERN they finally agreed to split and use both instances, but that hasn't happened yet.

5) This is minor but equally important for Manchester: VOs with DNS-style names are mishandled by dcache YAIM. We will open a ticket.
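The many-clients-on-one-file test in (1) can be sketched as a shell loop. CLIENT_CMD here is a harmless stand-in so the sketch runs anywhere; on the real testbed it would be the actual dcap read:

```shell
# Start N concurrent reads of the same file and time the whole batch.
# CLIENT_CMD is a placeholder; on the real testbed it would be something
# like 'dccp dcap://door.example/pnfs/.../testfile /dev/null'.
N=20
CLIENT_CMD="sleep 1"
start=$(date +%s)
for i in $(seq 1 "$N"); do
  $CLIENT_CMD &
done
wait
end=$(date +%s)
echo "$N concurrent clients finished in $((end - start))s"
```

Because all clients run in parallel, the elapsed time is roughly that of the slowest client, which is the number that matters when looking for timeouts.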

We have applied all the optimizations suggested by the developers, even those not strictly necessary, and nothing has changed. The old dcache instance, without optimizations and with the replica manager working, is taking a load of 300-400 atlas user jobs. According to the local users who use it for their local production, both reading from it and writing into it, they have an almost 100% success rate (last week 7 job failures out of 2000 jobs submitted).

Applied optimizations:

1) Split pnfs from the dcache head node: we can now run 200 production jobs (but then again, as already said, the old dcache can take 400 jobs and its head node isn't split).
2) Apply postgres optimizations: no results.
3) Apply kernel optimizations for networking from CERN: transfers of small files are 30% faster, but that could also be down to a less loaded cluster.

Most of the problems might come from the attempt to maintain the old data, so we will try to install a new dcache instance without it. Although it is not a very sustainable choice, it might help us understand what the problem is.

Wednesday, 9 April 2008

Athena release 14 - new dependency

Athena release 14 has a new dependency on the package 'libgfortran'. Sites with local Atlas users may want to check they have this. The runtime error message is rather difficult to decipher; however, the buildtime error is explicit. I've added the package to the required-packages twiki page.

Monday, 4 February 2008

Power outage

A site-wide power outage occurred at Lancaster Uni this evening. The juice is now flowing, but some intervention is required tomorrow morning before we're back to normal operations.

I hate java!

#$%^&*@!!!!!!

Wednesday, 30 January 2008

Liverpool update

From Mike's reply:

* We'll stay with dcache and are about to rebuild the whole SE (and the whole cluster, including a new multi-core CE) when we shut down for a week soon to install SL4. Everything is under test at present, and we are upgrading the rack software servers to 250GB RAID1 to cope with the 100GB size of the ATLAS code.

* We are still testing Puppet (on our non-LCG cluster) as our preferred solution. It looks fine, but we are not yet sure it will scale to many hundreds of nodes.

Thursday, 20 December 2007

Lancaster's Winter Dcache Dramas

It's been a tough couple of months for Lancaster, with our SE giving us a number of problems.

Our first drama, at the start of the month, was caused by unforeseen complications with our upgrade to dcache 1.8. Knowing that we were low on the support list, being only a Tier 2, but emboldened by the highly useful srm 2.2 workshop in Edinburgh and the good few years we've spent in the dcache trenches, we decided to take the plunge. We then faced a good few days of downtime beyond the one we had scheduled, as we faced down a number of bugs in the early versions of dcache 1.8 (fixed by upgrading to higher patch levels), and then problems caused by changes in the gridftp protocol that highlighted inconsistencies between the users on our pnfs node and our gridftp door node. Due to a hack long ago, several VOs had different user.conf entries, and therefore UIDs, on our door nodes and pnfs node. This never caused problems before, but after the upgrade the doors were passing the uids to the pnfs node, so new files and directories were created with the correct group (as the gids were consistent) but a wrong uid, causing permission troubles whenever a delete was called. This was a classic case of a problem that is hell to diagnose but, once figured out, thankfully easy to solve. Once we fixed that one it was green tests for a while.

Then dcache drama number two came along a week later: a massive postgres db failure on our pnfs node. The postgres database contains all the information that dcache uses to match the fairly anonymously named files on the pool nodes to entries in the pnfs namespace; without it dcache has no idea which files are which, so with it bust the files are almost as good as lost. Which is why it should be backed up regularly. We did this twice daily, or at least we thought we did: a cron problem meant that our backups hadn't been made for a while, and a rollback would mean a fair amount of data might be lost. So we spent 3 days doing arcane sql rituals to try to bring back the database, but it had corrupted itself too heavily and we had to roll back.

The cause of the database crash and corruption was a "wraparound" error. Postgres requires regular "vacuuming" to clean up after itself, otherwise it essentially starts writing over itself. This crash took us by surprise: not only does our postgres look after itself with optimised auto-vacuuming occurring regularly, but during the 1.8 upgrade I had taken the time to do a manual full vacuum, only a week before this crash. Also, postgres is designed to freeze when at risk of a wraparound error rather than overwrite itself, and this didn't happen. The first we heard of it, pnfs and postgres had stopped responding and there were wraparound error messages in the logs; no warning of the impending disaster.
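In hindsight, a simple check on transaction ages might have given us some warning. This is a sketch of such a check, not something we had in place at the time; the threshold is arbitrary and the live psql invocation in the comment assumes psql is runnable as the postgres user on the pnfs node.

```shell
# warn_wraparound reads "datname age" pairs on stdin and flags databases
# whose transaction age is past a threshold (1.5 billion, well short of
# the ~2^31 wraparound limit), meaning a VACUUM is overdue.
warn_wraparound() {
    awk -v limit=1500000000 '$2 > limit { print "WARNING: " $1 " age " $2 }'
}
# Live usage would be something like:
#   psql -U postgres -At -F' ' \
#     -c "SELECT datname, age(datfrozenxid) FROM pg_database;" | warn_wraparound
```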

Luckily the data rollback seems to have not affected the VOs too much. We had one ticket from Atlas, who after we explained our situation to them handily cleaned up their file catalogues. The guys over at dcache hinted at a possible way of rebuilding the lost databases from the pnfs logs, although sadly this isn't simply a case of recreating pnfs related sql entries and they've been too busy with Tier 1 support to look into this further.

Since then we've fixed our backups and added a nagios test to ensure the backups are less than a day old. The biggest trouble here was that reluctance to use an old backup meant we wasted over three days banging our heads trying to revive a dead database, rather than the few hours it would have taken to restore from backup and verify things were working. And it appears the experiments were more affected by us being in downtime than by the loss of easily replicable data. In the end I think I caused more trouble going over the top with my data recovery attempts than if I had been gung-ho and used the old backup once things looked bleak for the remains of the postgres database. At least we've now set things up so the likelihood of it happening again is slim, but the circumstances behind the original database errors are still unknown, which leaves me a little worried.
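The nagios test is roughly of this shape. This is a sketch with a placeholder path; the real check and backup location differ.

```shell
# Nagios-style plugin sketch: OK if the newest file under the backup
# directory is less than a day old, CRITICAL otherwise.
check_backup_age() {
    dir="$1"
    # any file in $dir modified within the last 24 hours?
    if [ -n "$(find "$dir" -type f -mtime -1 2>/dev/null | head -n 1)" ]; then
        echo "BACKUP OK: less than a day old"
        return 0
    else
        echo "BACKUP CRITICAL: nothing newer than one day"
        return 2
    fi
}
# check_backup_age /var/backups/pnfs-db   # hypothetical path
```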

Have a good Winter Festival and Holiday Season everyone- but before you head off to your warm fires and cold beers, check the age of your backups just in case...

Thursday, 6 December 2007

Manchester various

- The core path was set to /tmp/core-various-param in sysctl.conf and was creating a lot of problems for dzero jobs. It was also creating problems for others, as the cores were filling /tmp and consequently maradona errors were looming. The path has been changed back to the default, and I also set core size 0 in limits.conf to prevent the problem repeating itself, to a lesser degree, in /scratch.

- dcache doors were open on the wrong nodes. node_config is the correct one, but it was copied before stopping the dcache-core service and now /etc/init.d/dcache-core stop doesn't have any effect. The doors also have a keep-alive script, so it is not enough to kill the java processes; one has to kill the parents as well.

- cfengine config files are being rewritten to make them less cryptic.
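For reference, the core-dump change in the first item boils down to something like this sketch; the file names are the standard ones, but the exact values are from memory.

```shell
# /etc/sysctl.conf -- back to the default core file name in the job's
# working directory instead of /tmp/core-<params>:
#   kernel.core_pattern = core
# applied without a reboot via: sysctl -p

# /etc/security/limits.conf -- disable core dumps for jobs entirely:
#   *  soft  core  0
#   *  hard  core  0
```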

Monday, 19 November 2007

Manchester black holes for atlas

Atlas jobs failing because of the following errors:
====================================================
All jobs fail because 2 bad nodes fail like
/opt/globus/bin/globus-gass-cache: line 4: globus_source: command not found
/opt/globus/bin/globus-gass-cache: line 6: /globus-gass-cache-util.pl: No such file or directory
/opt/globus/bin/globus-gass-cache: line 6: exec: /globus-gass-cache-util.pl: cannot execute: No such file or directory
/opt/globus/bin/globus-gass-cache: line 4: globus_source: command not found
/opt/globus/bin/globus-gass-cache: line 6: /globus-gass-cache-util.pl: No such file or directory
/opt/globus/bin/globus-gass-cache: line 6: exec: /globus-gass-cache-util.pl: cannot execute: No such file or directory
submit-helper script running on host bohr1428 gave error: could not add entry in the local gass cache for stdout
===================================================
Problem caused by

${GLOBUS_LOCATION}/libexec/globus-script-initializer

being empty.

Tuesday, 13 November 2007

Some SAM tests don't respect downtime

Sheffield is shown failing the CE-host-cert-valid test while in downtime. SAM tests should all behave the same. This is on top of the very confusing display of the results on alternate lines. I opened a ticket.

https://gus.fzk.de/ws/ticket_info.php?ticket=28983

Sunday, 11 November 2007

Manchester sw repository reorganised

To simplify the maintenance of multiple releases and architectures I have reorganised the software (yum) repository in Manchester.

Whereas before we had to maintain a yum.conf for each release and architecture, now we just need to add links in the right place. I wrote a recipe on my favourite site:

http://www.sysadmin.hep.ac.uk/wiki/Yum

This will also allow us to remove the complications introduced in the cfengine conf files to maintain multiple yum.conf versions.

Friday, 9 November 2007

1,000,000th job has passed by

This week the batch system ticked over its millionth job. The lucky user was biomed005, and no, it was nothing to do with rsa768. The 0th job was way back in August 2005 when we replaced the old batch system with torque. How many of these million were successful? I shudder to think, but I'm sure it's improving :-)

In other news, we're having big problems with our campus firewall: it blocks outgoing ports 80 and 443 to ensure that traffic passes through the university proxy server. Unfortunately some web clients such as wget and curl make it impossible to use the proxy for these ports whilst bypassing the proxy for all other ports. Atlas needs this for the new PANDA pilot job framework. We installed a squid proxy of our own (good idea Graeme) which allows for greater control. No luck with handling https traffic, so we really need to get a hole punched in the campus firewall. I'm confident the uni systems guys will oblige ;-)
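For anyone in the same boat, the crux is that wget and curl honour the standard proxy environment variables, but the bypass list is per-host, not per-port, which is exactly the limitation we ran into. The hostname and port below are placeholders, not our real squid.

```shell
# Point web clients at the local squid (host and port are hypothetical).
export http_proxy="http://squid.example.ac.uk:3128"
export https_proxy="$http_proxy"
# Hosts that should bypass the proxy -- note this is per-host only,
# there is no way to say "proxy ports 80/443 but go direct elsewhere".
export no_proxy="localhost,.example.ac.uk"
```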

Sheffield in downtime

Sheffield has been put in downtime until Monday 12/11/2007 at 5 pm.


Reason: power cut affecting much of central Sheffield. A substation exploded. Not even allowed inside the physics building.

Matt is also back in the GOCDB now as site admin.

Saturday, 3 November 2007

Manchester CEs and RGMA problems

We still don't know what happened to ce02, why the ops tests didn't work, and why my jobs hung forever while anybody else could run (atlas claims 88 percent efficiency in the last 24 hours). Anyway, I updated ce02 manually (rpm -ihv) to the same set of rpms that are on ce01, and now the problem I had, globus hanging, has disappeared. The ops tests are successful again and we got out of the atlas blacklisting. I also fixed ce01, which yesterday picked up the wrong java version. I need to change a couple of things on the kickstart server so that these incidents don't happen again.

Also I had to manually kill tomcat which was not responding and restart it on the MON box. Accounting published successfully after this.

Friday, 2 November 2007

Sheffield accounting

From Matt:

/opt/glite/bin/apel-pbs-log-parser
is trying to contact the ce on port 2170, I think expecting the site bdii to be there.
I changed ce_node to mon_node in /opt/glite/etc/glite-apel-pbs/parser-config-yaim.xml
and now things seem much improved.

However, I am getting this

Fri Nov 2 13:48:39 UTC 2007: apel-publisher - Record/s found: 8539
Fri Nov 2 13:48:39 UTC 2007: apel-publisher - Checking Archiver is Online
Fri Nov 2 13:49:40 UTC 2007: apel-publisher - Unable to retrieve any response while querying the GOC
Fri Nov 2 13:49:40 UTC 2007: apel-publisher - Archiver Not Responding: Please inform apel-support@listserv.cclrc.ac.uk
Fri Nov 2 13:49:40 UTC 2007: apel-publisher - WARNING - Received a 'null' result set while querying the 'LcgRecords' table using rgma, this probably means the GOC is currently off-line, will therefore cancel attempt to re-publish

running /opt/glite/bin/apel-publisher on the mon box.

If the GOC machine really is off-line, I'll have to wait to publish the missing data for Sheffield.

Manchester SL4

Not big news for other sites, but I have installed an SL4 UI in Manchester. Still 32bit, because the UIs at the Tier2 are old machines. However I'd like to express my relief that, once the missing third-party rpms were in place, the installation went smoothly.

After some struggling with cfengine keys, which I was despairing of solving by the end of the evening, I managed to also install a 32bit WN. At least cfengine doesn't give any more errors and runs happily.

Tackling dcache now and the new yaim structure.

Thursday, 1 November 2007

Sheffield

Quiet night for Sheffield after the reimaged nodes were taken offline in PBS. Matt also increased the number of ssh connections allowed on the CE from 10 to 100, to reduce the timeouts between the WNs and the CE and reduce the incidence of Maradona errors.
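The post doesn't name the directive, but the 10-to-100 change matches sshd's MaxStartups limit (whose default is 10), so the edit presumably looked like this sketch:

```shell
# /etc/ssh/sshd_config on the CE (assumed directive; 10 is the default
# cap on concurrent unauthenticated connections):
#   MaxStartups 100
# then restart sshd to pick up the change:
#   service sshd restart
```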

Wednesday, 31 October 2007

Manchester hat tricks

Manchester CE ce02 has been blacklisted by atlas since yesterday because it fails the ops tests; it is therefore also failing Steve Lloyd's tests and has availability 0. However there is no apparent reason why these tests should fail. Besides, ce02 is doing some magic: there were 576 jobs running from 5 different VOs when I started writing this, among them atlas production jobs, and now, 12 hours later, there are 1128. I'm baffled.

Tuesday, 30 October 2007

Regional VOs

vo.northgrid.ac.uk
vo.southgrid.ac.uk

both created; no users in them yet. We probably need to enable them at sites to get more progress.
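Enabling a dns-style VO at a site means adding YAIM variables of roughly this shape. This is an illustrative sketch only: the software dir, default SE and VOMS server values are placeholders, not the real northgrid settings.

```shell
# site-info.def fragment (illustrative values; for dns-style VO names,
# dots become underscores in the variable names):
#   VOS="$VOS vo.northgrid.ac.uk"
#   VO_VO_NORTHGRID_AC_UK_SW_DIR=$VO_SW_DIR/northgrid
#   VO_VO_NORTHGRID_AC_UK_DEFAULT_SE=$SE_HOST
#   VO_VO_NORTHGRID_AC_UK_VOMS_SERVERS="vomss://voms.gridpp.ac.uk:8443/voms/vo.northgrid.ac.uk"
```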

user certificates: p12 to pem

Since I was renewing my certificate, I added a small script (p12topem.sh) to the subversion repository to convert users' p12 certificates into pem format and set their unix permissions correctly. I linked it from here:

https://www.sysadmin.hep.ac.uk/wiki/CA_Certificates_Maintenance

It assumes $HOME/.globus/user*.pem names. It therefore doesn't handle host certificates, but could easily be extended.
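The conversion itself boils down to two openssl calls plus permissions. This is a sketch of what such a script might do, not the actual p12topem.sh: the function name, the password handling via $P12PASS and the paths are all assumptions (the real script presumably prompts interactively).

```shell
#!/bin/sh
# Sketch: extract usercert.pem/userkey.pem from a PKCS#12 bundle and set
# the permissions the grid tools expect. Password comes from $P12PASS.
p12topem() {
    p12="$1"
    dir="${2:-$HOME/.globus}"
    mkdir -p "$dir"
    # certificate: public, world-readable
    openssl pkcs12 -in "$p12" -clcerts -nokeys \
        -passin env:P12PASS -out "$dir/usercert.pem" || return 1
    chmod 644 "$dir/usercert.pem"
    # private key: owner-readable only (kept encrypted with the same password)
    openssl pkcs12 -in "$p12" -nocerts \
        -passin env:P12PASS -passout env:P12PASS -out "$dir/userkey.pem" || return 1
    chmod 400 "$dir/userkey.pem"
}
```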

Monday, 29 October 2007

Links to monitoring pages update

I added three links to the FCR, one per experiment, with all the UK sites selected. This should hopefully make it easier to find out who has been blacklisted.

http://www.gridpp.ac.uk/wiki/Links_Monitoring_pages


I also added a GridMap link and linked Steve's monitoring, both as generic dteam and atlas, plus the quarterly summary plots.

Friday, 19 October 2007

Sheffield latest

Trying to stabilize the Sheffield cluster.
After the scheduled power outage the nodes didn't restart properly and some of the old jobs needed to be cleaned up. After that the cluster was ok apart from the BDII dropping out. We have applied the famous Kostas patch

https://savannah.cern.ch/bugs/?16625


which is finally getting into the release after 1.5 years. Hurray!!!

The stability of the BDII has improved and DPM seems stable. The SAM tests were stable over the weekend, and today Steve's Atlas tests showed 96% availability, which is a big improvement. However the cluster filled up this morning and the instability reappeared, a sign that there is still something to fix on the worker nodes and in the scheduling. I added a reservation for ops and am looking at the WNs, some of which were re-imaged this morning.

Thursday, 18 October 2007

Manchester availability

SAM tests, both ops and atlas, were failing due to dcache problems. Part of it was due to the fact that Judit has changed her DN and somehow the cron job to build the dcache kpwd file wasn't working. In addition to that, dcache02 had to be restarted (both core and pnfs); as usual it started working again after that, without any apparent reason why it failed in the first place. gPlazma is not enabled yet.

That is mostly the reason for the drop in October.

Monday, 15 October 2007

Availability update

Lancs site availability looks OK for the last month at 94%, which is 13% above the GridPP average, and this includes a couple of weekends lost to dCache problems. The record from July-September has been updated on Jeremy's page. We still get the occasional failed SAM submission; no idea what causes these, but they serve to stop the availability reaching the high nineties.

  • The June-July instability was a dCache issue with the pnfs mount options; this only affected SAM tests where files were created and immediately removed.
  • Mid-August was SL4 upgrade problems, caused by a few blackhole WNs. This was tracked to the jpackage repository being down, which screwed up the auto-install of some WNs.
  • The mid-September problems were caused by adding a new dCache pool; we're not bringing it online until the issue is understood.
Job slot occupancy looks OK, with non-HEP VOs like fusion and biomed helping to fill the slots left by moderate Atlas production.

Friday, 12 October 2007

Sys Admin Requests wiki pages

YAIM has a new wiki page for sysadmin requests. Maria has sent an announcement to LCG-ROLLOUT. For bookkeeping, I added a link and explanations to the sysadmin wiki wishlist page, which also links to the ROC admins' management tools requests.

http://www.sysadmin.hep.ac.uk/wiki/Wishlist

Tuesday, 9 October 2007

BDII doc page

After the trouble Sheffield went through with the BDII, I started a BDII page on the sysadmin wiki.

http://www.sysadmin.hep.ac.uk/wiki/BDII

Monday, 8 October 2007

Manchester RGMA fixed

Fixed RGMA in Manchester. It had, for still obscure reasons, wrong permissions on the host key files. Started an RGMA troubleshooting page on the sysadmin wiki:

http://www.sysadmin.hep.ac.uk/wiki/RGMA#RGMA

EGEE '07

EGEE conference. I've given a talk in the SA1-JRA1 session which seems to have had a positive result and will hopefully have some follow-up.

Talk can be found at

http://indico.cern.ch/materialDisplay.py?contribId=30&sessionId=49&materialId=slides&confId=18714


and is the sibling of the one we gave in Stockholm at the Ops workshop on problems with SA3 and within SA1.

http://indico.cern.ch/contributionDisplay.py?contribId=25&confId=12807

which had some follow up with SA3 that can be found here

https://savannah.cern.ch/task/?5267

It's alive!

Sheffield problems reviewed

During the last update to the DPM I ran into several problems.

1. The DPM update failed due to changes in the way passwords are stored in mysql.
2. A misunderstanding with the new version of yaim that rolled out at the same time.
3. Config errors with the sBDii.
4. mds-vo-name.
5. Too many roll-outs in one go for me to have a clue what broke and where to start looking.

DPM update fails
I would like to thank Graeme for the great update instructions, they helped lots. The problems came when the update script used a different hashing method from the one used by mysql; the problem is described here: http://<>. This took some finding. It also means that every time we run the yaim config on the SE we have to go back and fix the passwords again, because yaim still uses the old hash, not the new one.
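For the record, the usual fix for this class of MySQL 4.1 hashing problem looks like the sketch below. It's an assumption on my part that this matches exactly what we hit, and the account name and password are placeholders.

```shell
# /etc/my.cnf -- keep mysqld using the pre-4.1 16-byte password hash:
#   [mysqld]
#   old_passwords=1
#
# then re-store the DPM db account password in the old format
# ('dpmmgr' and 'secret' are placeholders):
#   mysql -u root -p -e \
#     "SET PASSWORD FOR 'dpmmgr'@'localhost' = OLD_PASSWORD('secret');"
```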

Yaim updated half way and config errors
This confused the hell out of me: one minute I'm using yaim scripts to run updates, the next I have an updated version of yaim that I had to pass flags to, and this is where I guess I started making the mistakes that led to me setting the SE up as a sBDii. After getting lost with the new yaim I told the wrong machine that it was a sBDii and never realised.

mds-vo-name
With the help of Henry, we found out that our information was wrong, i.e. we had

mds-vo-name=local; it is now mds-vo-name=resource

Once this was changed in site-info.def and yaim was re-run on our mon box, which is also our sBDii, it all seemed to work.

Tuesday, 25 September 2007

Sheffield

Hi all

Sorry it has been so quiet on the Sheffield front. I've been out of the country, and it's currently registration here.

What's the state of the LCG here? I feel like I'm chasing my tail, hence there will shortly be a batch of email asking for help from TB Support. I have noticed there have been several updates while I've been away, so I will apply them before getting back to the main problem of why our SE seems not to have an entry in the BDii.

Friday, 14 September 2007

Latest on WN SL4/64 upgrade

I've created a gridpp wiki page which lists the cfengine config we're using to satisfy various VO requirements. Things have changed recently, with Atlas no longer requiring a 32 bit version of python to be installed; it's now included in the KITS release. We still have build problems with release 12.0.6, as used by Steve's tests, so we would be interested to see how others get on with that. The Atlas experts advise a move to the 13.0.X branch. Atlas production looks healthy again with plenty of queued jobs for the weekend, so hopefully smooth sailing from now on.

Advice to Atlas sites upgrading to SL4/64:
  • Expect failures when building code with release 12.0.6

Sunday, 26 August 2007

Some new links about security

This article is an interesting example of how even someone with very little experience can still do some basic forensics.

http://blog.gnist.org/article.php?story=HollidayCracking

I added the link under the forensics section of the sysadmin wiki

http://www.sysadmin.hep.ac.uk/wiki/Basic_Security#Forensic

Since I was at it I added a firewall section to

http://www.sysadmin.hep.ac.uk/wiki/Grid_Security#Firewall_configuration_and_Services_ports

Dcache Troubleshooting page

My tests on dcache started to fail for obscure reasons, due to gsidcap doors misbehaving. I started a troubleshooting page for dcache:

http://www.sysadmin.hep.ac.uk/wiki/DCache_Troubleshooting

Thursday, 23 August 2007

Another biomed user banned

Manchester has banned another biomed user for filling /tmp and causing trouble for other users. I opened a ticket.

https://gus.fzk.de/ws/ticket_info.php?ticket=26147

Wednesday, 22 August 2007

WNs installed with SL4

Last Wednesday was a scheduled downtime for Lancaster in order to do the SL4 upgrade on the WNs, as well as some assorted spring cleaning of other services. We can safely say this was our worst upgrade experience so far, a single day turned into a three-day downtime. Fortunately for most, this was self-inflicted pain rather than middleware issues. The fabric stuff (PXE, kickstart) went fine, our main problem was getting consistent users.conf and groups.conf files for the YAIM configuration, especially with the pool sgm/prd accounts and the dns-style VO names such as supernemo.vo.eu-egee.org. The latest YAIM 3.1 documentation provides a consistent description but our CE still used the 3.0 version so a few tweaks were needed (YAIM 3.1 has since been released for glite 3.0). Another issue was due to our wise old CE (lcg-CE) having a lot of crust from previous installations, in particular some environment variables which affected the YAIM configuration such that the newer vo.d/files were not considered. Finally, we needed to ensure the new sgm/prd pool groups were added to torque ACLs but YAIM does a fine job with this should you choose to use it along with the _GROUP_ENABLE variables.

Anyway, things look good again with many biomed jobs, some atlas, dzero, hone and even a moderate number of lhcb jobs, which are supposed to have issues with SL4.


On the whole the YAIM configuration went well, although the VO information at CIC could still be improved with mapping requirements from VOMS groups to GIDs. LHCb provide a good example to other VOs, with explanations.

Monday, 20 August 2007

Week 33 start of 34

The main goings-on have been the DPM update; this was hampered by password version problems in MySQL, and was resolved with help from here: http://www.digitalpeer.com/id/mysql . More problems came with the change to the BDii: new firewall ports were opened and yet there was still no data coming out.

The BDii was going to be fixed today; however, Sheffield has suffered several power cuts over the last 24 hours. This has affected the whole of the LCG here; recovery work is ongoing.

Tuesday, 14 August 2007

GOCDB3 permission denied

I can't edit NorthGrid sites anymore. I opened a ticket.

https://gus.fzk.de/pages/ticket_details.php?ticket=25846

I would be mildly curious to know if other people are experiencing the same or if I'm the only one.

Monday, 13 August 2007

Manchester MON box overloaded again with a massive amount of CLOSE_WAIT connections.

https://gus.fzk.de/pages/ticket_details.php?ticket=25647

The problem seems to have been fixed, but it affected the accounting for 2 or 3 days.
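A quick way to keep an eye on this sort of pile-up is to count CLOSE_WAIT sockets in netstat output. This is a sketch (any alerting threshold would be arbitrary); it reads from stdin so the counting can be tested offline.

```shell
# Count connections stuck in CLOSE_WAIT; reads netstat-style output on
# stdin. Live use on the MON box would be: netstat -tan | count_close_wait
count_close_wait() {
    grep -c 'CLOSE_WAIT'
}
```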

lcg_utils bug closed

Ticket about lcg_util bugs has been answered and closed

https://gus.fzk.de/pages/ticket_details.php?ticket=25406&from=allt

The correct versions of the rpms to install are

[aforti@niels003 aforti]$ rpm -qa GFAL-client lcg_util
lcg_util-1.5.1-1
GFAL-client-1.9.0-2

Update (2007/08/17): The problem was incorrect dependencies expressed in the meta rpms. Maarten opened a savannah bug.

https://savannah.cern.ch/bugs/?28738



Friday, 10 August 2007

Updating Glue schema

This is sort of old news, as the request to update the BDII is a month old.

To update the Glue schema you need to update the BDII on the BDII machine and on the CE and SE (dcache and classic). DPM SEs use BDII instead of globus-mds now, so you should check the recipe for that.

The first problem I found was that

yum update glite-BDII

doesn't update the dependencies, only the meta-rpm. Apparently this works with apt-get but not with yum. So if you use yum you have 3 alternatives:

1) yum -y update, and risk screwing up your machine
2) yum update, and check each rpm
3) Look up the list of rpms here

http://glite.web.cern.ch/glite/packages/R3.0/deployment/glite-BDII/3.0.2-12/glite-BDII-3.0.2-12.html

and yum update each of them.

Reconfiguring the BDII doesn't pose a threat so you can

cd
./scripts/configure_node BDII_site

On the CE and SE... you could upgrade the CE and SE and reconfigure the nodes. But I didn't want to do that, because you never know what might happen, and with the farm full of jobs and the SE being dcache I don't see the point in risking it for a schema upgrade. So what follows is a simple recipe to upgrade the glue schema on the CE and on SEs other than DPM without reconfiguring the nodes.

service globus-mds stop
yum update glue-schema
cd /opt/glue/schema
ln -s openldap-2.0 ldap
service globus-mds start

To check that it worked:

ps -afx -o etime,args | grep slapd

If your BDII is not on the CE and you find slapd instances on ports 2171-2173, it means you are also running a site BDII on your CE; you should turn it off and remove it from the startup services.

The ldap link is needed because the schema path has changed; unless you want to edit the configuration file (/opt/globus/etc/grid-info-slapd.conf), the easiest thing is to add a link.

Most of this is in this ticket

https://gus.fzk.de/pages/ticket_details.php?ticket=24586&from=allt

including where to find the new schema documentation.

Thursday, 9 August 2007

Documentation for Manchester local users

Yesterday, at a meeting with Manchester users who have tried to use the grid, it turned out that what they missed most was a page collecting the links to information scattered around the world (a common disease). As a consequence we have started pages to collect information useful to local users using the grid.

https://www.gridpp.ac.uk/wiki/Manchester


Current links are of general usefulness. Users will add their own personal tips and tricks later.

Wednesday, 8 August 2007

How to check accounting is working properly

Obviously, at the bottom of the accounting pages there is a graph showing the running VOs, but that is not straightforward. Two other ways are:

The accounting enforcement page showing sites that are not publishing and for how many days they haven't published.

http://www3.egee.cesga.es/acctenfor

which I linked from

https://www.gridpp.ac.uk/wiki/Links_Monitoring_pages#Accounting

or you could setup RSS feeds as suggested in the Apel FAQ.

I also created an Apel page with this information on the sysadmin wiki

http://www.sysadmin.hep.ac.uk/wiki/Apel

Monday, 6 August 2007

Progress on SL4

As part of our planned upgrade to SL4 at Manchester, we've been looking at getting dcache running.
The biggest stumbling block is the lack of a glite-SE_dcache* profile; luckily it seems that all of the needed components apart from dcache-server are in the glite-WN profile. Even the GSIFtp door appears to work.

Friday, 3 August 2007

Green fields of Lancaster

After sending the dcache problem the way of the dodo last week, we've been enjoying 100% SAM test passes over the past 7 days. It's nice to have to do next to nothing to fill in your weekly report. Not a very exciting week otherwise, odd jobs and maintenance here and there. Our CE has been very busy over the last week, which has caused occasional problems with the Steve Lloyd tests: we've had a few failures due to there being no job slots available, despite measures to prevent that. We'll see if we can improve things.

We're gearing up for the SL4 move. After Monday's very useful NorthGrid meeting at Sheffield we have a time frame for it: sometime during the week starting the 13th of August. We'll pin it down to an exact day at the start of the coming week. We've taken a worker offline as a guinea pig and will do hideous SL4 experimentations on it. The whole site will be in downtime 9-5 on the day we do the move; with luck we won't need that long, but we intend to use the time to upgrade the whole site (no SL3 kernels will be left within our domain). Luckily for us, Manchester have offered to go first in NorthGrid, so we'll have veterans of the SL4 upgrade nearby to call on for assistance.

Thursday, 2 August 2007

lcg-utils bugs

https://gus.fzk.de/pages/ticket_details.php?ticket=25406

Laptop reinstalled

EVO didn't work on my laptop. I reinstalled it with the latest version of ubuntu and java 1.6.0. It works now. To my great disappointment, facebook aquarium still doesn't ;-)

Fixed Manchester accounting

https://www.ggus.org/pages/ticket_details.php?ticket=25215

Glue Schema 2.0 use cases

Sent two broadcasts to collect Glue Schema use cases for the new 2.0 version. Received only two replies.

https://savannah.cern.ch/task/index.php?5229

How to kill a hanging job?

There is a policy being discussed about this. See:

https://www.gridpp.ac.uk/pmb/docs/GridPP-PMB-113-Inefficient_Jobs_v1.0.doc

written by Graeme and Matt.

Part of the problem is that the user doesn't see any difference between a job that died and one that was killed by a system administrator. One of the requests is for the job wrapper to catch the signal the standard tools send, so that an appropriate message can be returned and possibly some cleanup done as well. This last part is being discussed at the TCG.

https://savannah.cern.ch/task/index.php?5221

SE downtime

Tried to publish GlueSEStatus to fix the SE downtime problem

https://savannah.cern.ch/task/?5222

Connected to this is ggus ticket

https://www.ggus.org/pages/ticket_details.php?ticket=24586


which was originally opened to get a recipe for sites to upgrade the BDII in a painless way.

VO deployment

Sent comments for final report of VO deployment WG to Frederic Schaer.

I wrote a report about this over a year ago:

https://mmm.cern.ch/public/archive-list/p/project-eu-egee-tcg/Why%20it%20is%20a%20problem%20adding%20VOs.-770951728.EML?Cmd=open

The comments in my email to the TCG are still valid.

I think the estimated time to find the information for a VO is too short: it takes more than 30 mins, and normally people ask other sysadmins. I have found the CIC portal tool inadequate up to now. It would be better if the VOs themselves maintained a yaim snapshot on the CIC portal that could be downloaded, rather than inventing a tool. In the UK that's the way we have chosen, in the end, to avoid this problem.

http://www.gridpp.ac.uk/wiki/GridPP_approved_VOs

This is maintained by sysadmins and covers only site-info.def. groups.conf is not maintained by anyone, but it should be: at the moment sysadmins simply replicate the default, and when a VO like LHCb or dzero deviates from that there is trouble.

2) YAIM didn't use to have a per-service VO creation/deletion function that can be run on its own. It reconfigures the whole service, which makes sysadmins wary of adding a VO in production in case something goes wrong in other parts. From your report this seems to be still the case.

Dashboard updated

Dashboard updated with new security advisories link
https://www.gridpp.ac.uk/wiki/Northgrid-Dashboard#Security_Advisories

Sheffield July looking back

July had two points of outage, the worst being at the start of the month just after a gLite 3.0 upgrade; it took a bit of time to find the problem and the solution.

Error message: /opt/glue/schema/ldap/Glue-CORE.schema: No such file or directory
ldap_bind: Can't contact LDAP server
Solution was found here: http://wiki.grid.cyfronet.pl/Pre-production/CYF-PPS-gLite3.0.2-UPDATE33

At the end of the month we had a strange error that was spotted quickly and turned out to be the result of a DNS server crash on the LCG here at Sheffield, leaving the worker nodes' IPs unresolvable.

Sheffield hosted the monthly NorthGrid meeting, and all in all it was a good event.

Yesterday the LCG got its own dedicated 1Gb link to YHMAN and beyond; we also now have our own firewall, which will make changes quicker and easier.

Fun at Manchester SL4, lcg_util and pbs

In the midst of getting a successful upgrade-to-SL4 profile working, we upgraded our release of lcg_util from 1.3.7-5 to 1.5.1-1. This proved to be unsuccessful: SAM test failures galore. After looking around for a solution on the internet I settled for rolling back to the previous version; thanks to the wonders of cfengine this didn't take long, and happily cfengine should be forcing that version on all nodes.

This morning I came in to find we were again failing the SAM tests, this time with the ever-so-helpful

"Cannot plan: BrokerHelper: no compatible resources"

pointing to a problem deep in the depths of the batch system. Looking at our queues (via showq), there were a lot of idle jobs yet more than enough CPUs. The PBS logs revealed a new error message,

Cannot execute at specified host because of checkpoint or stagein files

for two of the jobs. Eventually I managed to track it down to a node. Seeing as there wasn't any sign of the job file any more, and pbs was refusing to re-run the job on another node, I had to resort to the trusty `qdel`. After thinking about it for the barest of moments, all of the idle jobs woke up and started running.

Just for some gratuitous cross-linking, Steve Traylen appears to have provided a solution over at the ScotGrid blog.

Friday, 27 July 2007

It's always a good feeling putting a timely end to a long-running problem. We've been plagued by the pnfs permission denied error for a little over a month, but hopefully (touch wood) we won't be seeing it again.

So what was the problem? It appears to have been a gridftp door nfs mount synchronisation problem, or something like that. Essentially, the nfs mounts that the gridftp door reads occasionally failed to update quickly enough, so tests like the SAM tests, which copy in a file then immediately try to access it, occasionally barfed: the door checked whether the file was there, but as its mount hadn't been updated it saw nothing. A subsequent test might hit the door after it had synced, or hit an already-synced door, and thus pass. I only tracked this down after writing my own transfer tests and finding that I only failed when copying out "fresh" files. When I mentioned this during the UKI meeting, Chris Brew pointed me towards a problem with my pnfs mounts and Greig helped point me towards what they should be (thanks dudes!). I found that my mounts were missing the "noac" (no attribute caching) option; this is what Greig and Chris have on their mounts, and it is recommended for multiple clients accessing one server. So after a quick remount of all my doors things seem miraculously all better: my homemade tests all worked and Lancaster's SAM tests are a field of calming green. Thanks to everyone for their help and ideas.
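For anyone with the same symptoms, the change amounts to adding noac to the pnfs mount on each door; something like this sketch, where the host and the other mount options are placeholders for our real ones:

```shell
# /etc/fstab on a gridftp door -- note the noac (no attribute caching)
# option, so the door always sees fresh pnfs metadata:
#   pnfs.example.ac.uk:/pnfs  /pnfs  nfs  rw,hard,intr,noac  0 0
#
# and to pick it up without rebooting the door:
#   mount -o remount /pnfs
```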

In other news, we're gearing up for the SL4 move; ETA about a fortnight. We plan to lay the groundwork, then try to reinstall all the WNs in one day of downtime. We'll have to see if this is a feasible plan.

Have a good weekend all, I know I will!

Friday, 20 July 2007

Replica manager failure solution? Permission denied

We're still facing down the same dcache problem. Consultation with the experts has established that the problem isn't (directly) a gPlazma one: the requests are being assigned to the wrong user (or no user) and suffering permission problems, even when immediately preceding and following access requests of the same nature with the same proxy succeed without a glitch. The almost complete lack of any sign of this problem in the logs is adding to the frustration. I've sent the developers every config file they've asked for and some they didn't, upped the logging levels and scoured logs till my eyes were almost bleeding. And I haven't a clue how to fix this thing. There's no real pattern, other than the fact that failures seem to happen more often during the working day (but then it's a small statistical sample), and there are no corresponding load spikes. We have a ticket open against us about this and the word quarantine got mentioned, which is never a good thing. It sometimes feels like we have a process running in our srm that rolls a dice every now and again, and if it comes up as a 1 we fail our saving throw vs. SAM test failure. If we could just find a pattern or cause then we'd be in a much better position. All we can do is keep scouring, maybe apply the odd tweak, and see if something presents itself.

Friday, 13 July 2007

Torque scripts

Added two scripts to the sysadmin repository:

https://www.sysadmin.hep.ac.uk/wiki/Nodes_Down_in_Torque

https://www.sysadmin.hep.ac.uk/wiki/Counting_Jobs

Happy Friday 13th

Business as usual for Lancaster, with CA updates and procuring new hardware on our agenda this week. We're also being plagued by a(nother) weird intermittent dcache problem that's sullying what would be an otherwise flawless SAM test run. We've managed to track it down to some kind of authorisation/load problem after chatting to the dcache experts (which makes me groan after the fun and games of configuring gPlazma to work for us). At least we know which tree we need to bark up; hopefully the next week will see the death of this problem and the green 'ok's will reign supreme for us.

Pretty graphs via MonAMI


Paul in Glasgow has written ganglia scripts to display a rich set of graphs using torque info collected by MonAMI. We've added these to the Lancaster ganglia page. Installation was straightforward, although we had some problems due to the ancient version of rrdtool installed on this SL3 box. Paul made some quick patches, things are now compatible with the older version, and the results are not too different from ScotGrid's SL4 example. Useful stuff!

Wednesday, 11 July 2007

Site BDII glite 3.0.2 Update 27

Updating the glite-BDII meta rpm doesn't update the rpms to the required version. I opened a ticket.

http://gus.fzk.de/pages/ticket_details.php?ticket=24586


Monday, 9 July 2007

glite CE passing arguments to Torque

Started testing the glite CE passing arguments to the Torque server. Installed a gCE with two WNs; it should be inserted in a BDII at IC. The script that does the translation from Glue schema attributes to submission parameters is not there, but I have the LSF equivalent. The script /opt/glite/bin/pbs_submit.sh works standalone, so I could start to look at it on my laptop where I have a one-WN system.

The problem is not passing the argument but the number of ways a sysadmin can set up a configuration to do the same thing. For the memory parameters it is not a very big problem, but for the rest of the possible configurations it is, and a discussion on a standard configuration is required. Snooping around, most of the sites that allow connections have the standard YAIM configuration, but possibly that's because quite a few things break if they don't.
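As a toy illustration of the easy (memory) half of the translation: mapping a Glue memory value in MB to a torque resource request might look like the following. The `-l mem=` syntax is standard torque; the variable names are mine, not the real script's:

```shell
# hypothetical value of the Glue attribute GlueHostMainMemoryRAMSize (MB)
glue_mem_mb=2048

# translate into the torque/pbs resource-request syntax
torque_args="-l mem=${glue_mem_mb}mb"

# the submission the translated script would issue
echo "qsub $torque_args job.sh"
```

The hard part, as noted above, is that a site can express the same limit through queue defaults, server defaults or per-job requests, and the script has to cope with all of them.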

Sheffield week 28

Now begins the process of unpicking what went wrong here at Sheffield.
So far the main problem seems to be with the CE.

On the plus side, the DPM patching went well.

sheffield week 27

Arrrr, it's all gone wrong: things are very broken and there's not enough time to fix them.

Power outages

Friday, 29 June 2007

Liverpool Weekly Update

Most of the work at Liverpool this week was the usual networking and ongoing hardware repairs. We should have some more nodes becoming available (approximately 50) over the next few weeks thanks in particular to the work Dave Muskett has been doing.

We've started to look at configuration deployment systems such as cfengine (Colin's talk at HEPSysMan on this was helpful), Puppet and Quattor. We're presently evaluating Puppet on some of our non-LCG systems, and we look forward to discussing this subject at the next technical meeting.

And as mentioned last week, Paul Trepka is in the process of adding support for additional VOs. A couple of errors were encountered in this process yesterday resulting in failed SAM tests overnight, but these have (hopefully!) been rectified now.

Lancaster Weekly Update

This week has not been fun, having been dominated by a dcache configuration problem that's caused us to fail a week's worth of SAM tests. The gPlazma plugin (the voms module for dcache) had started complaining about not being able to determine which username to map an access request to from a given certificate proxy. This problem was made worse by dcache then not falling back to the "old fashioned" kpwd file method of doing the mapping, so users were getting "access denied" type messages. Well, all users except some, who had no problem at all; these privileged few included my good self, so diagnosing this problem involved a lot of asking Brian if it was still broken.

After some masterful efforts from Greig Cowan and Owen Synge we finally got things back up and running again. Eventually we fixed things by:
  • Upgrading to Owen's latest version of glite-yaim (3.0.2.13-3) and installing his config script dcacheVoms2GPlasma. After some bashing and site-info tweaking this got the gPlazma config files looking a bit more usable.
  • Fiddling with the permissions so that the directories in pnfs were group writable (as some users were now being mapped to the sgm/prd vo accounts).
  • Upgrading to dcache-server-1.7.0-38.

Between all these steps we seem to have got things working. We're still unsure why things broke in the first place, why gPlazma wouldn't fall back to the kpwd way of doing things, or why it still worked for some users and not others. I'd like to get to the bottom of these things before I draw a line under this problem.

Thursday, 28 June 2007

Sheffield week26

Wet wet and did I mention it rained here.

Not much to report on the cluster; it is still up and running, although we started failing Steve Lloyd's tests yesterday afternoon. I will look into this when I get time.

The University's power is in a state of "At Risk" until midday Friday. As a result Sheffield might go offline without warning.

Monday, 25 June 2007

Manchester weekly update

This week, after passing the Dell Diagnostic Engineer course, I've been diagnosing Dell hardware issues, and getting Dell to provide component replacements, or send an engineer. Finally they aren't treating us like a home user. I've also been sorting out issues between a recently installed SL4 node, kickstart leaving partitions intact, and cfengine.

Colin has been working on a new nagios plugin (and no doubt other things).

Friday, 22 June 2007

Liverpool Weekly Update

This week's work at Liverpool was mostly a continuation of last week's - more networking as we bring the new firewall/router into operation, and more Nagios and Ganglia tweaking as we add more systems and services to the monitoring.

Plans to add a second 1Gbps link from our cluster room to Computing Services to create a combined 2Gbps link, along with a third 1Gbps link to a different building for resilience, have taken a step forward. A detailed proposal for this has now been agreed with Computing Services and funding approved.

Alessandra made a useful visit yesterday, providing help with adding additional VOs (which is being done today, all going well) and investigating problems with ATLAS software installation amongst other things.

Monday, 18 June 2007

Flatline!


Last week was moderately annoying for the Lancaster CE, with hundreds of jobs immediately failing on WNs due to skewed clocks. The ntpd service was running correctly, so we were in the dark about the cause. After trying to re-sync manually with ntpdate it was apparent something was wrong with the university ntp server: it only responded to a fraction of requests. It turned out to be a problem with the server "ntp.lancs.ac.uk", which is an alias for these machines:
ntp.lancs.ac.uk has address 148.88.0.11
ntp.lancs.ac.uk has address 148.88.0.8
ntp.lancs.ac.uk has address 148.88.0.9
ntp.lancs.ac.uk has address 148.88.0.10

Only 148.88.0.11 is responding, so I raised a ticket with ISS and look forward to a fix. In the meantime the server has been changed to 148.88.0.11 in the ntp.conf file managed by cfengine, and the change rolled out without a problem.
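A quick way to check each machine behind the alias individually is to query them one at a time with `ntpdate -q` (query-only, so it never touches the local clock); something like:

```shell
# query each IP behind the ntp.lancs.ac.uk alias without setting the clock
check_ntp_servers() {
    ok=0
    for ip in "$@"; do
        if ntpdate -q "$ip" >/dev/null 2>&1; then
            echo "OK   $ip"
            ok=$((ok + 1))
        else
            echo "DEAD $ip"
        fi
    done
    echo "$ok of $# servers responding"
}

check_ntp_servers 148.88.0.8 148.88.0.9 148.88.0.10 148.88.0.11
```
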

Just to stick the boot in, an unrelated issue caused our job slots to be completely vacated over the weekend and we've started to fail Steve's Atlas tests. This is down to a bad disk on a node which went read-only. I need to find the exact failure mode in order to make yet another WN health check; this one slipped past the existing checks. :-( I'm currently at Michigan State Uni (Go Spartans!) for the DZero workshop and the crippled wireless net makes debugging painful.

Friday, 15 June 2007

Lancaster Weekly Update -the Sequel

A bit of an unexciting week. A lot of intermittent, short (one test) replica manager failures might point to a small stability issue with our SE; however, SAM problems this week have prevented me from finding the details of each failure. The best I could do when we failed a test was poke our SRM to make sure it was working. The trouble looking at the SAM records made this week's weekly report for Lancaster quite dull.

After last week's PNFS move, and postgres mysteriously behaving after being restarted a few times on Tuesday, the CPU load on our SRM admin node is now under control, which should greatly improve our performance and make timeouts caused by our end a thing of the past. Fingers crossed.

Another notable tweak this week was an increase in the number of pool partitions given to atlas: they now have exclusive access to 5/6 of our dcache. Our dcache is likely to grow in the near future as we reclaim a pool node that was being used for testing, which will increase our SRM capacity by over 10TB; this 10TB will be split in the same way as the rest of the dcache.

My last job with the SRM (before we end up upgrading to the next dcache version, whenever that comes) is to deal with a replica infestation. During a test of the replica manager, quite a while ago now, we ended up with a number of files replicated 3-4 times, and for some reason all replicas were marked as precious, preventing them from being cleaned up via the usual mechanisms. Attempts to force the replica manager to clean up after itself have failed; even giving it weeks to do its job yielded no results. It looks like we might need a VERY carefully written script to clean things up and remove the few TB of "dead space" we have at the moment.

Liverpool Weekly Report

Recent work at Liverpool has included:
  • Monitoring improvements - I've configured Nagios and John Bland is rolling out Ganglia, both of which have already proved very useful. We're also continuing to work on improving environmental monitoring here, particularly as relates to detecting failures in the water-cooling system.
  • Significant hardware maintenance, including replacing two failed Dell Powerconnect 5224 switches in a couple of the water-cooled racks with new HP Procurve 1800s - more difficult than it should be due to the water-cooling design - and numerous node repairs.
  • Network topology improvements, including installation of a new firewall/router.

Most of this week was spent trying to identify the reason why Steve Lloyd's ATLAS tests were mostly being aborted and why large numbers of ATLAS production jobs were failing here, mostly with the EXECG_GETOUT_EMPTYOUT error. I eventually identified the main problem as being with the existing ssh configuration on our batch cluster, where a number of worker node host keys were missing from the CE. This (along with a couple of other issues) has now been fixed, and hopefully we'll see a large improvement in site efficiency as a result.
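For anyone facing the same missing-host-key problem, the keys can be gathered centrally with `ssh-keyscan` rather than by hand. A sketch (the node names and file paths are placeholders; on a real CE the node list would come from the batch system, e.g. `pbsnodes -a`):

```shell
# worker nodes whose keys we want (placeholder names)
NODES="wn001 wn002 wn003"

out=$(mktemp)
for n in $NODES; do
    # prints "host keytype key" lines; unreachable hosts are skipped silently
    ssh-keyscan -t rsa "$n" 2>/dev/null
done > "$out"

# merge with the existing system-wide known_hosts, dropping duplicates;
# the result would be reviewed and then moved into place on the CE
sys=/etc/ssh/ssh_known_hosts
[ -f "$sys" ] || sys=/dev/null
sort -u "$out" "$sys" > ssh_known_hosts.new

echo "collected $(wc -l < ssh_known_hosts.new) key lines"
```
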

While investigating this, I also noticed a large number of defunct tar processes left over on multiple nodes by the atlasprd user, which had been there for up to 16 days. We're not sure what caused these processes to fail to exit, so any insights on that would be welcome.
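For anyone wanting to check their own nodes for the same symptom: defunct (zombie) processes show up with state `Z` in `ps`. A minimal check, runnable as-is on any Linux WN:

```shell
# list defunct (zombie) processes, keeping the ps header for readability
find_zombies() {
    ps -eo pid,stat,user,etime,comm | awk 'NR == 1 || $2 ~ /^Z/'
}

find_zombies
```

An empty result (header only) means no zombies; the `etime` column shows how long any offenders have been hanging around.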

Finally, Paul Trepka has been bringing up a new deployment system for the LCG racks - see him for details.

Sheffield week 24

I think I'm slowly getting my head round all this now {but don't test me ;)}

Technically there is not much new to report: some downed workers have had new disks put in them, and plans are being made to upgrade the workers and to finish sorting out Andy's legacy.

The main problem is that the building housing the machine room is a no-access building site, and I have been warned about a power outage in July.

Wednesday, 13 June 2007

Sheffield Update

I go away for a long weekend and we start failing SAM tests again. After a few emails from Greig and some time waiting for the next tests, we are now passing them all.

Our failings over the past few weeks seem to come down to one of two things: certificate upgrades not automatically working on all machines, and me not knowing when and how to change the DN.

We have fixed the gridice information about disk sizes on the SE, as well as looking into adding more pools.

back to my day job

Friday, 8 June 2007

Lancaster Weekly Update

A busy week for Lancaster on the SE front. We had the "PNFS move", where the PNFS services were moved from the admin node onto their own host. There were complications, mainly caused by the recipe I found missing one or two key details that I overlooked when preparing for it.

I am going to wikify my fun and games, but essentially my problems can be summed up as:

  • Make sure in the node_config that both the admin and pnfs nodes are marked down as "custom" in their node type. Keeping the admin node as "admin" causes it to want to run PNFS as well.

  • In the pnfs exports directory make sure the srm node is in there, and that on the srm node the pnfs directory on the pnfs node is mounted (similar to how door nodes are mounted, although not quite the same; to be honest I'm not sure I have it right, but it seems to work).

  • Start things in the right order: the pnfs server on the PNFS node, then the dcache-core services on the admin node, then the PNFSManager on the PNFS node. I found that on a restart of the admin node services I had to restart the PNFSManager; I'm not sure how to fix this to enable automatic startup of our dcache services in the correct order.

  • Make sure that postgres is running on the admin node. It won't produce an error on startup if postgres isn't up (as it would have done when pnfs ran on the node); it will simply not work when you attempt transfers.

  • Don't do things with a potential to go wrong on a Friday afternoon if you can avoid it!
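For the record, the node_config change described above amounts to something like the fragment below. The path, variable and service names are from memory of a dcache 1.7 install, so treat them as assumptions and check the comments in your own node_config:

```
# /opt/d-cache/etc/node_config on BOTH the admin and pnfs hosts:
NODE_TYPE=custom      # "admin" would make the admin host try to run PNFS too

# startup order that worked here:
#   1. pnfs            (pnfs host)
#   2. dcache-core     (admin host)
#   3. PNFSManager     (pnfs host, restarted after step 2)
```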

Since the move we have yet to see a significant performance increase, but then it's yet to be seriously challenged. We performed some more postgres housekeeping on the admin node after the move, which made it a lot happier. Since the move we have noticed occasional srm SFT failures with a "permission denied" type error, although checking the pnfs namespace we don't see any glaring ownership errors. I'm investigating it.

We have had some other site problems this week, caused by the clocks on several nodes being off by a good few minutes. It seems Lancaster's ntp server is unwell.

The room where we keep our pool nodes is suffering from heat issues. This always leaves us on edge, as our SE has had to be shut down before because of this, and the heat can make things flaky. Hopefully that machine room will get more cooling power, and soon.

Other site news from Peter:
A misconfigured VO_GEANT4_SW_DIR caused some WNs to have a full / partition, becoming blackholes. On top of this, a typo (an extra quote) in the site-info.conf caused lcg-env.sh to be messed up, failing jobs immediately. Fixed now, but it flags up how sensitive the system is to tweaks. Our most stable production month was when we implemented a no-tweak policy.
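A full root partition is easy to catch before a node turns into a blackhole; a minimal health-check sketch (the 95% threshold is an arbitrary example):

```shell
# flag the node if / is nearly full
threshold=95
usage=$(df -P / | awk 'NR == 2 {gsub(/%/, ""); print $5}')

if [ "$usage" -ge "$threshold" ]; then
    echo "WARN: / is ${usage}% full - node should be offlined"
else
    echo "OK: / is ${usage}% full"
fi
```

A check like this would be run from cron on each WN, marking the node offline in the batch system when it fires.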

Manchester Weekly Update

So far this week we've had duplicate tickets from GGUS about a failure with dcache01 (affecting the ce01 SAM tests): all transfers were stalling. I couldn't debug this as my certificate expired the day I returned from a week off and my dteam membership still hasn't been updated. Restarting the dcache headnode fixed it.
This morning I discovered that a number of our worker nodes had full /scratch partitions. The problem has been tracked to a phenogrid user, and we're working with him to isolate the issue.
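Tracking a full scratch area down to a user is usually a one-liner with `du`. The demo below builds a throwaway scratch-like tree so it runs anywhere; on the WNs the target would be /scratch itself, assuming one directory per user (which is how ours is laid out):

```shell
# build a scratch-like tree to demo against (names are made up)
scratch=$(mktemp -d)
mkdir "$scratch/phenouser" "$scratch/atlasprd"
dd if=/dev/zero of="$scratch/phenouser/data" bs=1024 count=64 2>/dev/null

# biggest consumers first; on a real WN this would be:
#   du -sk /scratch/* | sort -rn | head
du -sk "$scratch"/* | sort -rn | head -3
```
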