Most of the work at Liverpool this week was the usual networking and ongoing hardware repairs. We should have some more nodes becoming available (approximately 50) over the next few weeks thanks in particular to the work Dave Muskett has been doing.
We've started to look at configuration deployment systems, namely cfengine (Colin's talk on this at HEPSysMan was helpful), Puppet and Quattor. We're presently evaluating Puppet on some of our non-LCG systems, and we look forward to discussing this subject at the next technical meeting.
And as mentioned last week, Paul Trepka is in the process of adding support for additional VOs. A couple of errors were encountered in this process yesterday, resulting in failed SAM tests overnight, but these have (hopefully!) been rectified now.
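For anyone doing the same, enabling an extra VO with YAIM mostly comes down to a handful of site-info.conf entries plus matching pool accounts in users.conf and a mapping in groups.conf. The sketch below is a generic example rather than our actual configuration: the variable names are from memory of 3.0-era YAIM and the "biomed" values are placeholders, so check them against the documentation before use.

    # site-info.conf fragment - hypothetical example of enabling the biomed VO
    VOS="atlas dteam ops biomed"        # add the new VO to the supported list
    VO_BIOMED_SW_DIR=$VO_SW_DIR/biomed  # experiment software area on the WNs
    VO_BIOMED_DEFAULT_SE=$SE_HOST
    VO_BIOMED_VOMS_SERVERS="vomss://voms.example.org:8443/voms/biomed?/biomed/"
    VO_BIOMED_VOMSES="'biomed voms.example.org 15000 /DC=org/DC=example/CN=voms.example.org biomed'"
    # then add biomed pool accounts to users.conf, a biomed line to groups.conf,
    # and re-run YAIM on the affected node types (CE, SE, WNs).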
Friday, 29 June 2007
Lancaster Weekly Update
This week has not been fun, having been dominated by a dcache configuration problem that's caused us to fail a week's worth of SAM tests. The gPlazma plugin (the VOMS module for dcache) had started complaining that it couldn't determine which username to map an access request to from a given certificate proxy. The problem was made worse by dcache then not falling back to the "old fashioned" kpwd file method of doing the mapping, so users were getting "access denied" type messages. Well, all users except some, who had no problem at all; these privileged few included my good self, so diagnosing this problem involved a lot of asking Brian if it was still broken.
After some masterful efforts from Greig Cowan and Owen Synge we finally got things back up and running again. Eventually we fixed things by:
Upgrading to Owen's latest version of glite-yaim (3.0.2.13-3), and installing his config script dcacheVoms2GPlasma. After some bashing and site-info tweaking this got the gPlazma config files looking a bit more usable.
Fiddling with the permissions so that the directories in pnfs were group writable (as now some users were being mapped to the sgm/prd vo accounts).
Upgrading to dcache-server-1.7.0-38.
Between all these steps we seem to have things working. We're still unsure why things broke in the first place, why gPlazma wouldn't fall back to the kpwd way of doing things or why it still worked for some and not for others. I'd like to try and get to the bottom of these things before I draw a line under this problem.
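For anyone hitting the same mapping problem, the permissions part of the fix above amounts to making the VO directories in the pnfs namespace group writable, so that the sgm/prd mappings can write alongside the ordinary pool accounts. A rough sketch, with the mount point, VOs and groups as placeholders rather than our actual layout:

    # Illustrative only - adjust the pnfs path, VO list and groups to your site.
    PNFS_ROOT=/pnfs/example.ac.uk/data
    for vo in atlas dteam; do
        chgrp -R "$vo" "$PNFS_ROOT/$vo"   # pool, sgm and prd accounts share this group
        chmod -R g+w "$PNFS_ROOT/$vo"     # group write so sgm/prd mappings can store files
    done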
Thursday, 28 June 2007
Sheffield week 26
Wet, wet, and did I mention it rained here?
Not much to report to do with the cluster; it is still up and running, although we started failing Lloyd's tests yesterday afternoon. I will look into this when I get time.
The University's power is in a state of "At Risk" until midday Friday. As a result Sheffield might go offline without warning.
Monday, 25 June 2007
Manchester weekly update
This week, after passing the Dell Diagnostic Engineer course, I've been diagnosing Dell hardware issues and getting Dell to provide component replacements or send an engineer. Finally they aren't treating us like a home user. I've also been sorting out an interaction between a recently installed SL4 node, kickstart leaving partitions intact, and cfengine.
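Whether kickstart keeps or wipes existing partitions on a reinstall comes down to the clearpart directives. A minimal sketch of the wipe-everything case (the sizes and filesystems are assumptions, not our actual kickstart):

    # Kickstart fragment - destroy any existing partition table on reinstall.
    zerombr                  # some older anaconda releases want "zerombr yes"
    clearpart --all --initlabel
    part /boot --fstype ext3 --size=100
    part swap  --size=2048
    part /     --fstype ext3 --size=1 --grow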
Colin has been working on a new nagios plugin (and no doubt other things).
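I don't know the details of Colin's plugin, but for anyone new to writing them, a Nagios plugin is just a script that prints a single status line and exits with the standard codes (0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN). A trivial sketch that checks root-partition usage:

    #!/bin/bash
    # Minimal Nagios-style plugin: warn/critical on root partition usage.
    WARN=80; CRIT=90
    USED=$(df -P / | awk 'NR==2 {gsub("%",""); print $5}')
    if [ -z "$USED" ]; then echo "UNKNOWN - could not read df output"; exit 3; fi
    if [ "$USED" -ge "$CRIT" ]; then echo "CRITICAL - / is ${USED}% full"; exit 2; fi
    if [ "$USED" -ge "$WARN" ]; then echo "WARNING - / is ${USED}% full"; exit 1; fi
    echo "OK - / is ${USED}% full"; exit 0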
Friday, 22 June 2007
Liverpool Weekly Update
This week's work at Liverpool was mostly a continuation of last week's - more networking as we bring the new firewall/router into operation, and more Nagios and Ganglia tweaking as we add more systems and services to the monitoring.
Plans to add a second 1Gbps link from our cluster room to Computing Services to create a combined 2Gbps link, along with a third 1Gbps link to a different building for resilience, have taken a step forward. A detailed proposal for this has now been agreed with Computing Services and funding approved.
Alessandra made a useful visit yesterday, providing help with adding additional VOs (which is being done today, all going well) and investigating problems with ATLAS software installation amongst other things.
Monday, 18 June 2007
Flatline!
Last week was moderately annoying for the Lancaster CE, with hundreds of jobs immediately failing on WNs due to skewed clocks. The ntpd service was running correctly, so we were in the dark about the cause. After trying to re-sync manually with ntpdate it was apparent something was wrong with the university NTP server: it only responded to a fraction of requests. This turned out to be a problem with the server "ntp.lancs.ac.uk", which is an alias for these machines:
ntp.lancs.ac.uk has address 148.88.0.11
ntp.lancs.ac.uk has address 148.88.0.8
ntp.lancs.ac.uk has address 148.88.0.9
ntp.lancs.ac.uk has address 148.88.0.10
Only 148.88.0.11 is responding, so I raised a ticket with ISS and look forward to a fix. In the meantime the server has been changed to 148.88.0.11 in the ntp.conf file managed by cfengine, and it has rolled out without a problem.
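For the record, the quick way to see which of the aliased addresses actually answer is to query each one directly without touching the clock; the fix itself is then a one-line change to the cfengine-managed ntp.conf.

    # Query each address behind ntp.lancs.ac.uk without setting the clock.
    for ip in 148.88.0.8 148.88.0.9 148.88.0.10 148.88.0.11; do
        echo "== $ip =="
        ntpdate -q "$ip"        # prints an offset line if the server answers
    done

    # /etc/ntp.conf (via cfengine) - point at the one server that responds:
    #   server 148.88.0.11
    # then: service ntpd restart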
Just to stick the boot in, an unrelated issue caused our job slots to be completely vacated over the weekend and we've started to fail Steve's Atlas tests. This is due to a bad disk on a node which went read-only. I need to find the exact failure mode in order to make yet another WN health check; this one slipped past the existing checks. :-( Currently at Michigan State Uni (Go Spartans!) for the DZero workshop, and the crippled wireless net makes debugging painful.
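The check we need is something along these lines - confirm the job scratch area is actually writable before the node accepts work (the path is a placeholder; the real thing would live in cfengine alongside the other health checks):

    #!/bin/bash
    # WN health check sketch: fail if the scratch filesystem has gone read-only.
    SCRATCH=/scratch                       # placeholder path
    TESTFILE="$SCRATCH/.wn_health_$$"
    if touch "$TESTFILE" 2>/dev/null; then
        rm -f "$TESTFILE"
        exit 0                             # writable, node is healthy
    else
        echo "$(hostname): $SCRATCH is not writable (read-only disk?)" >&2
        exit 1                             # caller can offline the node in the batch system
    fi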
Friday, 15 June 2007
Lancaster Weekly Update - the Sequel
A bit of an unexciting week. A lot of intermittent, short (one test) replica manager failures might point to a small stability issue for our SE; however, SAM problems this week have prevented me from finding the details of each failure. The best I could do was, whenever we failed a test, to poke our SRM to make sure that it was working. The trouble looking at the SAM records made this week's weekly report for Lancaster quite dull.
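(For anyone wanting to do the same kind of poke, the simplest version is just pushing a small test file through the SRM and deleting it again. The SE hostname and VO below are placeholders and the lcg-utils options are from memory, so treat the whole thing as a sketch.)

    # Rough check that the SRM accepts a write.
    echo "srm test $(date)" > /tmp/srmtest.txt
    GUID=$(lcg-cr --vo dteam -d se01.example.ac.uk file:/tmp/srmtest.txt) \
        && echo "write OK: $GUID" \
        && lcg-del --vo dteam -a "$GUID"   # remove all replicas of the test file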
After last week's PNFS move, and postgres mysteriously behaving after being restarted a few times on Tuesday, our CPU load on the SRM admin node is now under control, which should greatly improve our performance and make timeouts caused by our end a thing of the past. Fingers crossed.
Another notable tweak this week was an increase in the number of pool partitions given to ATLAS; they now have exclusive access to 5/6 of our dcache. Our dcache is likely to grow in size in the near future as we reclaim a pool node that was being used for testing, which will increase our SRM by over 10TB; this 10TB will be split in the same way as the rest of the dcache.
My last job with the SRM (before we end up upgrading to the next dcache version, whenever that comes) is to deal with a replica infestation. During a test of the replica manager quite a while ago we ended up with a number of files replicated 3-4 times, and for some reason all replicas were marked as precious, preventing them from being cleaned up via the usual mechanisms. Attempts to force the replica manager to clean up after itself have failed; even giving it weeks to do its job yielded no results. It looks like we might need a VERY carefully written script to clean things up and remove the few TB of "dead space" we have at the moment.
Liverpool Weekly Report
Recent work at Liverpool has included:
- Monitoring improvements - I've configured Nagios and John Bland is rolling out Ganglia, both of which have already proved very useful. We're also continuing to work on improving environmental monitoring here, particularly as it relates to detecting failures in the water-cooling system.
- Significant hardware maintenance, including replacing two failed Dell Powerconnect 5224 switches in a couple of the water-cooled racks with new HP Procurve 1800s - more difficult than it should be due to the water-cooling design - and numerous node repairs.
- Network topology improvements, including installation of a new firewall/router.
Most of this week was spent trying to identify the reason why Steve Lloyd's ATLAS tests were mostly being aborted and why large numbers of ATLAS production jobs were failing here, mostly with the EXECG_GETOUT_EMPTYOUT error. I eventually identified the main problem as being with the existing ssh configuration on our batch cluster, where a number of host keys for worker nodes were missing from the CE. This (along with a couple of other issues) has now been fixed, and hopefully we'll see a large improvement in site efficiency as a result.
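One way to put the keys right is simply to scan them all back in from the CE; the node list and known_hosts location below are assumptions for a generic setup rather than a description of exactly what was done here.

    # Gather host keys for all worker nodes into the CE's global known_hosts.
    # wn-list.txt is a placeholder: one worker node hostname per line.
    ssh-keyscan -t rsa,dsa -f wn-list.txt >> /etc/ssh/ssh_known_hosts
    # Sanity check - this should now connect without a host-key prompt:
    ssh -o BatchMode=yes wn001.example.ac.uk hostname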
While investigating this, I also noticed a large number of defunct tar processes left over on multiple nodes by the atlasprd user, which had been there for up to 16 days. We're not sure what caused these processes to fail to exit, so any insights on that would be welcome.
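If anyone wants to check their own nodes for the same thing, defunct processes show up with a 'Z' in the state column of ps, and it's the parent process that needs investigating, since the zombies themselves can't be killed:

    # List zombie (defunct) processes with their owner, parent and age.
    ps -eo pid,ppid,user,etime,stat,comm | awk 'NR==1 || $5 ~ /Z/'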
Finally, Paul Trepka has been bringing up a new deployment system for the LCG racks - see him for details.
Sheffield week 24
I think I'm slowly getting my head round all this now {but don't test me ;)}
Technically there is not much new to report: some down workers have had new disks put in them. Plans are being made to upgrade the workers and to finish sorting out Andy's legacy.
The main problem is that the building where the machine room is housed is a no-access building site, and I have been warned about a power outage in July.
Wednesday, 13 June 2007
Sheffield Update
I go away for a long weekend and we start failing SAM tests again. After a few emails from Greig and some time waiting for the next tests, we are now passing all the tests.
Our failings over the past few weeks seem to be down to one of two things: certificate upgrades not automatically working on all machines, and me not knowing when and how to change the DN.
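A couple of quick checks that would catch the certificate side of this (the paths are the standard /etc/grid-security ones, and the last line assumes the CA certificates come from the lcg-CA meta-package; adjust if your setup differs):

    # When does the host certificate expire?
    openssl x509 -in /etc/grid-security/hostcert.pem -noout -enddate

    # Any CRLs that haven't been refreshed in the last two days (fetch-crl not running?):
    find /etc/grid-security/certificates -name '*.r0' -mtime +2

    # Compare the CA bundle version across nodes to spot ones that missed an update:
    rpm -q lcg-CA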
We have fixed the GridICE information about disk sizes on the SE, and are also looking into adding more pools.
Back to my day job.
Friday, 8 June 2007
Lancaster Weekly Update
A busy week for Lancaster on the SE front. We had the "PNFS move", where the PNFS services were moved from the admin node onto their own host. There were complications, mainly caused by the fact that the recipe I found was missing one or two key details that I overlooked when preparing for it.
I am going to wikify my fun and games, but essentially my problems can be summed up as:
Make sure in the node_config both the admin and pnfs nodes are marked down as "custom" in their node type. Keeping the admin node as "admin" causes it to want to run PNFS as well.
In the pnfs exports directory make sure the srm node is in there, and that on the srm node the pnfs directory on the pnfs node is mounted (similar to how it is mounted on door nodes, although not quite the same; to be honest I'm not sure I have it right, but it seems to work).
Start things in the right order: the pnfs server on the PNFS node, the dcache-core services on the admin node, then the PNFSManager on the PNFS node (there's a rough sketch of this order after the list). I found that on a restart of the admin node services I had to restart the PNFSManager, and I'm not sure how to fix this to enable automatic startup of our dcache services in the correct order.
Make sure that postgres is running on the admin node. It won't produce an error on startup if postgres isn't up (as it would have done if running pnfs on the node), but it will simply not work when you attempt transfers.
Don't do things with a potential to go wrong on a Friday afternoon if you can avoid it!
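To make the startup order concrete, here's roughly what it looks like; the init-script paths are from memory of a 1.7-era install, so treat them as assumptions rather than gospel.

    # On the PNFS node: start postgres and the pnfs server first.
    service postgresql start
    /opt/pnfs/bin/pnfs start           # path may differ with your pnfs packaging

    # On the admin node: postgres up, then the dcache core services.
    service postgresql start
    /opt/d-cache/bin/dcache-core start

    # Back on the PNFS node: (re)start the domain running the PNFSManager.
    # I found this step had to be repeated after any restart of the admin node services.
    /opt/d-cache/bin/dcache-core restart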
Since the move we have yet to see a significant performance increase, but then it's yet to be seriously challenged. We performed some more postgres housekeeping on the admin node after the move, which made it a lot happier. Since the move we have also noticed occasional SRM SFT failures with a "permission denied" type error, although checking things in the pnfs namespace we don't see any glaring ownership errors. I'm investigating it.
We have had some other site problems this week caused by the clocks on several nodes being off by a good few minutes. It seems Lancaster's ntp server is unwell.
The room where we keep our pool nodes is suffering from heat issues. This always leaves us on edge, as our SE has had to be shut down before because of this, and the heat can make things flaky. Hopefully that machine room will get more cooling capacity, and soon.
Other site news from Peter:
Misconfigured VO_GEANT4_SW_DIR caused some WNs to have a full / partition, becoming blackholes. On top of this, a typo (extra quote) in the site-info.conf caused lcg-env.sh to be messed up, failing jobs immediately. Fixed now but flags up how sensitive the system is to tweaks. Our most stable production month was when we implemented a no-tweak policy.
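Two cheap guards against exactly this pair of problems: syntax-check site-info.conf before letting YAIM near it (it's just a sourced shell file, so an unbalanced quote shows up straight away), and have the WN health check refuse work when / is nearly full. A sketch, with the path and threshold as placeholders:

    # Catch unbalanced quotes and similar typos before re-running YAIM:
    bash -n /path/to/site-info.conf && echo "site-info.conf parses OK"

    # Simple blackhole guard for a WN health check:
    USED=$(df -P / | awk 'NR==2 {gsub("%",""); print $5}')
    [ "$USED" -lt 95 ] || { echo "/ is ${USED}% full - refusing jobs"; exit 1; }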
Manchester Weekly Update
So far this week we've had duplicate tickets from GGUS about a failure with dcache01 (affecting the ce01 SAM tests); all transfers were stalling. I couldn't debug this as my certificate expired the day I returned from a week off and my dteam membership still hasn't been updated. Restarting the dcache head node fixed this.
And this morning I discovered that a number of our worker nodes had full /scratch partitions. The problem has been tracked to a phenogrid user, and we're working with him to try to isolate the issue.
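(For reference, the usual way to pin this sort of thing down is just to see who owns the space; the path below is a placeholder for wherever your scratch area lives.)

    # Largest consumers of the scratch area, biggest last:
    du -sk /scratch/* 2>/dev/null | sort -n | tail -20
    # Or summed per user, if job directories end up owned by the pool accounts:
    find /scratch -type f -printf '%u %s\n' 2>/dev/null \
        | awk '{u[$1]+=$2} END {for (x in u) printf "%s %.1f GB\n", x, u[x]/1e9}' | sort -k2 -n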