Thursday, 20 November 2008
WMS Overview
I gave a talk about the WMS for the benefit of the Manchester users; it might be of interest to other people. The talk can be found here:
WMS Overview
Monday, 10 November 2008
DPM "File Not Found"- but it's right there!
Lancaster's been having a bad run with atlas jobs the last few weeks. We've been failing jobs with error messages like:
"/dpm/lancs.ac.uk/home/atlas/atlasproddisk/panda/panda.60928294-7fd4-4007-af5e-4cdc3c8934d3_dis18552080/HITS.024853._00326.pool.root.2 : No such file or directory (error 2 on fal-pygrid-35.lancs.ac.uk)"
However, when we break out the DPM tools and track the file down on disk, it's right where it should be, with the correct permissions, size and age - not even attempting to hide. The log files show nothing directly exciting, although there are a lot of deletes going on. Ganglia comes up with something a little more interesting on the DPM head node: heavy use of swap and erratic high CPU use. A restart of the dpm and dpns services seems to have calmed things somewhat this morning, but memory usage is climbing pretty fast:
http://fal-pygrid-17.lancs.ac.uk:8123/ganglia/?r=day&sg=&c=LCG-ServiceNodes&h=fal-pygrid-30.lancs.ac.uk
The next step is for me to go shopping for RAM: the head node is a sturdy box but only has 2 GB, and an upgrade to 4 GB should give us more breathing space. The real badness, though, is that all this swapping should merely degrade performance, not lead to the situation we have now, where the DPM database, under high load, seems to return false negatives for queries about files - telling us they don't exist when they're right there on disk where they should be.
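When chasing one of these, a quick sanity check is to compare what the DPM name server reports with what is actually sitting on the disk server. A minimal sketch, assuming you already know the physical path of the replica on the pool node (the physical path below is hypothetical; dpns-ls is the standard DPNS listing tool):

import os
import subprocess

# Logical name taken from the error above; the physical path is a hypothetical
# example of where the replica lives on the disk server's filesystem.
LFN = "/dpm/lancs.ac.uk/home/atlas/atlasproddisk/panda/panda.60928294-7fd4-4007-af5e-4cdc3c8934d3_dis18552080/HITS.024853._00326.pool.root.2"
PHYSICAL = "/gridstore/atlas/2008-11-10/HITS.024853._00326.pool.root.2.1234.0"

# What does the name server think? Run wherever the DPM client tools are
# installed, with DPNS_HOST pointing at the head node.
ns = subprocess.run(["dpns-ls", "-l", LFN], capture_output=True, text=True)
print("dpns-ls rc=%d\n%s%s" % (ns.returncode, ns.stdout, ns.stderr))

# What is actually on the pool node? Run this part on the disk server itself.
if os.path.exists(PHYSICAL):
    st = os.stat(PHYSICAL)
    print("replica present: %d bytes, uid=%d, gid=%d" % (st.st_size, st.st_uid, st.st_gid))
else:
    print("replica missing on disk")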
"/dpm/lancs.ac.uk/home/atlas/atlasproddisk/panda/panda.60928294-7fd4-4007-af5e-4cdc3c8934d3_dis18552080/HITS.024853._00326.pool.root.2 : No such file or directory (error 2 on fal-pygrid-35.lancs.ac.uk)"
However when we break out the great dpm tools and track down this file to disk, it's right where it should be, with correct permissions, size and age- not even attempting to hide. The log files show nothing directly exciting, although there are a lot of deletes going on. Ganglia comes up with something a little more interesting on the dpm head node- heavy use of swap memory and erratic high CPU use. A restart of the dpm and dpns services seems to have calmed things somewhat this morning, but mem usage is climbing pretty fast:
http://fal-pygrid-17.lancs.ac.uk:8123/ganglia/?r=day&sg=&c=LCG-ServiceNodes&h=fal-pygrid-30.lancs.ac.uk
The next step is for me to go shopping for RAM, the headnode is a sturdy box but only has 2GB, an upgrade to 4 should give us more breathing space. But the real badness comes from the fact that all this swapping should decrease performance but not lead to the situation we have where the dpm databases, under high load, seem to return false negatives to queries about files- telling us they don't exist when they're right there on disk where they should be.
Friday, 7 November 2008
Regional nagios
I installed a regional Nagios yesterday. It turns out to be quite easy, and the Nagios group is being quite helpful. I followed the tutorial given by Steve at EGEE08:
https://twiki.cern.ch/twiki/bin/view/EGEE/GridMonitoringNcgYaimTutorial
I updated it as I went along instead of writing a parallel document.
Below is the URL of the test installation. It might get reinstalled a few times in the next few days to test other features.
https://niels004.tier2.hep.manchester.ac.uk/nagios
Wednesday, 27 August 2008
DPM in Manchester
Manchester now has a fully working DPM with 6 TB. There are two space tokens, ATLASPRODDISK and ATLASDATADISK. The service has been added to the GOC and the information system, and the space tokens are published. The errors have been corrected and the system has been passing the SAM tests continuously since yesterday.
I added some information to the wiki
https://www.gridpp.ac.uk/wiki/Manchester_DPM#Atlas_Space_tokens
https://www.gridpp.ac.uk/wiki/Manchester_DPM#Errors_along_the_path
Monday, 21 July 2008
DPM space tokens
Below are the first tests of setting up space tokens on the DPM testbed:
https://www.gridpp.ac.uk/wiki/Manchester_DPM
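For the record, a reservation on the DPM head node is a one-line command per token; scripted, the setup looks roughly like this. This is a minimal sketch based on the usual dpm-reservespace options (--gspace, --lifetime, --group, --token_desc); the sizes, lifetimes and groups shown are illustrative assumptions, not the values used on the testbed.

import subprocess

# Illustrative reservations; sizes and groups are assumptions.
tokens = [
    ("ATLASPRODDISK", "3T", "atlas/Role=production"),
    ("ATLASDATADISK", "3T", "atlas"),
]

for desc, size, group in tokens:
    subprocess.run(
        ["dpm-reservespace",
         "--gspace", size,       # guaranteed space
         "--lifetime", "Inf",    # permanent reservation
         "--group", group,       # VOMS group/role allowed to write
         "--token_desc", desc],
        check=True,
    )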
Tuesday, 3 June 2008
Slaughtering ATLAS jobs?
These heavy ion jobs have generated lots of discussion in various forums. They are ATLAS heavy ion simulations (Pb-Pb) which are being killed in two ways: (1) by the site's batch system if the queue walltime limit is reached, or (2) by the ATLAS pilot because the log file modification time hasn't changed in 24 hours.
Either way, sites shouldn't worry if they see these; ATLAS production is aware. They're only single-event jobs, and you might also see massive memory usage, > 2 GB/core. :-)
According to Steve, the new WN should allow jobs to gracefully handle batch kills with a suitable delay between SIGTERM and SIGKILL.
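To take advantage of that delay, a job wrapper only needs to trap SIGTERM and tidy up before the SIGKILL arrives. A minimal sketch (the clean-up actions are hypothetical placeholders):

import signal
import sys
import time

def handle_sigterm(signum, frame):
    # The batch system has asked us to stop: flush logs, stage out partial output, etc.
    print("SIGTERM received, cleaning up before SIGKILL arrives...")
    sys.exit(143)  # conventional exit code for termination on SIGTERM

signal.signal(signal.SIGTERM, handle_sigterm)

# Stand-in for the real payload.
while True:
    time.sleep(10)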
Friday, 23 May 2008
buggy glite-yaim-core
glite-yaim-core, from version 4.0.4-1 onwards, no longer recognises VO_$vo_VOMSES even when it is set correctly in the vo.d directory. I'm still wondering how the testing is performed; primary functionality, like completing without self-evident errors, seems to be overlooked. Anyway, looking on the bright side... the bug will be fixed in yaim-core 4.0.5-something. In the meantime I had to downgrade
glite-yaim-core to 4.0.3-13
glite-yaim-lcg-ce to 4.0.2-1
lcg-CE to 3.1.5-0
which was the combination previously working.
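A downgrade like this can be done in one rpm transaction (a minimal sketch; the RPM file names are hypothetical and depend on where the packages are cached locally):

import subprocess

# Hypothetical RPM file names matching the working combination listed above.
packages = [
    "glite-yaim-core-4.0.3-13.noarch.rpm",
    "glite-yaim-lcg-ce-4.0.2-1.noarch.rpm",
    "lcg-CE-3.1.5-0.noarch.rpm",
]

# --oldpackage lets rpm "upgrade" to an older version in a single transaction.
subprocess.run(["rpm", "-Uvh", "--oldpackage"] + packages, check=True)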
CE problems with lcas - an update
lcas/lcmaps have the debug level set to 5 by default. Apparently it can be changed by setting the appropriate environment variables for the globus-lcas-lcmaps interface; the variables are actually foreseen in YAIM. The errors can be triggered by a mistyped DN when a non-VOMS proxy is used, which is a very easy way to generate a DoS attack.
After downgrading yaim/lcg-CE I've reconfigured the CE and it seems to be working now; I haven't seen any of the debug messages so far.
Thursday, 22 May 2008
globus-gatekeeper weirdness
globus-gatekeeper has started to spit out level 4 lcas/lcmaps messages out of nowhere at 4 o'clock in the morning. The log file reaches a few GB in size within a few hours and fills /var/log, breaking the CE. I contacted the Nikhef people for help but haven't received an answer yet. The documentation is not helpful.
Monday, 19 May 2008
and again
We have upgraded to 1.8. At least we can look at the space manager while we wait for a solution to the cap on the number of jobs and to the replica manager not replicating.
Thursday, 15 May 2008
Still dcache
With Vladimir's help, Sergey managed to start the replica manager by changing a Java option in replica.batch. That is as far as it goes, though, because it still doesn't work, i.e. it doesn't produce replicas. We have just given Vladimir access to the testbed.
It seems Chris Brew has been having the same 'cannot run more than 200 jobs' problem since he upgraded; he has sent an email to the dcache user forum. This makes me think that even if the replica manager might help, it will not cure the problem.
Tuesday, 13 May 2008
Manchester dcache troubles (2)
The FNAL developers are looking at the replica manager issue. The error lines found in the admindomain log also appear at FNAL and don't seem to be a problem there. The search continues...
In the meantime we have doubled the memory of all dcache head nodes.
Monday, 12 May 2008
The data centre in the sky
Sent adverts out this week, trying to shift our ancient DZero farm, which was decommissioned a few years ago. It had a glorious past, with many production records set while generating DZero's Run II simulation data. Its dual 700 MHz Pentium III CPUs, with 1 GB RAM, can't cope with much these days, and it's certainly not worth the manpower to keep them online. Here is the advert if you're interested.
In other news, our MON box's system disk spat its dummy over the weekend. This was one of the three GridPP machines - not bad going after 4 years.
Tuesday, 6 May 2008
Manchester SL4 dcache troubles
Since the upgrade to SL4, Manchester has been experiencing problems with dcache.
1) pnfs doesn't seem to take a load beyond 200 ATLAS jobs (it times out). Alessandra has been unable to replicate the problem production is seeing: even starting 200 clients at the same time on the same file production is using, all she could see was the transfer time increasing from 2 seconds to ~190 seconds, but no timeout (a sketch of this kind of concurrency test is given at the end of this post). On Saturday, when she looked at the dashboard, she found 99.8% of ~1790 jobs successfully completed in the last 24 hours, which also seems to contradict the "200 jobs at a time" statistics and needs to be explained.
2) The replica manager doesn't work anymore, i.e. it doesn't even start, so no resilience is active. The error is a Java InitSQL one which, according to the dcache developers, should be caused by a missing parameter. We sent them the requested configuration files and they couldn't find anything wrong with them. We have given Greig access to dcache and he couldn't see anything wrong either. A developer suggested moving to a newer version of dcache to solve the problem, which we had already tried, but the new version has a new problem: from the errors it seems that the schema has changed, though we didn't get a reply about this. In this instance the replica manager starts but cannot insert data into the database. The replica manager obviously helps to cut transfer times in half, because there is more than one pool node serving the files (I tested this on the SL3 dcache: 60 concurrent clients take at most 35 seconds each instead of 70; if the number of clients increases the effect is smaller, but still in the range of 30%). In any case we are talking about a fistful of seconds, not the timeout range production is seeing.
3) Finally, even if all these problems were solved, the Space Manager isn't compatible with Resilience, so pools with space tokens will not have the benefit of duplicates. Alessandra asked two months ago what the policy was in case she had to choose; it was agreed that for these initial tests it wasn't a problem.
4) Another problem specific to ATLAS is that although Manchester has two dcache instances, they have insisted on using only one for quite some time. This has obviously affected production heavily. After a discussion at CERN they finally agreed to split and use both instances, but that hasn't happened yet.
5) This is minor but equally important for Manchester: VOs with DNS-style names are mishandled by dcache YAIM. We will open a ticket.
We have applied all the optimizations suggested by the developers, even those not strictly necessary, and nothing has changed. The old dcache instance, without optimizations and with the replica manager working, is taking a load of 300-400 ATLAS user jobs. According to the local users who use it for their local production, both reading from it and writing into it, they have an almost 100% success rate (last week, 7 job failures out of 2000 jobs submitted).
Applied optimizations:
1) Split pnfs from the dcache head node: we can now run 200 production jobs (but then again, as already said, the old dcache can take 400 jobs and its head node isn't split).
2) Apply the postgres optimizations: no results.
3) Apply the kernel optimizations for networking from CERN: transfers of small files are 30% faster, but that could also be down to a less loaded cluster.
Most of the problems might come from the attempt to keep the old data, so we will try to install a new dcache instance without it. Although that is not a very sustainable choice, it might help us understand what the problem is.
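For reference, the concurrency test mentioned in point 1 was essentially of the following shape. This is a minimal sketch, not the script actually used: the dcap door URL and file are hypothetical, and dccp is assumed to be available as the dcache copy client on the worker nodes.

import os
import subprocess
import threading
import time

# Hypothetical test parameters: a dcap door and a file already stored in the SE.
SOURCE = "dcap://dcache-door.example.ac.uk:22125/pnfs/example.ac.uk/data/atlas/testfile"
CLIENTS = 60  # number of concurrent reads

def one_transfer(i, results):
    dest = "/tmp/dccp_test_%d" % i
    start = time.time()
    rc = subprocess.call(["dccp", SOURCE, dest])  # dccp: the dcache copy client
    results[i] = (rc, time.time() - start)
    if os.path.exists(dest):
        os.remove(dest)

results = {}
threads = [threading.Thread(target=one_transfer, args=(i, results)) for i in range(CLIENTS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

times = [t for rc, t in results.values() if rc == 0]
print("successful: %d/%d, max %.1fs, mean %.1fs"
      % (len(times), CLIENTS, max(times), sum(times) / len(times)))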
Wednesday, 9 April 2008
Athena release 14 - new dependency
Athena release 14 has a new dependency on the 'libgfortran' package. Sites with local ATLAS users may want to check they have it. The runtime error message is rather difficult to decipher; the build-time error, however, is explicit. I've added the package to the required-packages twiki page.
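A quick way to check a worker node is to ask the loader whether it can find the library at all (a minimal sketch; normally you would just query the OS package manager):

from ctypes.util import find_library

# find_library returns None if the shared library cannot be located.
lib = find_library("gfortran")
if lib is None:
    print("libgfortran not found - install the libgfortran package")
else:
    print("libgfortran present: %s" % lib)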
Monday, 4 February 2008
Power outage
A site-wide power outage occurred at Lancaster Uni this evening. The juice is now flowing, but some intervention is required tomorrow morning before we're back to normal operations.
Wednesday, 30 January 2008
Liverpool update
From Mike's reply:
* We'll stay with dcache and are about to rebuild the whole SE (and the whole cluster including a new multi-core CE) when we shut down for a week soon to install SL4. Everything is under test at present and we are upgrading the rack software servers to 250GB RAID1 to cope with the 100GB size of the ATLAS code.
* We are still testing Puppet (on our non-LCG cluster) as our preferred solution. It looks fine but we are not yet sure it will scale to many 100s of nodes.
Thursday, 20 December 2007
Lancaster's Winter Dcache Dramas
It's been a tough couple of months for Lancaster, with our SE giving us a number of problems.
Our first drama, at the start of the month, was caused by unforeseen complications with our upgrade to dcache 1.8. Knowing that we were low on the support list due to being only a Tier 2, but emboldened by the highly useful SRM 2.2 workshop in Edinburgh and the good few years we've spent in the dcache trenches, we decided to take the plunge. We then faced a good few days of downtime beyond the one we had scheduled, first facing down a number of bugs in the early versions of dcache 1.8 (fixed by upgrading to higher patch levels), then hitting problems when changes in the gridftp protocol highlighted inconsistencies between the users on our pnfs node and our gridftp door nodes. Due to a hack long ago, several VOs had different user.conf entries, and therefore different UIDs, on our door nodes and pnfs node. This had never caused problems before, but after the upgrade the doors were passing the UIDs to the pnfs node, so new files and directories were created with the correct group (as the GIDs were consistent) but the wrong UID, causing permission troubles whenever a delete was called. This was a classic case of a problem whose cause was hell to track down but which, once figured out, was thankfully easy to solve. Once we fixed that one it was green tests for a while.
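A simple consistency check across the pnfs node and the door nodes can save a lot of head-scratching here. A minimal sketch, assuming the expected pool account names and UIDs are known (the names and numbers below are hypothetical):

import pwd

# Hypothetical pool accounts and the UIDs they should map to
# (e.g. as defined in the users.conf used on the pnfs node).
expected = {"atlas001": 18101, "dzero001": 18201, "lhcb001": 18301}

for name, uid in sorted(expected.items()):
    try:
        actual = pwd.getpwnam(name).pw_uid
    except KeyError:
        print("%s: account missing" % name)
        continue
    status = "OK" if actual == uid else "MISMATCH (got %d)" % actual
    print("%s: expected uid %d -> %s" % (name, uid, status))

# Run the same script on the pnfs node and on each door node and compare the output.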
Then dcache drama number two came along a week later: a massive postgres database failure on our pnfs node. The postgres database contains all the information that dcache uses to match the fairly anonymously named files on the pool nodes to entries in the pnfs namespace; without it dcache has no idea which files are which, so with it bust the files are almost as good as lost. Which is why it should be backed up regularly. We did this twice daily - at least we thought we did: a cron problem meant that our backups hadn't been made for a while, and a rollback to the last one would mean a fair amount of data might be lost. So we spent three days doing arcane SQL rituals to try and bring the database back, but it had corrupted itself too heavily and we had to roll back.
The cause of the database crash and corruption was a transaction ID "wraparound" error. Postgres requires regular "vacuuming" to clean up after itself, otherwise it essentially starts writing over itself. This crash took us by surprise: not only do we have postgres looking after itself with optimised auto-vacuuming occurring regularly, but during the 1.8 upgrade I took the time to do a manual full vacuum, only a week before the crash. Also, postgres is designed to freeze when at risk of a wraparound error rather than overwrite itself, and this didn't happen. The first we heard of it, pnfs and postgres had stopped responding and there were wraparound error messages in the logs - no warning of the impending disaster.
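For anyone who wants an early warning, the distance to wraparound can be polled directly from postgres (a minimal sketch, assuming psycopg2 and local access to the instance; the connection details are hypothetical):

import psycopg2

# Hypothetical connection details for the postgres instance on the pnfs node.
conn = psycopg2.connect(host="localhost", user="postgres", dbname="template1")
cur = conn.cursor()

# age(datfrozenxid) creeps towards ~2 billion as a database approaches transaction
# ID wraparound; regular (auto)vacuuming should keep it well below that.
cur.execute("SELECT datname, age(datfrozenxid) FROM pg_database ORDER BY 2 DESC")
for datname, xid_age in cur.fetchall():
    warn = "  <-- check vacuuming!" if xid_age > 1500000000 else ""
    print("%-20s %12d%s" % (datname, xid_age, warn))

cur.close()
conn.close()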
Luckily the data rollback seems not to have affected the VOs too much. We had one ticket from ATLAS, who, after we explained our situation, handily cleaned up their file catalogues. The guys over at dcache hinted at a possible way of rebuilding the lost databases from the pnfs logs, although sadly this isn't simply a case of recreating pnfs-related SQL entries, and they've been too busy with Tier 1 support to look into it further.
Since then we've fixed our backups and added a Nagios test to ensure the backups are less than a day old. The biggest trouble here was that our reluctance to use an old backup meant we wasted over three days banging our heads trying to bring back a dead database, rather than the few hours it would have taken to restore from backup and verify things were working. And it appears the experiments were more affected by us being in downtime than by the loss of easily replicatable data. In the end I think I caused more trouble by going over the top on my data recovery attempts than if I had been gung-ho and used the old backup once things looked bleak for the remains of the postgres database. At least we've now set things up so the likelihood of it happening again is slim, but the circumstances behind the original database errors are still unknown, which leaves me a little worried.
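The backup-age test itself is short enough to sketch in full: a minimal Nagios-style check, with a hypothetical path for the latest dump.

import os
import sys
import time

BACKUP = "/backup/pnfs/pnfs_db_latest.dump"  # hypothetical location of the latest dump
MAX_AGE_HOURS = 24.0

if not os.path.exists(BACKUP):
    print("CRITICAL: backup %s is missing" % BACKUP)
    sys.exit(2)

age_hours = (time.time() - os.path.getmtime(BACKUP)) / 3600.0
if age_hours > MAX_AGE_HOURS:
    print("CRITICAL: backup is %.1f hours old" % age_hours)
    sys.exit(2)

print("OK: backup is %.1f hours old" % age_hours)
sys.exit(0)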
Have a good Winter Festival and Holiday Season everyone - but before you head off to your warm fires and cold beers, check the age of your backups, just in case...
Thursday, 6 December 2007
Manchester various
- The core dump path was set to /tmp/core-various-param in sysctl.conf and was creating a lot of problems for DZero jobs. It was also creating problems for others, as the core files were filling /tmp and, consequently, maradona errors were looming. The path has been changed back to the default, and I have also set the core size to 0 in limits.conf to prevent the same problem repeating itself, to a lesser degree, in /scratch (see the sketch after this list).
- dcache doors were open on the wrong nodes. node_config is now correct, but it was copied before stopping the dcache-core service, and now /etc/init.d/dcache-core stop doesn't have any effect. The doors also have a keep-alive script, so it is not enough to kill the Java processes; one has to kill the parents as well.
- cfengine config files are being rewritten to make them less cryptic.
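On the core-dump point above, the limits.conf entry would be something like "* soft core 0" (an assumption about the exact line used); per process, the same effect can be illustrated with the resource module:

import resource

# Equivalent of "core size 0": no core files will be written by this process
# or by children started after this call.
resource.setrlimit(resource.RLIMIT_CORE, (0, 0))

soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
print("core file size limit: soft=%d, hard=%d" % (soft, hard))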