Friday, 23 May 2008

buggy glite-yaim-core

glite-yaim-core >4.0.4-1 no longer recognises VO_$vo_VOMSES, even when it is set correctly in the vo.d directory. I'm still wondering how the testing is performed: basic functionality, like configuration completing without self-evident errors, seems to be overlooked. Anyway, looking on the bright side... the bug will be fixed in yaim-core 4.0.5-something. In the meantime I had to downgrade

glite-yaim-core to 4.0.3-13
glite-yaim-lcg-ce to 4.0.2-1
lcg-CE to 3.1.5-0

which was the previously working combination.
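
For reference, this is roughly what a vo.d entry looks like on the CE; the VO name, VOMS server, port and DNs below are only placeholders, so take the real values from your VO's ID card rather than from here:

  # file: vo.d/myvo.example.org, kept next to site-info.def (filename = VO name, lower case)
  # inside a vo.d file the VO_<vo>_ prefix is dropped, so VO_<vo>_VOMSES becomes just VOMSES
  VOMSES="'myvo voms.example.org 15000 /DC=org/DC=example/OU=hosts/CN=voms.example.org myvo'"
  VOMS_SERVERS="'vomss://voms.example.org:8443/voms/myvo?/myvo'"
  VOMS_CA_DN="'/DC=org/DC=example/CN=Example CA'"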

CE problems with lcas: an update

lcas/lcmaps have the debug level set to 5 by default. Apparently it can be changed by setting the appropriate environment variables for the globus-lcas-lcmaps interface, and the variables are actually foreseen in yaim. The errors can be triggered by a mistyped DN when a non-VOMS proxy is used, which makes this a very easy way to mount a DoS attack.
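
For the record, this is the kind of thing that should quieten it down; a minimal sketch, assuming the standard LCAS_DEBUG_LEVEL / LCMAPS_DEBUG_LEVEL environment variables are honoured by your build and that the gatekeeper picks up its environment from a sysconfig-style file (the exact file depends on the install):

  # in the file sourced by the globus-gatekeeper startup script
  export LCAS_DEBUG_LEVEL=1          # 0-5, where 5 is full debug
  export LCMAPS_DEBUG_LEVEL=1
  export LCAS_LOG_FILE=/var/log/lcas.log      # keep lcas/lcmaps chatter out of the gatekeeper log
  export LCMAPS_LOG_FILE=/var/log/lcmaps.log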

After downgrading yaim/lcg-CE I've reconfigured the CE and it seems to be working now. I haven't seen any of the debug messages so far.

Thursday, 22 May 2008

globus-gatekeeper weirdness

globus-gatekeeper has started to spit out level 4 lcas/lcmaps messages out of nowhere at 4 o'clock in the morning. The log file reaches a few GB in a few hours and fills /var/log, screwing the CE. I contacted the Nikhef people for help but haven't received an answer yet. The documentation is not helpful.
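
As a stop-gap, something like a size-capped logrotate rule would at least keep /var/log from filling up while the real cause is chased; a sketch, assuming the gatekeeper log is /var/log/globus-gatekeeper.log (adjust the path to your install):

  # /etc/logrotate.d/globus-gatekeeper
  /var/log/globus-gatekeeper.log {
      size 500M
      rotate 2
      compress
      missingok
      notifempty
      copytruncate
  }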

Monday, 19 May 2008

and again

we have upgraded to dcache 1.8. At least we can look at the space manager while waiting for a solution to the cap on the number of jobs and to the replica manager not replicating.

Thursday, 15 May 2008

Still dcache

With Vladimir's help Sergey managed to start the replica manager by changing a java option in replica.batch. That is as far as it goes, though, because it still doesn't work, i.e. it doesn't produce replicas. We have just given Vladimir access to the testbed.

It seems Chris Brew has been having the same 'cannot run more than 200 jobs' problem since he upgraded. He sent an email to the dcache user-forum. This makes me think that even if the replica manager might help, it will not cure the problem.

Tuesday, 13 May 2008

Manchester dcache troubles (2)

The fnal developers are looking at the replica manager issue. The error lines found in the admindomain log also appear at fnal and don't seem to be a problem there. The search continues...

In the meantime we have doubled the memory of all dcache head nodes.

Monday, 12 May 2008

The data centre in the sky

Sent adverts out this week, trying to shift our ancient DZero farm, which was decommissioned a few years ago. It had a glorious past, with many production records set whilst generating DZero's RunII simulation data. Its dual 700MHz PentiumIII CPUs with 1GB of RAM can't cope with much these days, and it's certainly not worth the manpower to keep them online. Here is the advert if you're interested.

In other news, our MON box's system disk spat its dummy over the weekend. This was one of the three gridpp machines, so not bad going after 4 years.

Tuesday, 6 May 2008

Manchester SL4 dcache troubles

Since the upgrade to SL4, Manchester has been experiencing problems with dcache.

1) pnfs doesn't seem to take a load beyond 200 atlas jobs (it times out). Alessandra has been unable to replicate the problem production is seeing: even starting 200 clients at the same time on the same file production is using, all she could see was the transfer time increasing from 2 seconds to ~190 seconds, but no timeout (see the timing sketch after this list). On Saturday, when she looked at the dashboard, she found 99.8% of ~1790 jobs successfully completed in the last 24 hours, which also seems to contradict the '200 jobs at a time' statistics and needs to be explained.

2) The replica manager doesn't work anymore, i.e. it doesn't even start, so no resilience is active. The error is a java InitSQL one which, according to the dcache developers, should have been caused by a missing parameter. We sent them the configuration files they asked for and they couldn't find anything wrong with them. We have given Greig access to dcache and he couldn't see anything wrong either. A developer suggested moving to a newer version of dcache to solve the problem, which we had already tried, but the new version has a new problem: from the errors it seems that the schema has changed, though we didn't get a reply on this. In this instance the replica manager starts but cannot insert data into the database. The replica manager obviously helps, cutting transfer times in half because there is more than one pool node serving the files (I tested this on the SL3 dcache: 60 concurrent clients take at most 35 seconds each instead of 70; as the number of clients increases the effect is smaller, but still in the range of 30%). In any case we are talking about a fistful of seconds, not the timeout range that production is hitting.

3) Finally, even if all these problems were solved, Space Manager isn't compatible with Resilience, so pools with space tokens will not have the benefit of duplicates. Alessandra already asked 2 months ago what the policy was in case she had to choose. It was agreed that for these initial tests it wasn't a problem.

4) Another problem, specific to Atlas, is that although Manchester has 2 dcache instances they have insisted on using only 1 for quite some time. This has obviously affected production heavily. After a discussion at CERN they finally agreed to split and use both instances, but that hasn't happened yet.

5) This is minor but equally important for Manchester: VOs with DNS-style names are mishandled by dcache YAIM. We will open a ticket.
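
For what it's worth, the '200 clients on the same file' test in point 1 can be reproduced with something along these lines; it's only a rough sketch of that kind of test, not the actual script, and the door host, port and pnfs path are made up, so point them at your own dcache before running it:

  #!/bin/bash
  # launch N concurrent dcap reads of the same file and print how long each one took
  N=${1:-200}
  SRC="dcap://dcache-door.example.ac.uk:22125/pnfs/example.ac.uk/data/atlas/testfile"
  for i in $(seq 1 "$N"); do
      ( t0=$(date +%s)
        dccp "$SRC" /dev/null > /dev/null 2>&1
        echo "client $i finished in $(( $(date +%s) - t0 )) s" ) &
  done
  wait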

We have applied all the optimizations suggested by the developers, even those that weren't necessary, and nothing has changed. The old dcache instance, without optimizations and with the replica manager working, is taking a load of 300-400 atlas user jobs. According to the local users who use it for their local production, both reading from it and writing into it, they have an almost 100% success rate (last week 7 job failures out of 2000 jobs submitted).

Applied optimizations:

1) Split pnfs from the dcache head node: we can now run 200 production jobs (but then again, as already said, the old dcache can take 400 jobs and its head node isn't split).
2) Apply the postgres optimizations: no results.
3) Apply the CERN kernel optimizations for networking: transfers of small files are 30% faster, but that could also be down to a less loaded cluster (a sketch of the typical settings is after this list).
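
For completeness, the CERN networking tweaks in point 3 boil down to enlarging the TCP buffers; the exact values we applied aren't listed here, so the ones below are only the commonly quoted ones and should be checked against the CERN recipe before use:

  # /etc/sysctl.conf additions (illustrative values), then apply with: sysctl -p
  net.core.rmem_max = 16777216
  net.core.wmem_max = 16777216
  net.ipv4.tcp_rmem = 4096 87380 16777216
  net.ipv4.tcp_wmem = 4096 65536 16777216
  net.core.netdev_max_backlog = 2500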

Most of the problems might come from the attempt to keep the old data, so we will try to install a new dcache instance without it. Although it is not a very sustainable choice, it might help us understand what the problem is.