Tuesday 6 May 2008

Manchester SL4 dcache troubles

Since the upgrade to SL4, Manchester has been experiencing problems with dcache.

1) pnfs doesn't seem to take a load beyond 200 ATLAS jobs (it times out). Alessandra has been unable to replicate the problem production is seeing: even starting 200 clients at the same time on the same file production is using, all she could see was the transfer time increasing from 2 seconds to ~190 seconds, but no timeout. On Saturday, looking at the dashboard, she found 99.8% of ~1790 jobs successfully completed in the last 24 hours, which also seems to contradict the 200-jobs-at-a-time statistics and needs to be explained.
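For reference, the test was along these lines: start N clients at the same moment on the same file and time each transfer. A minimal sketch in Python, assuming dccp is installed on the worker node and using a placeholder dcap URL rather than the real production file:

  # Minimal sketch of the load test: N simultaneous reads of the same file.
  # N and the dcap URL below are placeholders, not the real production values.
  import subprocess
  import threading
  import time

  N = 200
  SOURCE = "dcap://dcache-head.example.ac.uk/pnfs/example.ac.uk/data/atlas/testfile"

  results = []
  lock = threading.Lock()

  def one_transfer(i):
      start = time.time()
      # dccp copies the file out of dCache through the dcap door; discard the data
      rc = subprocess.call(["dccp", SOURCE, "/dev/null"])
      elapsed = time.time() - start
      with lock:
          results.append((i, rc, elapsed))

  threads = [threading.Thread(target=one_transfer, args=(i,)) for i in range(N)]
  for t in threads:
      t.start()
  for t in threads:
      t.join()

  failures = [r for r in results if r[1] != 0]
  times = [r[2] for r in results]
  print("clients: %d, failures: %d, min: %.1fs, max: %.1fs"
        % (len(results), len(failures), min(times), max(times)))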

2) The replica manager doesn't work anymore, i.e. it doesn't even start, so no resilience is active. The error is a Java InitSQL error which, according to the dcache developers, should be caused by a missing parameter. We sent them the configuration files they requested and they couldn't find anything wrong with them. We have given Greig access to the dcache and he couldn't see anything wrong either. A developer suggested moving to a newer version of dcache to solve the problem, which we had already tried, but the new version has a new problem. From the errors it seems that the schema has changed, but we haven't obtained a reply on this. In this instance the replica manager starts but cannot insert data into the database. The replica manager obviously helps cut transfer times in half because more than one pool node serves the files (I tested this on the SL3 dcache: 60 concurrent clients take at most 35 seconds each instead of 70; as the number of clients increases the effect shrinks, but it is still in the range of 30%). In any case we are talking about a handful of seconds, not the timeout range production is hitting.
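One check worth doing independently of dcache is to confirm that the replica manager's database is reachable and to list its tables, so they can be compared against what the new version expects. A minimal sketch, assuming the common setup with a PostgreSQL database called "replicas" owned by the "srmdcache" user (both names are assumptions; adjust to the actual installation):

  # Minimal sketch: verify connectivity to the replica manager database
  # and list its tables. Database and user names are assumptions about a
  # typical setup, not necessarily our exact configuration.
  import psycopg2

  conn = psycopg2.connect(host="localhost", dbname="replicas", user="srmdcache")
  cur = conn.cursor()
  cur.execute(
      "SELECT tablename FROM pg_tables WHERE schemaname = 'public' ORDER BY tablename"
  )
  for (table,) in cur.fetchall():
      print(table)
  cur.close()
  conn.close()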

3) Finally, even if all these problems were solved, Space Manager isn't compatible with Resilience, so pools with space tokens will not have the benefit of duplicates. Alessandra already asked 2 months ago what the policy was in case she had to choose. It was agreed that for these initial tests it wasn't a problem.

4) Another problem, specific to ATLAS, is that although Manchester has 2 dcache instances, they have insisted on using only 1 for quite some time. This has obviously affected production heavily. After a discussion at CERN they finally agreed to split and use both instances, but that hasn't happened yet.

5) This is minor but equally important for Manchester: VOs with DNS-style names are mishandled by dcache YAIM. We will open a ticket.

We have applied all the optimizations suggested by the developers, even those that weren't necessary, and nothing has changed. The old dcache instance, without optimizations and with the replica manager working, is taking a load of 300-400 ATLAS user jobs. According to the local users who are using it for their local production, both reading from it and writing into it, they have an almost 100% success rate (last week 7 job failures out of 2000 jobs submitted).

Applied optimizations:

1) Split pnfs from the dcache head node: we can now run 200 production jobs (but then again, as already said, the old dcache can take 400 jobs and its head node isn't split).
2) Apply postgres optimizations: no results.
3) Apply the CERN kernel optimizations for networking: transfers of small files are 30% faster, but that could also be down to a less loaded cluster (a sketch of the settings we check is below).
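The kernel changes are the usual TCP buffer tuning on the pool nodes. A minimal sketch of how we compare the running values against the targets; the numbers below are illustrative large-buffer values, not necessarily the exact CERN recipe:

  # Minimal sketch: compare current kernel TCP buffer settings with target values.
  # The targets are illustrative large-buffer tuning, not the exact CERN numbers.
  TARGETS = {
      "net/core/rmem_max": "16777216",
      "net/core/wmem_max": "16777216",
      "net/ipv4/tcp_rmem": "4096 87380 16777216",
      "net/ipv4/tcp_wmem": "4096 65536 16777216",
  }

  for key, wanted in TARGETS.items():
      with open("/proc/sys/" + key) as f:
          current = " ".join(f.read().split())
      status = "OK" if current == wanted else "DIFFERS (current: %s)" % current
      print("%-22s wanted: %-22s %s" % (key, wanted, status))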

Most of the problems might come from the attempt to keep the old data, so we will try to install a new dcache instance without it. Although it is not a very sustainable choice, it might help us understand what the problem is.
