Friday, 20 July 2007

Replica manager failure solution? Permission denied

We're still facing down the same dCache problem. Consultation with the experts has established that the problem isn't (directly) a gPlazma one: the requests are being assigned to the wrong user (or to no user at all) and then failing with permission errors. This happens even when access requests of a similar nature, made with the same proxy immediately before and after, succeed without a glitch. The almost complete absence of any sign of this problem in the logs is only adding to the frustration. I've sent the developers every config file they've asked for and some they didn't, upped the logging levels and scoured the logs until my eyes were almost bleeding. And I still haven't a clue how to fix this thing. There's no real pattern, other than that failures seem to happen more often during the working day (though that's from a small statistical sample), and there are no corresponding load spikes.

We have a ticket open against us about this and the word "quarantine" has been mentioned, which is never a good thing. It sometimes feels like there's a process running in our SRM that rolls a die every now and again, and if it comes up a 1 we fail our saving throw vs. SAM test failure. If we could just find a pattern or a cause, we'd be in a much better position. All we can do is keep scouring, maybe make the odd tweak, and see if something presents itself.

