Wednesday 31 October 2007

Manchester hat tricks

Manchester CE ce02 has been blacklisted by atlas since yesterday because it fails the ops tests; therefore it is also failing Steve Lloyd's tests and has an availability of 0. However, there is no apparent reason why these tests should fail. Besides, ce02 is doing some magic: there were 576 jobs running from 5 different VOs when I started writing this, among them atlas production jobs, and now, 12 hours later, there are 1128. I'm baffled.

Tuesday 30 October 2007

Regional VOs

vo.northgrid.ac.uk
vo.southgrid.ac.uk

Both have been created, but there are no users in them yet. They probably need to be enabled at sites to make more progress.
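
For reference, enabling a new VO with yaim means adding it to site-info.def along these lines. This is only a sketch: the VOMS server host, port and DN are placeholders, and the exact variable-name convention for dotted VO names depends on the yaim version. The VO also needs pool accounts in users.conf and entries in groups.conf before yaim is re-run.

  # hypothetical site-info.def additions for vo.northgrid.ac.uk
  # <voms-host>, <port>, <voms-dn> and <se-host> are placeholders
  VOS="$VOS vo.northgrid.ac.uk"
  VO_VO_NORTHGRID_AC_UK_SW_DIR=$VO_SW_DIR/northgrid
  VO_VO_NORTHGRID_AC_UK_DEFAULT_SE=<se-host>
  VO_VO_NORTHGRID_AC_UK_VOMS_SERVERS="vomss://<voms-host>:8443/voms/vo.northgrid.ac.uk?/vo.northgrid.ac.uk/"
  VO_VO_NORTHGRID_AC_UK_VOMSES="'vo.northgrid.ac.uk <voms-host> <port> <voms-dn> vo.northgrid.ac.uk'"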

user certificates: p12 to pem

Since I was renewing my certificate, I added a small script (p12topem.sh) to the subversion repository to convert users' p12 certificates into pem format and set their unix permissions correctly. I linked it from here:

https://www.sysadmin.hep.ac.uk/wiki/CA_Certificates_Maintenance

It assumes $HOME/.globus/user*.pem names, so it doesn't handle host certificates, but it could easily be extended.
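
For reference, the conversion itself boils down to a couple of openssl calls. A minimal sketch of the idea (not the script itself), assuming a single p12 file sitting in $HOME/.globus:

  #!/bin/sh
  # convert a user's PKCS#12 certificate into the usercert.pem/userkey.pem
  # pair expected under $HOME/.globus and set sensible permissions;
  # openssl prompts for the p12 import password
  P12=$(ls "$HOME"/.globus/*.p12 | head -1)
  openssl pkcs12 -in "$P12" -clcerts -nokeys -out "$HOME/.globus/usercert.pem"
  openssl pkcs12 -in "$P12" -nocerts -out "$HOME/.globus/userkey.pem"
  chmod 644 "$HOME/.globus/usercert.pem"
  chmod 400 "$HOME/.globus/userkey.pem"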

Monday 29 October 2007

Links to monitoring pages update

I added three links to the FCR, one per experiment, with all the UK sites selected. Hopefully this will make it easier to find out who has been blacklisted.

http://www.gridpp.ac.uk/wiki/Links_Monitoring_pages


I also added a GridMap link and linked Steve's monitoring, both the generic dteam view and the atlas one, plus the quarterly summary plots.

Friday 19 October 2007

Sheffield latest

Trying to stabilize the Sheffield cluster.
After the scheduled power outage the nodes didn't restart properly and some of the old jobs needed to be cleaned up. After that the cluster was OK apart from the BDII dropping out. We have applied the famous Kostas patch

https://savannah.cern.ch/bugs/?16625


which is getting into the release after 1.5 years. Hurray!!!

The stability of the BDII has improved and DPM seems stable. The SAM tests have been stable over the weekend and today Steve's Atlas tests showed 96% availability, which is a big improvement. However the cluster filled up this morning and the instability reappeared, a sign that there is still something to fix on the worker nodes and in the scheduling. I've added a reservation for ops and am looking at the WNs, some of which were re-imaged this morning.
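
For the record, the ops reservation in Maui looks roughly like the fragment below, assuming a Torque/Maui setup where the ops pool accounts share a unix group called ops; the group name and slot count here are illustrative, not the exact Sheffield values.

  # maui.cfg: standing reservation keeping a couple of slots free for ops
  # so SAM tests still get through when the cluster fills up
  SRCFG[ops]  PERIOD=INFINITY
  SRCFG[ops]  TASKCOUNT=2
  SRCFG[ops]  RESOURCES=PROCS:1
  SRCFG[ops]  GROUPLIST=ops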

Thursday 18 October 2007

Manchester availability

SAM tests, both ops and atlas, were failing due to dCache problems. Part of it was due to the fact that Judit had changed her DN and somehow the cron job that builds the dCache kpwd file wasn't working. In addition, dcache02 had to be restarted (both core and pnfs); as usual it started working again after that, with no apparent reason why it failed in the first place. gPlazma is not enabled yet.

That is mostly the reason for the drop in October.

Monday 15 October 2007

Availability update

Lancs site availability looks OK for the last month at 94%, which is 13% above the GridPP average, and this includes a couple of weekends lost to dCache problems. The record for July-September has been updated on Jeremy's page. We still get the occasional failed SAM submission; no idea what causes these, but they stop the availability from reaching the high nineties.

  • The June-July instability was a dCache issue with the pnfs mount options; this only affected SAM tests where files were created and immediately removed.
  • The mid-August problems were SL4 upgrade problems, caused by a few black-hole WNs. This was tracked down to the jpackage repository being down, which broke the auto-install of some WNs.
  • The mid-September problems were caused by adding a new dCache pool; it won't be brought online until the issue is understood.
Job slot occupancy looks OK, with non-HEP VOs like fusion and biomed helping to fill the slots left by moderate Atlas production.

Friday 12 October 2007

Sys Admin Requests wiki pages

YAIM has a new wiki page for sys admin requests. Maria has sent an announcement to LCG-ROLLOUT. For bookkeeping, I added a link and explanations to the sys admin wiki wishlist page, which also links to the ROC admins' management tools requests.

http://www.sysadmin.hep.ac.uk/wiki/Wishlist

Tuesday 9 October 2007

BDII doc page

After the trouble Sheffield went through with the BDII, I started a BDII page in the sysadmin wiki.

http://www.sysadmin.hep.ac.uk/wiki/BDII

Monday 8 October 2007

Manchester RGMA fixed

Fixed RGMA in Manchester. For reasons that are still obscure, it had wrong permissions on the host key files. Started an RGMA troubleshooting page on the sysadmin wiki:

http://www.sysadmin.hep.ac.uk/wiki/RGMA#RGMA
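
The fix itself was just a matter of putting the permissions back to something sensible. A rough sketch, assuming the usual gLite layout where Tomcat has its own copies of the host credentials (paths and the tomcat user/group name vary between releases):

  # check and fix the credentials used by the RGMA Tomcat service
  ls -l /etc/grid-security/tomcat-cert.pem /etc/grid-security/tomcat-key.pem
  chown tomcat4:tomcat4 /etc/grid-security/tomcat-cert.pem /etc/grid-security/tomcat-key.pem
  chmod 644 /etc/grid-security/tomcat-cert.pem
  chmod 400 /etc/grid-security/tomcat-key.pem
  /etc/init.d/tomcat5 restart    # or tomcat4, depending on the release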

EGEE '07

EGEE conference. I gave a talk in the SA1-JRA1 session, which seems to have had a positive outcome and will hopefully have some follow-up.

Talk can be found at

http://indico.cern.ch/materialDisplay.py?contribId=30&sessionId=49&materialId=slides&confId=18714


and is the sibling of the one we gave in Stockholm at the Ops workshop on problems with SA3 and within SA1.

http://indico.cern.ch/contributionDisplay.py?contribId=25&confId=12807

which had some follow up with SA3 that can be found here

https://savannah.cern.ch/task/?5267

It's alive!

Sheffield problems reviewed

During the last update to the DPM I ran into several problems.

1. The DPM update failed due to changes in the way passwords are stored in mysql.
2. A misunderstanding with the new version of yaim that rolled out at the same time.
3. Config errors with the sBDII.
4. mds-vo-name.
5. Too many roll-outs in one go for me to have a clue what broke and where to start looking.

DPM update fails
I would like to thank Graeme for the great update instructions; they helped a lot. The problems came when the update script used a different hashing method to the one used by mysql; the problem is described here http://<>. This took some finding. It also means that every time we run the yaim config on the SE we have to go back and fix the passwords again, because yaim still uses the old hash, not the new one.
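
The workaround we keep having to apply looks roughly like this, assuming the standard dpmmgr database account and that the mismatch is between the old and new mysql password hash formats; <password> stands for the value in site-info.def.

  # re-set the DPM database password in the hash format the daemons expect
  # ('dpmmgr' and the use of OLD_PASSWORD are assumptions, not gospel)
  mysql -u root -p -e "SET PASSWORD FOR 'dpmmgr'@'localhost' = OLD_PASSWORD('<password>'); FLUSH PRIVILEGES;"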

Yaim update half way and config errors
This confused the hell out of me: one minute I'm using yaim scripts to run updates, the next I have an updated version of yaim that I had to pass flags to, and that is where I guess I started to make the mistakes that led to me setting up the SE as an sBDII. After getting lost with the new yaim, I told the wrong machine that it was an sBDII and never realised.

mds-vo-name
With the help of Henry, we found out that our information was wrong, i.e. we had

mds-vo-name=local and it is now mds-vo-name=resource

Once this was changed in site-info.def and yaim was re-run on our mon box, which is also our sBDII, it all seemed to work.
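
For anyone hitting the same thing, the change boils down to something like the fragment below, plus a quick ldapsearch to check that the node answers under the new base. Hostnames are placeholders and the BDII_* variable syntax is the usual yaim one, so treat this as a sketch rather than our exact config.

  # site-info.def: point the site BDII at the resource BDIIs on the service nodes
  BDII_REGIONS="CE SE"
  BDII_CE_URL="ldap://<ce-host>:2170/mds-vo-name=resource,o=grid"
  BDII_SE_URL="ldap://<se-host>:2170/mds-vo-name=resource,o=grid"

  # quick check that the SE's resource BDII answers under the new base
  ldapsearch -x -H ldap://<se-host>:2170 -b mds-vo-name=resource,o=grid | head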