Monday, 19 November 2007

Manchester black holes for atlas

Atlas jobs failing because of the following errors:
====================================================
All jobs fail because 2 bad nodes fail like this:
/opt/globus/bin/globus-gass-cache: line 4: globus_source: command not found
/opt/globus/bin/globus-gass-cache: line 6: /globus-gass-cache-util.pl: No such file or directory
/opt/globus/bin/globus-gass-cache: line 6: exec: /globus-gass-cache-util.pl: cannot execute: No such file or directory
/opt/globus/bin/globus-gass-cache: line 4: globus_source: command not found
/opt/globus/bin/globus-gass-cache: line 6: /globus-gass-cache-util.pl: No such file or directory
/opt/globus/bin/globus-gass-cache: line 6: exec: /globus-gass-cache-util.pl: cannot execute: No such file or directory
submit-helper script running on host bohr1428 gave error: could not add entry in the local gass cache for stdout
===================================================
Problem caused by

${GLOBUS_LOCATION}/libexec/globus-script-initializer

being empty.
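
To find other nodes with the same problem, something along these lines works (a sketch, assuming passwordless ssh from the CE; the node names are illustrative):

for node in bohr1428 bohr1429; do    # substitute the real node list
    ssh $node 'test -s ${GLOBUS_LOCATION:-/opt/globus}/libexec/globus-script-initializer || echo "$HOSTNAME: globus-script-initializer missing or empty"'
done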

Tuesday, 13 November 2007

Some SAM tests don't respect downtime

Sheffield is shown as failing the CE-host-cert-valid test while in downtime. All SAM tests should behave the same way. This is on top of the already very confusing display of the results on alternating lines. I opened a ticket:

https://gus.fzk.de/ws/ticket_info.php?ticket=28983

Sunday, 11 November 2007

Manchester sw repository reorganised

To simplify the maintenance of multiple releases and architectures I reorganised the software (yum) repository in Manchester.

Whereas before we had to maintain a yum.conf for each release and architecture, now we just need to add links in the right place. I wrote a recipe on my favourite site:

http://www.sysadmin.hep.ac.uk/wiki/Yum

This will also allow us to remove the complications introduced in the cfengine conf files to maintain multiple yum.conf versions.
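
The idea, roughly (paths are illustrative, not the actual Manchester layout): keep one real tree per release and architecture, and point a generic symlink at whichever one is current, so the yum.conf on the nodes never has to change.

mkdir -p /var/www/html/yum/glite/3.0.2/sl3/i386
ln -s /var/www/html/yum/glite/3.0.2 /var/www/html/yum/glite/current
# yum.conf on the nodes points at .../glite/current/sl3/i386,
# so a new release only needs the symlink moved:
ln -sfn /var/www/html/yum/glite/3.1.0 /var/www/html/yum/glite/current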

Friday, 9 November 2007

1,000,000th job has passed by

This week the batch system ticked over its millionth job. The lucky user was biomed005, and no, it was nothing to do with rsa768. The 0th job was way back in August 2005, when we replaced the old batch system with torque. How many of these million were successful? I shudder to think, but I'm sure it's improving :-)

In other news, we're having big problems with our campus firewall: it blocks outgoing ports 80 and 443 to ensure that traffic passes through the university proxy server. Unfortunately, some web clients such as wget and curl make it impossible to use the proxy for these ports whilst bypassing it for all other ports, and Atlas needs exactly this for the new PANDA pilot job framework. We installed a squid proxy of our own (good idea, Graeme), which allows for greater control. No luck handling the https traffic though, so we really need to get a hole punched in the campus firewall. I'm confident the uni systems guys will oblige ;-)
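
For reference, the client-side settings involved, sketched with a hypothetical proxy host: http_proxy and https_proxy route wget and curl through the squid, but the matching no_proxy variable only excludes by host, not by port, which is the root of the problem described above.

export http_proxy=http://squid.example.ac.uk:3128
export https_proxy=http://squid.example.ac.uk:3128
# no_proxy can list hosts to bypass, but not "every port except 80/443":
export no_proxy=localhost,.example.ac.uk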

Sheffield in downtime

Sheffield has been put in downtime until Monday 12/11/2007 at 5 pm.


Reason: a power cut affecting much of central Sheffield. A substation exploded, and nobody is even allowed inside the physics building.

Matt is also back in the GOCDB now as site admin.

Saturday, 3 November 2007

Manchester CEs and RGMA problems

We still don't know what happened to ce02: why the ops tests didn't work and my jobs hung forever while everybody else could run (atlas claims 88% efficiency in the last 24 hours). Anyway, I updated ce02 manually (rpm -ihv) to the same set of rpms that are on ce01, and the problem I had, globus hanging, has disappeared. The ops tests are successful again and we got out of the atlas blacklisting. I also fixed ce01, which yesterday picked up the wrong java version. I need to change a couple of things on the kickstart server so that these incidents don't happen again.
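
What the manual update amounted to, more or less (a sketch, not the exact commands I typed):

ssh ce01 'rpm -qa | sort' > ce01.rpms
ssh ce02 'rpm -qa | sort' > ce02.rpms
diff ce01.rpms ce02.rpms     # shows what ce02 is missing or has extra
# then on ce02, install the missing rpms:
rpm -ihv <missing packages>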

I also had to manually kill tomcat, which was not responding, and restart it on the MON box. Accounting published successfully after this.

Friday, 2 November 2007

Sheffield accounting

From Matt:

/opt/glite/bin/apel-pbs-log-parser
is trying to contact the CE on port 2170, I think expecting the site BDII to be there.
I changed ce_node to mon_node in /opt/glite/etc/glite-apel-pbs/parser-config-yaim.xml
and now things seem much improved.

However, I am getting this

Fri Nov 2 13:48:39 UTC 2007: apel-publisher - Record/s found: 8539
Fri Nov 2 13:48:39 UTC 2007: apel-publisher - Checking Archiver is Online
Fri Nov 2 13:49:40 UTC 2007: apel-publisher - Unable to retrieve any response while querying the GOC
Fri Nov 2 13:49:40 UTC 2007: apel-publisher - Archiver Not Responding: Please inform apel-support@listserv.cclrc.ac.uk
Fri Nov 2 13:49:40 UTC 2007: apel-publisher - WARNING - Received a 'null' result set while querying the 'LcgRecords' table using rgma, this probably means the GOC is currently off-line, will therefore cancel attempt to re-publish

running /opt/glite/bin/apel-publisher on the mon box.

If the GOC machine is really off-line, I'll have to wait to publish the missing data for Sheffield.
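
For the record, a quick way to check which box is actually answering as the site BDII on port 2170 (hostname illustrative):

ldapsearch -x -H ldap://mon.example.ac.uk:2170 -b o=grid '(objectClass=GlueSite)' GlueSiteName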

Manchester SL4

Not big news for other sites, but I have installed an SL4 UI in Manchester. Still 32-bit, because the UIs at the Tier2 are old machines. However, I'd like to express my relief that, once the missing third-party rpms were in place, the installation went smoothly.

After some struggling with cfengine keys, which I was despairing of solving by the end of the evening, I also managed to install a 32-bit WN. At least cfengine doesn't give any more errors and runs happily.
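
The key exchange that finally made cfengine happy boils down to this (a sketch, assuming cfengine 2; the server name and IP are illustrative):

# on the new WN, generate the host's key pair:
cfkey     # creates /var/cfengine/ppkeys/localhost.pub and localhost.priv
# copy the server's public key into place rather than trusting blindly:
scp cfserver:/var/cfengine/ppkeys/localhost.pub /var/cfengine/ppkeys/root-192.168.0.1.pub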

Tackling dcache now and the new yaim structure.

Thursday, 1 November 2007

Sheffield

Quiet night for Sheffield after the reimaged nodes were taken offline in PBS. Matt also increased the number of ssh connections allowed on the CE from 10 to 100, to reduce the timeouts between the WNs and the CE and the incidence of Maradona errors.
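
The two knobs involved, roughly (node name illustrative, and I'm assuming the ssh limit was raised via MaxStartups, whose default is 10):

pbsnodes -o node042     # mark a reimaged node offline so PBS stops sending jobs to it
# on the CE, in /etc/ssh/sshd_config:
#     MaxStartups 100
# then reload sshd:
service sshd reload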

Wednesday, 31 October 2007

Manchester hat tricks

Manchester CE ce02 has been blacklisted by atlas since yesterday because it fails the ops tests; it is therefore also failing Steve Lloyd's tests and has availability 0. However, there is no apparent reason why these tests should fail. Besides, ce02 is doing some magic: there were 576 jobs running from 5 different VOs, among them atlas production jobs, when I started writing this, and now, 12 hours later, there are 1128. I'm baffled.

Tuesday, 30 October 2007

Regional VOs

vo.northgrid.ac.uk
vo.southgrid.ac.uk

both created; no users in them yet. We probably need to enable them at sites to make more progress.

user certificates: p12 to pem

Since I was renewing my certificate, I added a small script (p12topem.sh) to the subversion repository to convert users' p12 certificates into pem format and set their unix permissions correctly. I linked it from here:

https://www.sysadmin.hep.ac.uk/wiki/CA_Certificates_Maintenance

It assumes $HOME/.globus/user*.pem names, so it doesn't handle host certificates, but it could easily be extended.
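
The conversion itself is just two openssl calls plus permissions; a minimal sketch of what the script does (not necessarily identical to p12topem.sh):

cd $HOME/.globus
openssl pkcs12 -in usercert.p12 -clcerts -nokeys -out usercert.pem   # public certificate
openssl pkcs12 -in usercert.p12 -nocerts -out userkey.pem            # encrypted private key
chmod 644 usercert.pem
chmod 400 userkey.pem     # globus refuses a key readable by others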

Monday, 29 October 2007

Links to monitoring pages update

I added three links to the FCR, one per experiment, with all the UK sites selected. This will hopefully make it easier to find out who has been blacklisted.

http://www.gridpp.ac.uk/wiki/Links_Monitoring_pages


I also added a GridMap link, and linked Steve's monitoring both as generic dteam and as atlas, plus the quarterly summary plots.

Friday, 19 October 2007

Sheffield latest

Trying to stabilise the Sheffield cluster.
After the scheduled power outage the nodes didn't restart properly and some of the old jobs needed to be cleaned up. After that the cluster was OK apart from the BDII dropping out. We have applied the famous Kostas patch

https://savannah.cern.ch/bugs/?16625


which is getting into the release after 1.5 years. Hurray!!!

The stability of the BDII has improved and DPM seems stable. The SAM tests were stable over the weekend, and today Steve's Atlas tests showed a 96% availability, which is a big improvement. However, the cluster filled up this morning and the instability reappeared, a sign that there is still something to fix on the worker nodes and in the scheduling. I added a reservation for ops and am looking at the WNs, some of which were re-imaged this morning.
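
The ops reservation is a standing reservation in maui.cfg, something like this (task count illustrative):

SRCFG[ops] PERIOD=INFINITY
SRCFG[ops] GROUPLIST=ops
SRCFG[ops] TASKCOUNT=2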

Thursday, 18 October 2007

Manchester availability

SAM tests, both ops and atlas, were failing due to dcache problems. Part of it was due to the fact that Judit had changed her DN and somehow the cron job to build the dcache kpwd file wasn't working. In addition to that, dcache02 had to be restarted (both core and pnfs); as usual it started to work again after that, without any apparent reason why it failed in the first place. gPlazma is not enabled yet.

Mostly that's the reason for the drop in October.
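
The restart itself, for the record (from memory, assuming dCache 1.x init scripts; names and paths may differ between versions):

/opt/d-cache/bin/dcache-core stop
# pnfs has its own init script and must be back up before the core services
/opt/d-cache/bin/dcache-core start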

Monday, 15 October 2007

Availability update

Lancs site availability looks OK for the last month at 94%, which is 13% above the GridPP average, and this includes a couple of weekends lost to dCache problems. The record from July-September has been updated on Jeremy's page. We still get the occasional failed SAM submission; no idea what causes these, but they prevent the availability from reaching the high nineties.

  • The June-July instability was a dCache issue with the pnfs mount options; this only affected SAM tests, where files were created and immediately removed.
  • The mid-August problems were due to the SL4 upgrade, caused by a few black-hole WNs. This was tracked to the jpackage repository being down, which broke the auto-install of some WNs.
  • The mid-September problems were caused by adding a new dCache pool; we are not bringing it online until the issue is understood.
Job slot occupancy looks OK, with non-HEP VOs like fusion and biomed helping to fill the slots left by moderate Atlas production.

Friday, 12 October 2007

Sys Admin Requests wiki pages

YAIM has a new wiki page for sys admin requests. Maria has sent an announcement to LCG-ROLLOUT. For bookkeeping, I added a link and explanations on the sys admin wiki wishlist page, which the ROC admins' management tools requests are also linked from.

http://www.sysadmin.hep.ac.uk/wiki/Wishlist

Tuesday, 9 October 2007

BDII doc page

After the trouble Sheffield went through with the BDII, I started a BDII page on the sysadmin wiki.

http://www.sysadmin.hep.ac.uk/wiki/BDII

Monday, 8 October 2007

Manchester RGMA fixed

Fixed RGMA in Manchester. It had, for still obscure reasons, wrong permissions on the host key files. I started an RGMA troubleshooting page on the sysadmin wiki:

http://www.sysadmin.hep.ac.uk/wiki/RGMA#RGMA
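
What the fix amounted to, roughly (a sketch; on gLite MON boxes YAIM keeps a tomcat-owned copy of the host credentials, and the exact file names may vary):

ls -l /etc/grid-security/hostkey.pem     # must be root-owned, mode 400
chmod 400 /etc/grid-security/hostkey.pem
# the servlet container's own copy must belong to the tomcat user:
chown tomcat:tomcat /etc/grid-security/tomcat-key.pem
chmod 400 /etc/grid-security/tomcat-key.pem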

EGEE '07

EGEE conference. I gave a talk in the SA1-JRA1 session, which seems to have been well received and will hopefully have some follow-up.

The talk can be found at

http://indico.cern.ch/materialDisplay.py?contribId=30&sessionId=49&materialId=slides&confId=18714


and is the sibling of the one we gave in Stockholm at the Ops workshop on problems with SA3 and within SA1.

http://indico.cern.ch/contributionDisplay.py?contribId=25&confId=12807

which had some follow-up with SA3 that can be found here:

https://savannah.cern.ch/task/?5267