Sunday 26 August 2007

Some new links about security

This article is an interesting example of how even someone with very little experience can still do some basic forensics.

http://blog.gnist.org/article.php?story=HollidayCracking

I added the link under the Forensic section on the sysadmin wiki:

http://www.sysadmin.hep.ac.uk/wiki/Basic_Security#Forensic

While I was at it, I added a firewall section to

http://www.sysadmin.hep.ac.uk/wiki/Grid_Security#Firewall_configuration_and_Services_ports

dCache Troubleshooting page

My tests on dCache started to fail for obscure reasons due to gsidcap doors misbehaving, so I started a troubleshooting page for dCache:

http://www.sysadmin.hep.ac.uk/wiki/DCache_Troubleshooting

Thursday 23 August 2007

Another biomed user banned

Manchester has banned another biomed user for filling /tmp and causing trouble for other users. I opened a ticket.

https://gus.fzk.de/ws/ticket_info.php?ticket=26147

Wednesday 22 August 2007

WNs installed with SL4

Last Wednesday was a scheduled downtime for Lancaster in order to do the SL4 upgrade on the WNs, as well as some assorted spring cleaning of other services. We can safely say this was our worst upgrade experience so far: a single day turned into a three-day downtime. Fortunately for most, this was self-inflicted pain rather than middleware issues. The fabric stuff (PXE, kickstart) went fine; our main problem was getting consistent users.conf and groups.conf files for the YAIM configuration, especially with the pool sgm/prd accounts and the dns-style VO names such as supernemo.vo.eu-egee.org. The latest YAIM 3.1 documentation provides a consistent description, but our CE still used the 3.0 version so a few tweaks were needed (YAIM 3.1 has since been released for glite 3.0).

Another issue was that our wise old CE (lcg-CE) had a lot of crust from previous installations, in particular some environment variables which affected the YAIM configuration such that the newer vo.d/ files were not considered. Finally, we needed to ensure the new sgm/prd pool groups were added to the torque ACLs; YAIM does a fine job of this should you choose to use it along with the _GROUP_ENABLE variables.
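For anyone doing the torque ACL step by hand, here is a minimal sketch of adding new sgm/prd pool groups to a queue ACL (the queue and group names are hypothetical; YAIM with the _GROUP_ENABLE variables will do the equivalent for you):

# run on the torque server; add the hypothetical pool groups to a queue ACL
qmgr -c "set queue long acl_group_enable = true"
qmgr -c "set queue long acl_groups += atlassgm"
qmgr -c "set queue long acl_groups += atlasprd"
# check the result
qmgr -c "list queue long acl_groups"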

Anyway, things look good again with many biomed jobs, some atlas, dzero and hone, and even a moderate number of lhcb jobs, which are supposed to have issues with SL4.


On the whole the YAIM configuration went well, although the VO information at CIC could still be improved with mapping requirements from VOMS groups to GIDs. LHCb provide a good example for other VOs, with explanations.

Monday 20 August 2007

Week 33 start of 34

The main goings-on have been the DPM update. This was hampered by password version problems in MySQL and was resolved with help from http://www.digitalpeer.com/id/mysql . More problems came with the change to the BDII: the new firewall ports were open, and yet there was still no data coming out.
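For reference, the usual fix for that class of MySQL problem (assuming it was the old/new password hash mismatch between a 4.1+ server and an older client; the account name and password here are placeholders) looks something like:

# downgrade the stored password hash for the affected account (user/password are placeholders)
mysql -u root -p -e "SET PASSWORD FOR 'dpmmgr'@'localhost' = OLD_PASSWORD('secret');"
# or server-wide: add old_passwords=1 under [mysqld] in /etc/my.cnf and restart mysqld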

The BDII was going to be fixed today; however, Sheffield has suffered several power cuts over the last 24 hours. This has affected the whole of the LCG here, and recovery work is ongoing.

Tuesday 14 August 2007

GOCDB3 permission denied

I can't edit NorthGrid sites anymore. I opened a ticket.

https://gus.fzk.de/pages/ticket_details.php?ticket=25846

I would be mildly curious to know if other people are experiencing the same or if I'm the only one.

Monday 13 August 2007

The Manchester MON box was overloaded again with a massive number of CLOSE_WAIT connections.

https://gus.fzk.de/pages/ticket_details.php?ticket=25647

The problem seems to have been fixed, but it affected the accounting for 2 or 3 days.
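A generic way to spot this kind of buildup (not specific to the MON box) is to count the CLOSE_WAIT sockets and see which processes hold them:

# count connections stuck in CLOSE_WAIT
netstat -tan | grep -c CLOSE_WAIT
# see which processes hold them (needs root for the -p flag)
netstat -tanp | grep CLOSE_WAIT | awk '{print $7}' | sort | uniq -c | sort -rn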

lcg_utils bug closed

The ticket about the lcg_util bugs has been answered and closed:

https://gus.fzk.de/pages/ticket_details.php?ticket=25406&from=allt

The correct versions of the rpms to install are:

[aforti@niels003 aforti]$ rpm -qa GFAL-client lcg_util
lcg_util-1.5.1-1
GFAL-client-1.9.0-2

Update (2007/08/17): The problem was incorrect dependencies expressed in the meta rpms. Maarten opened a savannah bug.

https://savannah.cern.ch/bugs/?28738



Friday 10 August 2007

Updating Glue schema

This is sort of old news, as the request to update the BDII is a month old.

To update the Glue schema you need to update the BDII on the BDII machine and on the CE and SE (dCache and classic). A DPM SE uses BDII instead of globus-mds now, so you should check the recipe for that.

The first problem I found was that

yum update glite-BDII

doesn't update the dependencies, only the meta-rpm. Apparently this works with apt-get but not with yum, so if you use yum you have three alternatives:

1) yum -y update and risk screwing up your machine
2) yum update and check each rpm
3) look at the list of rpms here

http://glite.web.cern.ch/glite/packages/R3.0/deployment/glite-BDII/3.0.2-12/glite-BDII-3.0.2-12.html

and yum update only the rpms in that list.

Reconfiguring the BDII doesn't pose a threat, so you can:

cd
./scripts/configure_node BDII_site

On the CE and SE... you can upgrade the CE and SE and reconfigure the nodes. But I didn't want to do that: you never know what might happen, and with the farm full of jobs and the SE being dCache I didn't see the point in risking it for a schema upgrade. So what follows is a simple recipe to upgrade the Glue schema on a CE and SE other than DPM without reconfiguring the nodes.

service globus-mds stop
yum update glue-schema
cd /opt/glue/schema
ln -s openldap-2.0 ldap
service globus-mds start

To check that it worked:

ps -afx -o etime,args | grep slapd

If your BDII is not on the CE and you find slapd instances on ports 2171-2173, it means you are also running a site BDII on your CE; you should turn it off and remove it from the startup services.
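If you do find a stray site BDII running on the CE, something like the following should get rid of it (assuming the init script is called bdii, as in the glite-BDII packaging):

# check for slapd listening on the site BDII ports
netstat -tlnp | grep -E ':217[1-3]'
# stop it and remove it from the startup services
service bdii stop
chkconfig bdii off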

The ldap link is needed because the schema path has changed; unless you want to edit the configuration file (/opt/globus/etc/grid-info-slapd.conf), the easiest thing is to add a symlink.

Most of this is in this ticket

https://gus.fzk.de/pages/ticket_details.php?ticket=24586&from=allt

including where to find the new schema documentation.

Thursday 9 August 2007

Documentation for Manchester local users

Yesterday, at a meeting with Manchester users who have tried to use the grid, it turned out that what they missed most was a single page collecting the links to information scattered around the world (a common disease). As a consequence, we have started pages to collect information useful to local users of the grid.

https://www.gridpp.ac.uk/wiki/Manchester


The current links are of general usefulness; users will add their own personal tips and tricks later.

Wednesday 8 August 2007

How to check accounting is working properly

Obviously, when you look at the accounting pages there is a graph at the bottom showing the running VOs, but that is not straightforward to read. Two other ways are:

The accounting enforcement page, which shows the sites that are not publishing and for how many days they haven't published:

http://www3.egee.cesga.es/acctenfor

which I linked from

https://www.gridpp.ac.uk/wiki/Links_Monitoring_pages#Accounting

or you could set up RSS feeds as suggested in the APEL FAQ.

I also created an APEL page with this information on the sysadmin wiki:

http://www.sysadmin.hep.ac.uk/wiki/Apel

Monday 6 August 2007

Progress on SL4

As part of our planned upgrade to SL4 at Manchester, we've been looking at getting dCache running.
The biggest stumbling block is the lack of a glite-SE_dcache* profile; luckily, it seems that all of the needed components apart from dcache-server are in the glite-WN profile. Even the GSIFtp door appears to work.
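In practice we expect that to mean something along these lines on a test node, though we haven't verified the exact package names yet (a sketch of the assumption, not a tested recipe):

# pull in the grid clients via the WN profile, then add the dCache server itself
yum install glite-WN
yum install dcache-server
# dCache then still needs its usual configuration (node_config, pools, doors)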

Friday 3 August 2007

Green fields of Lancaster

After sending the dCache problem the way of the dodo last week, we've been enjoying 100% SAM test passes over the past 7 days. It's nice to have to do next to nothing to fill in your weekly report. Not a very exciting week otherwise: odd jobs and maintenance here and there. Our CE has been very busy over the last week, which has caused occasional problems with the Steve Lloyd tests: we've had a few failures due to there being no job slots available, despite measures to prevent that. We'll see if we can improve things.

We're gearing up for the SL4 move. After Monday's very useful NorthGrid meeting at Sheffield we have a time frame for it: sometime during the week starting the 13th of August. We'll pin it down to an exact day at the start of the coming week. We've taken a worker offline as a guinea pig and will do hideous SL4 experimentation to it. The whole site will be in downtime from 9 to 5 on the day we do the move; with luck we won't need that long, but we intend to use the time to upgrade the whole site (no SL3 kernels will be left within our domain). Luckily for us, Manchester have offered to go first in NorthGrid, so we'll have veterans of the SL4 upgrade nearby to call on for assistance.

Thursday 2 August 2007

lcg-utils bugs

https://gus.fzk.de/pages/ticket_details.php?ticket=25406

Laptop reinstalled

EVO didn't work on my laptop. I reinstalled it with the latest version of Ubuntu and Java 1.6.0, and it works now. To my great disappointment, Facebook Aquarium still doesn't ;-)

Fixed Manchester accounting

https://www.ggus.org/pages/ticket_details.php?ticket=25215

Glue Schema 2.0 use cases

Sent two broadcasts to collect Glue Schema use cases for the new 2.0 version. Received only two replies.

https://savannah.cern.ch/task/index.php?5229

How to kill a hanging job?

There is a policy being discussed about this. See:

https://www.gridpp.ac.uk/pmb/docs/GridPP-PMB-113-Inefficient_Jobs_v1.0.doc

written by Graeme and Matt.

Part of the problem is that the user doesn't see any difference between a job that died and one that was killed by a system administrator. One of the requests is to have the job wrapper catch the signal that the standard tools send, so that an appropriate message can be returned and possibly some cleanup done as well. This last part is being discussed at the TCG.

https://savannah.cern.ch/task/index.php?5221
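As an illustration of the idea (this is not the actual gLite job wrapper), a wrapper script can trap the SIGTERM that tools like qdel send and report back before exiting:

#!/bin/bash
# minimal sketch: catch the TERM sent by the batch system's kill tools
cleanup() {
    echo "Job killed by the site (SIGTERM received), not crashed" >&2
    # any scratch-space cleanup would go here
    exit 143   # 128 + 15 (SIGTERM)
}
trap cleanup TERM
"$@" &    # run the real user payload passed as arguments
wait $!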

SE downtime

Tried to publish GlueSEStatus to fix the SE downtime problem

https://savannah.cern.ch/task/?5222

Connected to this is the GGUS ticket

https://www.ggus.org/pages/ticket_details.php?ticket=24586


which was originally opened to get a recipe for sites to upgrade the BDII in a painless way.
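To check what a site currently publishes for its SE status, a query against the site BDII along these lines should work (the hostname and site name are placeholders):

# query a site BDII for the published GlueSEStatus attribute
ldapsearch -x -H ldap://site-bdii.example.ac.uk:2170 \
    -b mds-vo-name=EXAMPLE-SITE,o=grid \
    '(objectClass=GlueSE)' GlueSEUniqueID GlueSEStatus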

VO deployment

Sent comments on the final report of the VO deployment WG to Frederic Schaer.

I wrote a report about this over a year ago:

https://mmm.cern.ch/public/archive-list/p/project-eu-egee-tcg/Why%20it%20is%20a%20problem%20adding%20VOs.-770951728.EML?Cmd=open

The comments in my email to the TCG are still valid.

I think the time estimated to find the information for a VO is too short. It takes more than 30 minutes, and normally people ask other sysadmins. I have found the CIC portal tool inadequate up to now. It would be better if the VOs themselves maintained a YAIM snapshot on the CIC portal that can be downloaded, rather than inventing a tool. In the UK that's the approach we chose in the end to avoid this problem.

http://www.gridpp.ac.uk/wiki/GridPP_approved_VOs

This is maintained by sysadmins and it covers only site-info.def. groups.conf is not maintained by anyone, but it should be: at the moment sysadmins are simply replicating the default, and when a VO like LHCb or dzero deviates from that there is trouble.
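For reference, the sort of per-VO site-info.def block that page collects looks roughly like this (the values here are invented for illustration; the real ones live on the GridPP page above):

# hypothetical site-info.def entries for the dzero VO; real values are on the GridPP wiki
VO_DZERO_SW_DIR=$VO_SW_DIR/dzero
VO_DZERO_DEFAULT_SE=$SE_HOST
VO_DZERO_STORAGE_DIR=$CLASSIC_STORAGE_DIR/dzero
VO_DZERO_VOMS_SERVERS="vomss://voms.example.org:8443/voms/dzero?/dzero/"
VO_DZERO_VOMSES="'dzero voms.example.org 15002 /DC=org/DC=example/CN=voms.example.org dzero'"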

2) YAIM didn't use to have a VO creation/deletion function that can be run per service. It reconfigures the whole service, which makes sysadmins wary of adding a VO in production in case something goes wrong in other parts. From your report this seems to still be the case.

Dashboard updated

Dashboard updated with new security advisories link
https://www.gridpp.ac.uk/wiki/Northgrid-Dashboard#Security_Advisories

Sheffield: looking back at July

July had two points of outage, the worst being at the start of the month just after a gLite 3.0 upgrade; it did take a bit of time to find the problem and the solution.

The error messages were:

/opt/glue/schema/ldap/Glue-CORE.schema: No such file or directory
ldap_bind: Can't contact LDAP server

The solution was found here: http://wiki.grid.cyfronet.pl/Pre-production/CYF-PPS-gLite3.0.2-UPDATE33
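That looks like the same schema-path change described in the Glue schema entry above (10 August); presumably the fix boils down to the same symlink, assuming the openldap-2.0 directory is present on the node:

cd /opt/glue/schema
ln -s openldap-2.0 ldap
service globus-mds restart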

At the end of the month we had a strange error that was spotted quickly and turned out to be the result of a DNS server crash on the LCG here at Sheffield, which left the worker nodes' IPs unresolved.

Sheffield hosted the monthly NorthGrid meeting, and all in all it was a good event.

Yesterday the LCG got its own dedicated 1 Gb link to YHMAN and beyond; we also now have our own firewall, which will make changes quicker and easier.

Fun at Manchester: SL4, lcg_util and pbs

In the midst of getting a successful upgrade-to-SL4 profile working, we upgraded our release of lcg_util from 1.3.7-5 to 1.5.1-1. This proved to be unsuccessful: SAM test failures galore. After looking around for a solution on the internet I settled for rolling back to the previous version; thanks to the wonders of cfengine this didn't take long, and happily cfengine should be forcing that version on all nodes.
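For the record, a manual rollback of that kind (outside of cfengine) is just an rpm downgrade; the exact rpm filename here is illustrative:

# downgrade lcg_util by hand; cfengine then keeps the old version pinned
rpm -Uvh --oldpackage lcg_util-1.3.7-5.i386.rpm
rpm -q lcg_util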

This morning I came in to find we were again failing the SAM tests, this time with the ever-so-helpful
"Cannot plan: BrokerHelper: no compatible resources"


This pointed to a problem deep in the depths of the batch system. Looking at our queues (via showq), there were a lot of idle jobs yet more than enough CPUs. The PBS logs revealed a new error message,
"Cannot execute at specified host because of checkpoint or stagein files"
for two of the jobs. Eventually I managed to track it down to a node. Seeing as there wasn't any sign of the job file anymore, and pbs was refusing to re-run the job on another node, I had to resort to the trusty `qdel`. After thinking about it for the barest of moments, all of the idle jobs woke up and started running.
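For anyone chasing the same symptoms, the rough sequence was along these lines (the job id and hostnames are hypothetical):

# see which node the stuck job was assigned to and what pbs thinks happened
tracejob 123456.ce.example.ac.uk
qstat -f 123456.ce.example.ac.uk | grep exec_host
# with the job file gone and a re-run refused, delete the offending job
qdel 123456.ce.example.ac.uk    # add -p to purge if the server won't let go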

Just for some gratuitous cross-linking, Steve Traylen appears to have provided a solution over at the ScotGrid blog.