Northgrid-tech: NorthGrid


Monday, 15 December 2008

groups.conf syntax

Elena asked about it a few days ago on TB-SUPPORT. Today I investigated a bit further, and the result is that for glite-yaim-core versions >4.0.4-1:

* The syntax with VO= and GROUP= still works but is obsolete. The new syntax is much simpler, as it directly uses the FQANs as reported in the VO cards (if they are maintained).

* The syntax in /opt/glite/yaim/examples/groups.conf.example is correct, and the files in the directory are kept up to date with the correct syntax, although the examples might not be valid.

* Further information can be found either in

/opt/glite/examples/groups.conf.README

or

https://twiki.cern.ch/twiki/bin/view/LCG/YaimGuide400#Group_configuration_in_YAIM

which is worth reviewing periodically for changes.
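As a rough illustration of the difference between the two syntaxes, the sketch below rewrites old-style entries into plain FQANs. It assumes that an old-style line looks like "/VO=atlas/GROUP=/atlas/ROLE=lcgadmin":::sgm: and that its new-style equivalent is "/atlas/ROLE=lcgadmin":::sgm: (check both against groups.conf.README for your YAIM version). The function name groups_conf_to_fqan is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper, not part of YAIM.
# Reads groups.conf lines on stdin and strips the obsolete
# "/VO=<vo>/GROUP=" prefix, leaving the bare FQAN.
# Lines already in the new syntax pass through unchanged.
groups_conf_to_fqan() {
    sed 's#^"/VO=[^/]*/GROUP=#"#'
}

# Example (writes to a new file, leaving the original untouched):
#   groups_conf_to_fqan < groups.conf > groups.conf.new
```

Compare the result with the original before swapping the files, since the exact field layout is an assumption here.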

Tuesday, 25 November 2008

Regional nagios update

I reinstalled the regional nagios with Nagios3 and it works now.

https://niels004.tier2.hep.manchester.ac.uk/nagios

As suggested by Steve I'm also trying the nagios checker plugin

https://addons.mozilla.org/en-US/firefox/addon/3607

instead of the email notification, but I still have to configure things properly. At the moment Firefox makes some noise every ~30 seconds, and there is also a visual alert in the bottom right corner of the Firefox window with the number of services in critical state, which expands to show the services when the cursor points at it. Really nice. :)

Thursday, 20 November 2008

WMS talk

I gave a talk about the WMS for the benefit of the Manchester users. It might be of interest to other people.

The talk can be found here:

WMS Overview

Friday, 7 November 2008

Regional nagios

I installed a regional nagios yesterday. It turned out to be quite easy, and the nagios group was quite helpful. I followed the tutorial given by Steve at EGEE08

https://twiki.cern.ch/twiki/bin/view/EGEE/GridMonitoringNcgYaimTutorial

I updated it while I was going along instead of writing a parallel document.

Below is the URL of the test installation. It might get reinstalled a few times in the next few days to test other features.

https://niels004.tier2.hep.manchester.ac.uk/nagios

Monday, 4 February 2008

I hate java!

#$%^&*@!!!!!!

Tuesday, 13 November 2007

Some SAM tests don't respect downtime

Sheffield is shown to fail the CE-host-cert-valid test while in downtime. SAM tests should all behave the same. This is on top of the very confusing display of the results in alternate lines. I opened a ticket.

https://gus.fzk.de/ws/ticket_info.php?ticket=28983

Tuesday, 30 October 2007

user certificates: p12 to pem

Since I was renewing my certificate, I added a small script (p12topem.sh) to the subversion repository to convert users' p12 certificates into pem format and set their unix permissions correctly. I linked it from here:

https://www.sysadmin.hep.ac.uk/wiki/CA_Certificates_Maintenance

It assumes $HOME/.globus/user*.pem names. It therefore doesn't handle host certificates, but it could easily be extended.
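For readers without access to the repository, the conversion can be sketched roughly as follows. This is not the p12topem.sh script itself, only a minimal illustration using standard openssl pkcs12 options; the function names, the usercert.pem/userkey.pem file names and the 644/400 permissions are assumptions based on the usual Globus conventions.

```shell
#!/bin/sh
# Hypothetical sketch, not the p12topem.sh from the repository.

# Set the permissions grid tools usually expect:
# certificate world-readable, private key readable by the owner only.
set_globus_perms() {
    chmod 644 "$1/usercert.pem" && chmod 400 "$1/userkey.pem"
}

# Usage: p12_to_pem mycert.p12 [output_dir]
# openssl prompts for the import password and for a new key pass phrase.
p12_to_pem() {
    p12_file=$1
    out_dir=${2:-$HOME/.globus}
    mkdir -p "$out_dir" || return 1
    # Extract the certificate only, without the private key.
    openssl pkcs12 -in "$p12_file" -clcerts -nokeys \
        -out "$out_dir/usercert.pem" || return 1
    # Extract the private key only; the restrictive umask keeps a newly
    # created key file unreadable by others from the start.
    ( umask 077
      openssl pkcs12 -in "$p12_file" -nocerts \
          -out "$out_dir/userkey.pem" ) || return 1
    set_globus_perms "$out_dir"
}
```

If an old read-only userkey.pem is already in place, move it aside first, otherwise openssl will not be able to overwrite it.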

Monday, 29 October 2007

Links to monitoring pages update

I added three links to the FCR, one per experiment, with all the UK sites selected. Hopefully this will make it easier to find out who has been blacklisted.

http://www.gridpp.ac.uk/wiki/Links_Monitoring_pages

I also added a GridMap link and linked Steve's monitoring, both as generic dteam and atlas, plus the quarterly summary plots.

Monday, 15 October 2007

Availability update

Lancs site availability looks OK for the last month at 94%, which is 13% above the GridPP average, and this includes a couple of weekends lost due to dCache problems. The record from July-September has been updated on Jeremy's page. We still get the occasional failed SAM submission. I have no idea what causes these, but they keep the availability from reaching the high nineties.

* The June-July instability was a dCache issue with the pnfs mount options. This only affected SAM tests where files were created and immediately removed.
* The mid-August problems came from the SL4 upgrade and were caused by a few blackhole WNs. This was tracked to the jpackage repository being down, which screwed with the auto-install of some WNs.
* The mid-September problems were caused by adding a new dCache pool. We are not bringing it online until the issue is understood.

Job slot occupancy looks OK, with non-HEP VOs like fusion and biomed helping to fill slots left by moderate Atlas production.

Friday, 12 October 2007

Sys Admin Requests wiki pages

YAIM has a new wiki page for sys admin requests. Maria has sent an announcement to LCG-ROLLOUT. For bookkeeping, I added a link and explanations to the sys admin wiki wishlist page, which also links to the ROCs admin management tools requests.

http://www.sysadmin.hep.ac.uk/wiki/Wishlist

Tuesday, 9 October 2007

BDII doc page

After the trouble Sheffield went through with the BDII, I started a BDII page in the sysadmin wiki.

http://www.sysadmin.hep.ac.uk/wiki/BDII

Monday, 8 October 2007

Manchester RGMA fixed

Fixed RGMA in Manchester. It had, for still obscure reasons, wrong permissions on the host key files. Started an RGMA troubleshooting page on the sysadmin wiki:

http://www.sysadmin.hep.ac.uk/wiki/RGMA#RGMA

EGEE '07

EGEE conference. I gave a talk in the SA1-JRA1 session, which seems to have had a positive result and will hopefully have some follow-up.

The talk can be found at

http://indico.cern.ch/materialDisplay.py?contribId=30&sessionId=49&materialId=slides&confId=18714

and is the sibling of the one we gave in Stockholm at the Ops workshop on problems with SA3 and within SA1.

http://indico.cern.ch/contributionDisplay.py?contribId=25&confId=12807

which had some follow up with SA3 that can be found here

https://savannah.cern.ch/task/?5267

Sunday, 26 August 2007

Some new links about security

This article is an interesting example of how even someone with very little experience can still do some basic forensics.

http://blog.gnist.org/article.php?story=HollidayCracking

I added the link under the forensic section on the sys admin wiki

http://www.sysadmin.hep.ac.uk/wiki/Basic_Security#Forensic

While I was at it, I added a firewall section to

http://www.sysadmin.hep.ac.uk/wiki/Grid_Security#Firewall_configuration_and_Services_ports

Dcache Troubleshooting page

My tests on dcache started to fail for obscure reasons due to gsidcap doors misbehaving. I started a troubleshooting page for dcache

http://www.sysadmin.hep.ac.uk/wiki/DCache_Troubleshooting

Tuesday, 14 August 2007

GOCDB3 permission denied

I can't edit NorthGrid sites anymore. I opened a ticket.

https://gus.fzk.de/pages/ticket_details.php?ticket=25846

I would be mildly curious to know if other people are experiencing the same or if I'm the only one.

Monday, 13 August 2007

lcg_utils bug closed

The ticket about the lcg_util bugs has been answered and closed

https://gus.fzk.de/pages/ticket_details.php?ticket=25406&from=allt

The correct versions of the rpms to install are

[aforti@niels003 aforti]$ rpm -qa GFAL-client lcg_util
lcg_util-1.5.1-1
GFAL-client-1.9.0-2

Update (2007/08/17): The problem was incorrect dependencies expressed in the meta rpms. Maarten opened a savannah bug.

https://savannah.cern.ch/bugs/?28738

Friday, 10 August 2007

Updating Glue schema

This is sort of old news, as the request to update the BDII is a month old.

To update the Glue schema you need to update the BDII on the BDII machine and on the CE and SE (dcache and classic). DPM SE uses BDII instead of globus-mds now so you should check the recipe for that.

The first problem I found was that

yum update glite-BDII

doesn't update the dependencies but only the meta-rpm. Apparently it works for apt-get but not for yum. So if you use yum you have 3 alternatives:

1) yum -y update and risk screwing up your machine
2) yum update and check each rpm
3) Look up the list of rpms here and update them by name

http://glite.web.cern.ch/glite/packages/R3.0/deployment/glite-BDII/3.0.2-12/glite-BDII-3.0.2-12.html

yum update
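A possible variant of the third alternative is to ask rpm what the (already updated) meta-rpm requires and pass those names to yum. The sketch below is untested against a real glite-BDII install: meta_rpm_deps is a made-up helper, and it assumes the meta-rpm lists its dependencies as plain package names, optionally followed by a version constraint.

```shell
#!/bin/sh
# Hypothetical helper: reads `rpm -q --requires <pkg>` output on stdin and
# prints one package name per line, dropping rpmlib(...) capabilities and
# file dependencies such as /bin/sh.
meta_rpm_deps() {
    grep -v -e '^rpmlib(' -e '^/' | awk 'NF { print $1 }' | sort -u
}

# Example (review the list before letting yum proceed):
#   rpm -q --requires glite-BDII | meta_rpm_deps
#   yum update $(rpm -q --requires glite-BDII | meta_rpm_deps)
```

Running the first command on its own shows what would be updated, which keeps this closer to alternative 2 than to a blind yum -y update.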

Reconfiguring the BDII doesn't pose a threat so you can

cd
./scripts/configure_node BDII_site

On the CE and SE you can upgrade the packages and reconfigure the nodes. But I didn't want to do that, because you never know what might happen, and with the farm full of jobs and the SE being dcache I don't see the point of risking it for a schema upgrade. So what follows is a simple recipe to upgrade the glue schema on CE and SE (other than DPM) without reconfiguring the nodes.

service globus-mds stop
yum update glue-schema
cd /opt/glue/schema
ln -s openldap-2.0 ldap
service globus-mds start

To check that it worked:

ps -afx -o etime,args | grep slapd

If your BDII is not on the CE and you find slapd instances on ports 2171-2173, it means you are also running a site BDII on your CE, and you should turn it off and remove it from the startup services.

The ldap link is needed because the schema path has changed, and unless you want to edit the configuration file (/opt/globus/etc/grid-info-slapd.conf) the easiest thing is to add a link.
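The check for stray site BDII instances can be narrowed down with a small filter like the one below. It is only a sketch: site_bdii_slapd is a made-up name, and it assumes the listening port shows up in the slapd command line as ":2171", ":2172" or ":2173" (for example in an ldap:// URL passed with -h).

```shell
#!/bin/sh
# Hypothetical filter: reads ps output on stdin and prints only the slapd
# processes whose arguments mention one of the ports 2171-2173.
site_bdii_slapd() {
    grep 'slapd' | grep -E ':217[123]([^0-9]|$)'
}

# Example:
#   ps -e -o etime,args | site_bdii_slapd
```

No output means no slapd was found on those ports. The globus-mds slapd should still appear in the unfiltered ps listing.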

Most of this is in this ticket

https://gus.fzk.de/pages/ticket_details.php?ticket=24586&from=allt

including where to find the new schema documentation.

Wednesday, 8 August 2007

How to check accounting is working properly

Obviously, when you look at the accounting pages there is a graph at the bottom showing running VOs, but that is not straightforward. Two other ways are:

The accounting enforcement page shows sites that are not publishing and for how many days they haven't published.

http://www3.egee.cesga.es/acctenfor

which I linked from

https://www.gridpp.ac.uk/wiki/Links_Monitoring_pages#Accounting

or you could set up RSS feeds as suggested in the Apel FAQ.

I also created an Apel page with this information on the sysadmin wiki

http://www.sysadmin.hep.ac.uk/wiki/Apel

Thursday, 2 August 2007

lcg-utils bugs

https://gus.fzk.de/pages/ticket_details.php?ticket=25406