<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-4670756400590062347</id><updated>2012-02-07T07:31:04.879Z</updated><category term='atlas file system tweak xfs sl5 jobs efficiency manchester'/><category term='SL4'/><category term='NorthGrid'/><category term='apel nagios alert remove'/><category term='manchester dpm optimization'/><category term='atlas athena'/><category term='64bit'/><category term='manchester file systems worker nodes evaluation'/><category term='Security'/><category term='Lancaster SL4'/><category term='Manchester'/><category term='jobmanager'/><category term='file system xfs sl5 read ahead tweak kernel liverpool'/><category term='machine room'/><category term='Sheffield'/><category term='fabric'/><category term='manchester cvmfs upgrade'/><category term='Availability'/><category term='manchester dpm upgrade 1.7.4 to 1.8.2 and mysql 5.5'/><category term='Liverpool'/><category term='manchester new hardware computing nodes storage'/><category term='Lancaster'/><category term='manchester scripts system administration monitoring sharing'/><category term='manchester bdii'/><category term='Manchester HammerCloud results'/><category term='firewall'/><category term='atlas'/><category term='Manchester BDII CE'/><category term='manchester apel sl5 glite installation'/><category term='manchester squid mrtg snmp atlas monitoring'/><category term='manchester cvmfs installation'/><category term='VOMS'/><title type='text'>Northgrid-tech</title><subtitle type='html'></subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/posts/default'/><link 
rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><link rel='next' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default?start-index=101&amp;max-results=100'/><author><name>Alessandra Forti</name><uri>http://www.blogger.com/profile/11973932320387024088</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://4.bp.blogspot.com/_nB3QAxoLs5o/SUfJ_bcRvXI/AAAAAAAAARk/nlOg9eaXhaY/S220/patyten_seaOttersSwim.jpg'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>134</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-5499049933033260333</id><published>2012-01-14T12:09:00.004Z</published><updated>2012-01-15T08:51:12.204Z</updated><title type='text'>DPM database file systems synchronization</title><content type='html'>The synchronisation of the DPM database with the data servers' file systems has been a long-standing issue.&amp;nbsp; Last week we had a crash that made it more imperative to check all the files, and I eventually wrote a bash script that makes use of the &lt;a href="https://www.gridpp.ac.uk/wiki/DPM-admin-tools#GridPP_DPM_administration_toolkit"&gt;GridPP DPM admin tools&lt;/a&gt;. I don't think this should be the final version, but I'm quicker with bash than with python and therefore I started with that. Hopefully later in the year I'll have more time to write a cleaner version in python, based on this one, that can be inserted in the admin tools. 
It does the following:&lt;br /&gt;&lt;br /&gt;1) Create a list of files that are in the DB but not on disk&lt;br /&gt;2) Create a list of files that are on disk but not in the DB&lt;br /&gt;3) Create a list of SURLs to declare lost from the list of files in the DB but not on disk (this is mostly for atlas but could be used by LFC administrators for other VOs)&lt;br /&gt;4) If not in dry run mode, proceed to delete the orphan files and the orphan entries in the DB&lt;br /&gt;5) Print stats of how many files were in either list&lt;br /&gt;&lt;br /&gt;Although I put in a few protections, this script should be run with care and, &lt;b&gt;unless in dry run mode&lt;/b&gt;, shouldn't be run automatically &lt;b&gt;AT ALL&lt;/b&gt;. In dry run mode, however, it will tell you how many files are lost, which is a good metric to monitor regularly as well as when there is a big crash.&lt;br /&gt;&lt;br /&gt;If you want to run it, it has to run on the data servers, where there is access to the file system. As it is now, it requires a modified version of /opt/lcg/etc/DPMINFO that points to the head node rather than localhost, because one of the admin tools used does a direct mysql query. For the same reason it also requires the &lt;b&gt;dpminfo user&lt;/b&gt; to have mysql select privileges from the data servers. This is the part that could really benefit from a rewrite in python and perhaps proper use of the API, as the other tool does. I also had to heavily parse the output of the tools, which weren't created exactly for this purpose, and this could also be avoided in a python script. 
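&lt;br /&gt;&lt;br /&gt;The comparison behind steps 1 and 2 can be sketched in a few lines of bash. This is only a minimal sketch and not the script linked at the end of this post: the function name compare_lists and its two input files (one path per line, one list taken from the DB and one from the file system) are hypothetical, and producing those lists is left out.&lt;br /&gt;&lt;pre&gt;
```shell
#!/bin/bash
# Sketch only: compare_lists and its arguments are hypothetical names.
# $1 = file listing the paths known to the DB, one per line
# $2 = file listing the paths found on disk, one per line
compare_lists() {
    local db_list=$1 disk_list=$2 workdir
    workdir=$(mktemp -d) || return 1
    # comm needs sorted input, so sort both lists into scratch files first
    sort -u -o "$workdir/db" "$db_list"
    sort -u -o "$workdir/disk" "$disk_list"
    echo "# in DB but not on disk"
    comm -23 "$workdir/db" "$workdir/disk"
    echo "# on disk but not in DB"
    comm -13 "$workdir/db" "$workdir/disk"
    rm -rf "$workdir"
}
```
&lt;/pre&gt;Because it only reads the two lists and prints the differences, a sketch like this corresponds to the dry run mode: nothing is deleted.&lt;br /&gt;&lt;br /&gt;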
There are no command line options, but all the variables that could be options to customize the script with your local settings (head node, fs mount point, dry_run) are easily found at the top.&lt;br /&gt;&lt;br /&gt;Creating the lists takes very little time, no more than 3 minutes on my system, but it depends mostly on how busy your head node is.&lt;br /&gt;&lt;br /&gt;If you want to do a cleanup instead, the time is proportional to how many files have been lost and can be several hours, since the script does one DB operation per file. The time to delete the orphan files also depends on how many and how big they are, but it should take less than the DB cleanup.&lt;br /&gt;&lt;br /&gt;The script is here: &lt;a href="http://www.sysadmin.hep.ac.uk/svn/fabric-management/dpm/dpm-synchronise-disk-db.sh"&gt;http://www.sysadmin.hep.ac.uk/svn/fabric-management/dpm/dpm-synchronise-disk-db.sh&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-5499049933033260333?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/5499049933033260333/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=5499049933033260333' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/5499049933033260333'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/5499049933033260333'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2012/01/dpm-database-file-systems.html' title='DPM database file systems synchronization'/><author><name>Alessandra Forti</name><uri>http://www.blogger.com/profile/11973932320387024088</uri><email>noreply@blogger.com</email><gd:image 
rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://4.bp.blogspot.com/_nB3QAxoLs5o/SUfJ_bcRvXI/AAAAAAAAARk/nlOg9eaXhaY/S220/patyten_seaOttersSwim.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-755506275777766572</id><published>2011-11-30T11:05:00.032Z</published><updated>2011-12-01T11:21:48.805Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='manchester dpm upgrade 1.7.4 to 1.8.2 and mysql 5.5'/><title type='text'>DPM upgrade 1.7.4 -&gt; 1.8.2 (glite 3.2)</title><content type='html'>Last week I upgraded our DPM installation. It was a major change because I upgraded not only the DPM version but also the hardware and the backend mysql version.&lt;br /&gt;&lt;br /&gt;I didn't take any measurements before and after this time. I knew that becoming an alpha site in atlas was taking its toll on the old hardware, and many of the timeouts were from gridftp, but the mysql ones I talked about in &lt;a href="http://northgrid-tech.blogspot.com/2011/06/dpm-optimization-next-round.html"&gt;previous posts&lt;/a&gt; had reappeared, to the point that even restarting the service was hard.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:100%;" &gt;[ ~]# service mysqld restart &lt;/span&gt; &lt;span style="font-weight: bold;font-size:100%;" &gt;&lt;br /&gt;Timeout error occurred trying to stop MySQL Daemon. &lt;/span&gt; &lt;span style="font-weight: bold;font-size:100%;" &gt;&lt;br /&gt;Stopping MySQL:                                            [FAILED] &lt;/span&gt; &lt;span style="font-weight: bold;font-size:100%;" &gt;&lt;br /&gt;Timeout error occurred trying to start MySQL Daemon.  
&lt;/span&gt;&lt;span style="font-size:100%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;So I decided that the situation had become unsustainable and it was time to move to better hardware and software versions.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:100%;"&gt;&lt;span style="font-weight: bold;"&gt;* Hardware:&lt;/span&gt; 2 cpu, 4GB mem, 2x250 GB raid1 -&amp;gt; 4 cores (HT on = 8 job slots), 24GB mem, 2x2TB raid1&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;No need to ask why here: the old machine was OK when we had limited access, but the recent load was really too much for it even with all the tuning. Bad blocks on the disks are a possibility, but the machine showed no red LEDs and reported no hardware errors.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:100%;"&gt;&lt;span style="font-weight: bold;"&gt;* Mysql: &lt;/span&gt;5.0.77 -&amp;gt; 5.5.10 &lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Why mysql 5.5? Because InnoDB is the default engine and they have improved performance and instrumentation, on top of other things that we might actually start to use. A good blog article about the 5 reasons to move is this one: &lt;a href="http://ronaldbradford.com/blog/five-reasons-to-upgrade-to-mysql-5-5-2010-12-15/"&gt;5 good reasons to upgrade to mysql 5.5&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;MySQL 5.5 is not in EPEL yet, but I found this CentOS community site that has the &lt;a href="http://www.webtatic.com/packages/mysql55/"&gt;rpms and the instructions to install them&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;After the installation I also optimized the database, partly with what I had already &lt;a href="http://northgrid-tech.blogspot.com/2011/06/dpm-optimization-next-round.html"&gt;done in July&lt;/a&gt; and partly by running a handy script, &lt;a href="http://www.techerator.com/2011/08/optimize-your-mysql-server-with-the-mysql-tuner-script/"&gt;mysqltuner.pl&lt;/a&gt;.  
The latter helps with variables you might not even know about, and even if you know them it tells you whether they are too small. You need to be patient and let a few hours pass before running it again.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:100%;"&gt;&lt;span style="font-weight: bold;"&gt;* DPM:&lt;/span&gt; 1.7.4 -&amp;gt; 1.8.2&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Why DPM 1.8.2 from glite 3.2? I would have gone for the UMD release or even the EMI one, but glite 3.2 was moved to production earlier than those, and since I had been waiting for this release since at least April I didn't think twice when I saw the escape route. It was really good timing too, as it happened when I really couldn't postpone an upgrade anymore. You can find more info in the &lt;a href="http://glite.cern.ch/R3.2/sl5_x86_64/glite-SE_dpm_mysql/1.8.2-3/"&gt;release notes&lt;/a&gt;. Among other reasons to upgrade: &lt;a href="https://savannah.cern.ch/bugs/?71041"&gt;srmv2.2 in 1.7.4 has a memory leak&lt;/a&gt;, which wasn't noticeable while the load was contained but exploded for us in October and is the reason I had to restart it every two days in the past few weeks.&lt;br /&gt;&lt;br /&gt;Below are the steps I took to reinstall the head node.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;On the old head node&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;* Set the site in downtime, drain the queues and kill all the remaining jobs.&lt;br /&gt;&lt;br /&gt;* Turn off all the dpm and bdii services on the old head node&lt;br /&gt;&lt;br /&gt;* Make a dump of the current database for backup&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:100%;" &gt;mysqldump -C -Q -u root -p -B dpm_db cns_db &amp;gt; dpm.sql-20111125.gz&lt;/span&gt;&lt;span style="font-size:85%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;* Download dpm-drop-requests-tables.sql supplied by Jean Philippe last July&lt;br /&gt;&lt;br /&gt;&lt;span 
style="font-weight: bold;font-size:100%;" &gt;wget http://www.sysadmin.hep.ac.uk/svn/fabric-management/dpm/dpm-drop-requests-tables.sql&lt;/span&gt;&lt;span style="font-size:100%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;* Drop the requests tables. This step is really useful to avoid painful reload times, as I said in &lt;a href="http://northgrid-tech.blogspot.com/2011/06/dpm-optimization.html"&gt;this other post about DPM optimization&lt;/a&gt;, and because it drastically reduces the size of ibdata1 when you reload, which also has benefits (my ibdata1 was reduced from 26GB to 1.7GB). Still, you need to plan for it because it might take a few hours depending on the system. On my old hardware it took around 7 hours.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:100%;" &gt;mysql -p &amp;lt; dpm-drop-requests-tables.sql  &lt;/span&gt;&lt;span style="font-size:100%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;* Dump the reduced version of the database  &lt;span style="font-weight: bold;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:100%;"&gt;mysqldump -C -Q -u root -p -B dpm_db cns_db &amp;gt; dpm.sql-20111125-v2.gz&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;* Copy both to a web server from which they can be downloaded at a later stage.&lt;br /&gt;&lt;br /&gt;* Update the local repository for the DPM head node and the DPM disk servers. Since it is still glite I just had to rsync the latest mirror to the static area.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;&lt;span style="font-weight: bold;"&gt;On the new head node&lt;/span&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;* Install the new machines with a DPM head node profile. This was again easy: since it is still glite, no changes were required in cfengine.&lt;br /&gt;&lt;br /&gt;* Most of the following is not standard and I put it in a script. 
If you have problems with user IDs created by the &lt;span style="font-weight: bold;"&gt;avahi&lt;/span&gt; packages, you can uninstall them with yum, removing all the dependencies, and let them be reinstalled by the bdii dependency chain. It should also work to uninstall them with &lt;span style="font-weight: bold;"&gt;rpm -e --nodeps&lt;/span&gt;. This leaves &lt;span style="font-weight: bold;"&gt;redhat-lsb&lt;/span&gt; (which is what the bdii depends on) untouched, but I haven't tried this last method. Here are the commands I executed:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:100%;"&gt;&lt;span style="font-weight: bold;"&gt;# Get the dpm DB file&lt;/span&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;rm -rf dpm.sql-20111125-v2.gz*&lt;/span&gt; &lt;span style="font-weight: bold;"&gt;&lt;br /&gt;wget http://ks.tier2.hep.manchester.ac.uk/T2/tmp/dpm.sql-20111125-v2.gz&lt;/span&gt;  &lt;span style="font-weight: bold;"&gt;&lt;br /&gt;&lt;br /&gt;# Install mysql5.5&lt;br /&gt;&lt;/span&gt; &lt;span style="font-weight: bold;"&gt;rpm -Uvh http://repo.webtatic.com/yum/centos/5/latest.rpm&lt;/span&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;yum -y remove libmysqlclient5 mysql mysql-*&lt;/span&gt; &lt;span style="font-weight: bold;"&gt;&lt;br /&gt;yum -y clean all&lt;/span&gt; &lt;span style="font-weight: bold;"&gt;&lt;br /&gt;yum -y install mysql55 mysql55-server libmysqlclient5 --enablerepo=webtatic&lt;/span&gt; &lt;span style="font-weight: bold;"&gt;&lt;br /&gt;service mysqld stop&lt;/span&gt; &lt;span style="font-weight: bold;"&gt;&lt;br /&gt;rm -rf /var/lib/mysql/*&lt;/span&gt; &lt;span style="font-weight: bold;"&gt;&lt;br /&gt;# Get the local my.cnf&lt;br /&gt;cfagent -vq&lt;/span&gt;&lt;span style="font-weight: bold;"&gt;&lt;br /&gt;service mysqld start&lt;/span&gt;&lt;span style="font-weight: bold;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="font-weight: bold;"&gt;&lt;br /&gt;# Install the DPM rpms&lt;br /&gt;&lt;/span&gt;&lt;span 
style="font-weight: bold;"&gt;yum -y remove cups avahi avahi-compat-libdns_sd avahi-glib &lt;/span&gt;&lt;span style="font-weight: bold;"&gt;&lt;br /&gt;yum -y install glite-SE_dpm_mysql lcg-CA&lt;/span&gt;  &lt;span style="font-weight: bold;"&gt;&lt;br /&gt;&lt;br /&gt;# Modify sql scripts for mysql5.5&lt;/span&gt; &lt;span style="font-weight: bold;"&gt;&lt;br /&gt;cd /opt/lcg/share/DPM/&lt;/span&gt; &lt;span style="font-weight: bold;"&gt;&lt;br /&gt;for a in create_dp*.sql; do sed -i.old 's/TYPE/ENGINE/g' "$a"; done&lt;/span&gt; &lt;span style="font-weight: bold;"&gt;&lt;br /&gt;grep ENGINE *&lt;/span&gt;  &lt;span style="font-weight: bold;"&gt;&lt;br /&gt;&lt;br /&gt;# Run YAIM and load the old DB &lt;/span&gt; &lt;span style="font-weight: bold;"&gt;&lt;br /&gt;cd&lt;/span&gt;&lt;span style="font-weight: bold;"&gt;&lt;br /&gt;/opt/glite/yaim/bin/yaim -c -s /opt/glite/yaim/etc/site-info.def -n glite-SE_dpm_mysql&lt;/span&gt;  &lt;span style="font-weight: bold;"&gt;&lt;br /&gt;&lt;br /&gt;mysql -u root -p -C &amp;lt; /root/dpm.sql-20111125-v2.gz &lt;/span&gt;  &lt;span style="font-weight: bold;"&gt;&lt;br /&gt;&lt;br /&gt;# NECESSARY FOR THE FINAL UPDATES&lt;/span&gt; &lt;span style="font-weight: bold;"&gt;&lt;br /&gt;/opt/glite/yaim/bin/yaim -c -s /opt/glite/yaim/etc/site-info.def -n glite-SE_dpm_mysql&lt;/span&gt;  &lt;/span&gt;&lt;br /&gt;&lt;br /&gt;* You will need to install the dpm-contrib-admintool rpm separately because it is not in the glite repository; it might be in the EMI one. Last time I heard, it had made it to ETICS. 
If you can't find it, there's still the &lt;a href="http://www.sysadmin.hep.ac.uk/rpms/fabric-management/RPMS.storage/"&gt;sysadmin repo version&lt;/a&gt; and related notes on the &lt;a href="https://www.gridpp.ac.uk/wiki/DPM-admin-tools#GridPP_DPM_administration_toolkit"&gt;GridPP wiki&lt;/a&gt; (Sam or Wahid are welcome to leave an update on this one).&lt;br /&gt;&lt;br /&gt;* To upgrade the disk servers I just updated the repository, upgraded the rpms and reran yaim.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-755506275777766572?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/755506275777766572/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=755506275777766572' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/755506275777766572'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/755506275777766572'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2011/11/dpm-upgrade-174-182-glite-32.html' title='DPM upgrade 1.7.4 -&gt; 1.8.2 (glite 3.2)'/><author><name>Alessandra Forti</name><uri>http://www.blogger.com/profile/11973932320387024088</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://4.bp.blogspot.com/_nB3QAxoLs5o/SUfJ_bcRvXI/AAAAAAAAARk/nlOg9eaXhaY/S220/patyten_seaOttersSwim.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-779340858148694419</id><published>2011-09-28T14:40:00.006Z</published><updated>2011-09-28T14:50:46.547Z</updated><title 
type='text'>10 Years Of GridPP: I was there. And you?</title><content type='html'>&lt;a href="http://www.gridpp.ac.uk/gridpp1/" title="GridPP 1"&gt;&lt;img src="http://www.gridpp.ac.uk/pics/gridpp-group.jpg" width="500" height="375" alt="Door To Madness"&gt;&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-779340858148694419?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/779340858148694419/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=779340858148694419' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/779340858148694419'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/779340858148694419'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2011/09/10-years-of-gridpp-i-was-there-and-you.html' title='10 Years Of GridPP: I was there. 
And you?'/><author><name>Alessandra Forti</name><uri>http://www.blogger.com/profile/11973932320387024088</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://4.bp.blogspot.com/_nB3QAxoLs5o/SUfJ_bcRvXI/AAAAAAAAARk/nlOg9eaXhaY/S220/patyten_seaOttersSwim.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-2305746187189553152</id><published>2011-09-09T09:21:00.014Z</published><updated>2011-09-09T12:06:36.884Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='manchester cvmfs upgrade'/><title type='text'>cvmfs upgrade to 2.0.3</title><content type='html'>Last week I upgraded cvmfs on all the WNs to cvmfs-2.0.3. For us the upgrade required two steps.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;1) change of repository:&lt;/span&gt; since Manchester was the first to use the new atlas setup, we were pointing to the CERN repository. The new setup has now become standard, so I just had to remove the override variable CVMFS_SERVER_URL from atlas.cern.ch.local. The file is distributed by cfengine so I just changed it in cvs. &lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;2) rpms upgrade:&lt;/span&gt; I had some initial difficulties because I was following the instructions for atlas T3 - which normally work also for T2 - that suggested installing the &lt;span style="font-weight:bold;"&gt;cvmfs-auto-setup&lt;/span&gt; rpm. This rpm runs &lt;span style="font-weight:bold;"&gt;service cvmfs restartautofs&lt;/span&gt;, and the instructions also suggested rerunning it manually. On busy machines this causes the repositories to disappear and requires a &lt;span style="font-weight:bold;"&gt;service cvmfs restartclean&lt;/span&gt;, which wipes the cache and is not really recommended in production. 
In reality none of this is really necessary and a simple&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;yum -y update cvmfs cvmfs-init-scripts&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;is sufficient. Adding the rpm versions in cfengine was enough. The change from one version to another happens at the first unmount. Forcing this with a restartautofs is counterproductive (thanks to Ian for pointing this out).  &lt;br /&gt;&lt;br /&gt;Next week there should be a bug fix version that will take care of slow mounts and some slow client tool routines on busy machines. &lt;br /&gt;&lt;br /&gt;&lt;a href="http://savannah.cern.ch/bugs/?86349"&gt;http://savannah.cern.ch/bugs/?86349&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;But since the upgrade procedure is so easy and the corrupted files problem &lt;br /&gt;&lt;br /&gt;&lt;a href="http://savannah.cern.ch/support/?122564"&gt;http://savannah.cern.ch/support/?122564&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;is fixed in cvmfs &gt;2.0.2, I decided to upgrade anyway on Wednesday to avoid further errors in atlas and possibly lhcb.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;NOTE:&lt;/span&gt; Of course I tested each step on a few nodes to check everything worked before rolling out with cfengine on all nodes. 
It is always good practice not to follow recipes blindly!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-2305746187189553152?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/2305746187189553152/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=2305746187189553152' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/2305746187189553152'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/2305746187189553152'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2011/09/cvmfs-upgrade-to-203.html' title='cvmfs upgrade to 2.0.3'/><author><name>Alessandra Forti</name><uri>http://www.blogger.com/profile/11973932320387024088</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://4.bp.blogspot.com/_nB3QAxoLs5o/SUfJ_bcRvXI/AAAAAAAAARk/nlOg9eaXhaY/S220/patyten_seaOttersSwim.jpg'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-2331111094368022698</id><published>2011-07-06T15:52:00.045Z</published><updated>2011-07-13T22:10:20.362Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='manchester cvmfs installation'/><title type='text'>cvmfs installation</title><content type='html'>Last week, after a few months' delay, I finally installed cvmfs. I have advocated the &lt;a href="http://www.slac.stanford.edu/econf/C0303241/proc/papers/MOAT011.PDF"&gt;use of a shared file system&lt;/a&gt; for the input sandbox with locally cached data since 2002-2003. 
AFS was successfully used in grid and non-grid environments by BaBar users and is still used by local non-LHC users in Manchester for small work. So I'm pretty happy that a lightweight caching file system is now available for more robust traffic. This is a really good moment to install cvmfs for two reasons:&lt;br /&gt;&lt;br /&gt;1) Lhcb asked for it too.&lt;br /&gt;2) Atlas has moved its condb files from the HOTDISK space token to cvmfs. &lt;br /&gt;&lt;br /&gt;It should also drastically reduce errors due to both NFS and SE load. &lt;br /&gt;&lt;br /&gt;These are my installation notes:&lt;br /&gt;&lt;br /&gt;* Install cernvm.repo: you can find it &lt;a href="http://cvmrepo.web.cern.ch/cvmrepo/yum/cernvm.repo"&gt;here&lt;/a&gt;, or you can copy the rpms to your local repository and install from there. I distribute the file with cfengine, but otherwise:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;cd /etc/yum.repos.d/&lt;br /&gt;wget http://cvmrepo.web.cern.ch/cvmrepo/yum/cernvm.repo&lt;/span&gt;&lt;br /&gt;      &lt;br /&gt;* Install the gpg key: yum didn't like the key and was giving errors. I don't know if the problem is only mine (possible); I told the developers anyway, and in the meantime I had to remove the key check from the repo file and trust the rpms. But if you want to try it, it might work for you:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;cd  /etc/pki/rpm-gpg/&lt;br /&gt;wget http://cvmrepo.web.cern.ch/cvmrepo/yum/RPM-GPG-KEY-CernVM &lt;/span&gt;&lt;br /&gt;&lt;br /&gt;* Install the rpms. The documentation lists an additional rpm, cvmfs-auto-setup, which is not really necessary and was also causing problems due to some migration lines devised for upgrades. Other than that it runs a setup and a restart command that can be run by your configuration tool of choice. S. 
Traylen also suggested installing SL_no_colorls to avoid ls /cvmfs mounting all the file systems, which is why it's in the list.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;yum install -y fuse cvmfs-keys cvmfs cvmfs-init-scripts SL_no_colorls&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;* Install configuration files. Below is what I added. For atlas the docs mention a nightlies repository, but that's not ready yet and isn't going to work. The default QUOTA_LIMIT set in default.local can be overridden in the experiment configuration. For each of these files there is a &lt;span style="font-weight:bold;"&gt;.conf&lt;/span&gt; file and a &lt;span style="font-weight:bold;"&gt;.local&lt;/span&gt; one; you should edit only &lt;span style="font-weight:bold;"&gt;.local&lt;/span&gt;. If they are not there just create them.&lt;br /&gt;You need to override the CVMFS_SERVER_URL for atlas, otherwise you don't get the new setup. In cern.ch.local, instead, I simply changed the order of the servers to get RAL first and then the other two if RAL fails. 
I also removed CERNVM_SERVER_URL which appears in cern.ch.conf otherwise it goes to CERN first even though it's not apparently defined anywhere.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;/etc/cvmfs/default.local &lt;br /&gt;CVMFS_REPOSITORIES=atlas,atlas-condb,lhcb&lt;br /&gt;CVMFS_CACHE_BASE=/scratch/var/cache/cvmfs2&lt;br /&gt;CVMFS_QUOTA_LIMIT=2000&lt;br /&gt;CVMFS_HTTP_PROXY="http://[YOUR-SQUID-CACHE]:3128"&lt;br /&gt;&lt;br /&gt;/etc/cvmfs/config.d/atlas.cern.ch.local &lt;br /&gt;CVMFS_QUOTA_LIMIT=10000&lt;br /&gt;CVMFS_SERVER_URL=http://cvmfs-stratum-one.cern.ch/opt/atlas-newns&lt;br /&gt;&lt;br /&gt;/etc/cvmfs/config.d/lhcb.cern.ch.local &lt;br /&gt;CVMFS_QUOTA_LIMIT=5000&lt;br /&gt;&lt;br /&gt;/etc/cvmfs/domain.d/cern.ch.local&lt;br /&gt;CVMFS_SERVER_URL="http://cernvmfs.gridpp.rl.ac.uk/opt/@org@;http://cvmfs-stratum-one.cern.ch/opt/@org@;http://cvmfs.racf.bnl.gov/opt/@org@"&lt;br /&gt;CVMFS_PUBLIC_KEY=/etc/cvmfs/keys/cern.ch.pub&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;* Create the cache space. By default it's in /var/cache. However I moved it to the /scratch partition which is bigger.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;mkdir -p /scratch/var/cache/cvmfs2&lt;br /&gt;chown cvmfs:cvmfs /scratch/var/cache/cvmfs2&lt;br /&gt;chmod 2755 /scratch/var/cache/cvmfs2 &lt;/span&gt;&lt;br /&gt;&lt;br /&gt;* Run the setup. These are the commands the cvmfs-auto-setup would run at installation time. They also configure fuse although that's only one line added to fuse.conf.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;/usr/bin/cvmfs_config setup&lt;br /&gt;service cvmfs restartautofs&lt;br /&gt;&lt;br /&gt;chkconfig cvmfs on&lt;br /&gt;service cvmfs restart&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;* Some parameters need to change for squid. Below is what the documentation suggests. I tuned it to the size of my machine. 
For example the maximum_object_size and cache_mem were too big, and I checked which other parameters were already set to decide whether they needed changing.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;collapsed_forwarding on&lt;br /&gt;max_filedesc 8192&lt;br /&gt;maximum_object_size 4096 MB&lt;br /&gt;cache_mem 4096 MB&lt;br /&gt;maximum_object_size_in_memory 32 KB&lt;br /&gt;cache_dir ufs /var/spool/squid 50000 16 256&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;* Apply changes for &lt;span style="font-weight:bold;"&gt;Lhcb&lt;/span&gt;: the VO_LHCB_SW_DIR needs to point to cvmfs. You can change it in YAIM and rerun it, or you can do as I've done (still making sure to change YAIM so that freshly installed nodes don't need this hack). Lhcb with this change is good to go.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;sed -i.sed.bak 's%/nfs/lhcb%/cvmfs/lhcb.cern.ch%' /etc/profile.d/grid-env.sh&lt;br /&gt;mv /etc/profile.d/grid-env.sh.sed.bak /root&lt;/span&gt;&lt;br /&gt; &lt;br /&gt;* Apply changes for &lt;span style="font-weight:bold;"&gt;Atlas&lt;/span&gt;. A similar change to VO_ATLAS_SW_DIR is required and you need to set an additional variable that is not handled by YAIM. For now I added it to grid-env.sh, but it would be better placed in another file not touched by YAIM, or a snippet should be added to YAIM to handle the variable. This is enough for the jobs to start using the software area. However you still have to contact the atlas sw team to do their validation tests and enable the condb use. They'll propose a long way and a short way. I took the short one because I didn't want to go into downtime and jobs were already running using the new setup. 
&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;sed -i.sed.2 's%"/nfs/atlas"%"/cvmfs/atlas.cern.ch/repo/sw"\ngridenv_set         "ATLAS_LOCAL_AREA" "/nfs/atlas/local"%' /etc/profile.d/grid-env.sh&lt;br /&gt;mv /etc/profile.d/grid-env.sh.sed.2 /root&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;* Still for &lt;span style="font-weight:bold;"&gt;Atlas&lt;/span&gt;, remove some installed &lt;span style="font-weight:bold;"&gt;.conf&lt;/span&gt; files that create a link in /opt which is no longer necessary. The second file might not exist, but there is an atlas-nightly.cern.ch.conf. This will surely change in future cvmfs releases.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;service cvmfs stop&lt;br /&gt;rm /etc/cvmfs/config.d/atlas.cern.ch.conf&lt;br /&gt;rm /etc/cvmfs/config.d/atlas-condb.cern.ch.conf&lt;br /&gt;service cvmfs start&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;Update 12/7/2011: Using YAIM&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;cfengine only installs the rpms and the configuration files (*.local). All the rest is now carried out by a YAIM function I created (config_cvmfs). I put a tar file &lt;a href="http://www.sysadmin.hep.ac.uk/svn/fabric-management/cvmfs/cvmfs-yaim.tar"&gt;here&lt;/a&gt;. To make it work I also added a node description in node-info.d/cvmfs (also in the tar file) that contains it. This way I don't have to touch any existing YAIM files and I can just add -n CVMFS to the YAIM command line we use to configure the WNs. 
It requires ATLAS_LOCAL_AREA and CVMFS_CACHE_DIR variables to be set in your site-info.def.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;CVMFS docs are here&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;a href="http://cernvm.cern.ch/portal/node/126"&gt;Release Notes&lt;/a&gt;&lt;br /&gt;&lt;a href="http://cernvm.cern.ch/portal/node/127"&gt;Init Scripts Overview&lt;/a&gt;&lt;br /&gt;&lt;a href="http://cernvm.cern.ch/portal/node/123"&gt;Examples&lt;/a&gt;&lt;br /&gt;&lt;a href="https://cernvm.cern.ch/project/trac/downloads/cernvm/cvmfstech-0.2.70-1.pdf"&gt;Technical Report&lt;/a&gt;&lt;br /&gt;&lt;a href="http://www.gridpp.ac.uk/wiki/RAL_Tier1_CVMFS"&gt;RAL T1&lt;/a&gt;&lt;br /&gt;&lt;a href="https://twiki.cern.ch/twiki/bin/view/Atlas/Tier3CVMFS2SLC5"&gt;Atlas T2/T3 setup&lt;/a&gt;&lt;br /&gt;&lt;a href="https://twiki.cern.ch/twiki/bin/view/Atlas/CernVMFS#Changes_to_CVMFS_Client_Setup_an"&gt;Atlas latest changes&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-2331111094368022698?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/2331111094368022698/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=2331111094368022698' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/2331111094368022698'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/2331111094368022698'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2011/07/cvmfs-installation.html' title='cvmfs installation'/><author><name>Alessandra 
Forti</name><uri>http://www.blogger.com/profile/11973932320387024088</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://4.bp.blogspot.com/_nB3QAxoLs5o/SUfJ_bcRvXI/AAAAAAAAARk/nlOg9eaXhaY/S220/patyten_seaOttersSwim.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-7477550209853345563</id><published>2011-06-22T13:01:00.006Z</published><updated>2011-07-11T23:46:46.908Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='apel nagios alert remove'/><title type='text'>How to remove apel warnings and avoid nagios alerts</title><content type='html'>Quite a few sites have a few entries in APEL that don't quite match. They can appear with one of two messages:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;OK [ Minor discrepancy in even numbers ]&lt;br /&gt;WARN [ Missing data detected ]&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;They don't look good on the Sync page and nagios also sends alerts for this problem, which is even more annoying.&lt;br /&gt;&lt;br /&gt;The problem is caused by a few records with the wrong time stamp (StartTime=01-01-1970). These records need to be deleted from the local database and the period where they appear republished with the gap publisher. To delete the records connect to your local APEL mysql and run:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;mysql&gt; delete from LcgRecords where StartTimeEpoch = 0;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Then for each month where the entries appear rerun the gap publisher. 
Finally, rerun the publisher in missing records mode to update the SYNC page, or simply wait for the next regular run if you are not in a hurry.&lt;br /&gt;&lt;br /&gt;Thanks to Cristina for this useful tip she gave me in &lt;a href="https://ggus.eu/ws/ticket_info.php?ticket=70801"&gt;this ticket&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-7477550209853345563?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/7477550209853345563/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=7477550209853345563' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/7477550209853345563'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/7477550209853345563'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2011/06/how-to-remove-apel-warnings-and-avoid.html' title='How to remove apel warnings and avoid nagios alerts'/><author><name>Alessandra Forti</name><uri>http://www.blogger.com/profile/11973932320387024088</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://4.bp.blogspot.com/_nB3QAxoLs5o/SUfJ_bcRvXI/AAAAAAAAARk/nlOg9eaXhaY/S220/patyten_seaOttersSwim.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-2793338616336205351</id><published>2011-06-14T15:46:00.024Z</published><updated>2011-06-23T10:22:05.104Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='manchester dpm optimization'/><title type='text'>DPM optimization next round</title><content 
type='html'>After I applied 3 of the mysql parameter changes I talk about in &lt;a href="http://northgrid-tech.blogspot.com/2011/06/dpm-optimization.html"&gt;this post&lt;/a&gt; I didn't see the improvement I was hoping for with the atlas job time outs.&lt;br /&gt;&lt;br /&gt;This is another set of optimizations I put together after further searching.&lt;br /&gt;&lt;br /&gt;First of all I started to systematically count the TIME_WAIT connections every five minutes. In the same log file I also correlated them with the number of concurrent threads the server keeps, mostly in sleep mode. You can get the latter by running &lt;span style="font-weight:bold;"&gt;mysqladmin -p proc stat&lt;/span&gt; or from within a mysql command line. The number of threads was near the default maximum allowed by mysql so I doubled that in my.cnf&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;max_connections=200&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;then I halved tcp_fin_timeout, the kernel time out I associated with the TIME_WAIT connections (strictly speaking it sets how long orphaned connections stay in FIN_WAIT_2) &lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;sysctl -w net.ipv4.tcp_fin_timeout=30&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;the default value is 60 sec. 
If you add it to /etc/sysctl.conf it becomes permanent.&lt;br /&gt;&lt;br /&gt;Finally I found this article which explicitly talks about mysql tunings to reduce connection timeouts: &lt;a href="http://www.mysqlperformanceblog.com/2011/04/19/mysql-connection-timeouts/"&gt;Mysql Connection Timeouts&lt;/a&gt; and I set the following&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;sysctl -w net.ipv4.tcp_max_syn_backlog=8192&lt;br /&gt;sysctl -w net.core.somaxconn=512&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;again add them to /etc/sysctl.conf to make them permanent; and I added in my.cnf&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;back_log=500&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;I based my numbers on 500 connections/s because that is what I observed when I did all this (at times I observed even larger numbers). Admittedly they are now stable at 330 connections per second, but we haven't had any heavy ramp up since Saturday. There was only a mild one and it didn't cause any time out. I'm waiting for a serious ramp as a definitive test. That said, since Saturday we haven't seen any timeout errors, not even the low background that was always present. So there is already an improvement.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;Update 16/06/2011&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Today there was an atlas &lt;a href="http://ks.tier2.hep.manchester.ac.uk/T2/atlas/20110616-atlas-jobs.png"&gt;ramp from almost 0 to &gt;1400 jobs&lt;/a&gt; and no time outs so far.&lt;br /&gt;&lt;br /&gt;A few timeouts were seen yesterday, but they were due to authentication between the head node and a couple of data servers, which I will have to investigate. They are only a handful, nowhere near the scale observed before, and not due to mysql. I will still keep things under observation for a while longer. 
Just in case.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-2793338616336205351?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/2793338616336205351/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=2793338616336205351' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/2793338616336205351'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/2793338616336205351'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2011/06/dpm-optimization-next-round.html' title='DPM optimization next round'/><author><name>Alessandra Forti</name><uri>http://www.blogger.com/profile/11973932320387024088</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://4.bp.blogspot.com/_nB3QAxoLs5o/SUfJ_bcRvXI/AAAAAAAAARk/nlOg9eaXhaY/S220/patyten_seaOttersSwim.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-9170509917256537238</id><published>2011-06-10T07:32:00.032Z</published><updated>2011-06-11T16:51:57.645Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='manchester dpm optimization'/><title type='text'>DPM Optimization</title><content type='html'>My quest to optimize DPM  continues. Bottlenecks are like Russian dolls and hide behind each other. 
After optimizing the data servers by increasing the &lt;a href="http://northgrid-tech.blogspot.com/2010/08/tuning-areca-raid-controllers-for-xfs.html"&gt;block device read ahead&lt;/a&gt;, &lt;a href="www.gridpp.ac.uk/gridpp24/ChannelBonding.pdf"&gt;enabling lacp&lt;/a&gt; on network channel bonding and multiplying the atlas hotdisk files, there is still a problem with mysql on the head node which causes time outs.&lt;br /&gt;&lt;br /&gt;When atlas ramps up there is often an increase in the number of connections in TIME_WAIT. I observed &gt;2600 at times. The mysql database becomes completely unresponsive and causes the time outs. Restarting the database causes the connections to finally close and the database to resume normal activity. Although a restart might alleviate the problem, as usual it's not a cure. So I went on a quest. What follows might not alleviate my specific problem, as I haven't tested it in production yet, but it certainly helps with another: DB reload. &lt;br /&gt;&lt;br /&gt;Sam already wrote some performance tuning tips here: &lt;a href="http://www.gridpp.ac.uk/wiki/Performance_and_Tuning"&gt;Performance and Tuning&lt;/a&gt;, most notably the setting of &lt;span style="font-style:italic;"&gt;innodb_buffer_pool_size&lt;/span&gt;. After a discussion on the DPM user forum and some testing this is what I'd add:&lt;br /&gt;&lt;br /&gt;I set "DPM     REQCLEAN        3m" when I upgraded to DPM 1.7.4 and this, after a reload, has reduced the Manchester DB file size from 17GB to 7.6GB. Dumping the db took 7m34s. I then reloaded it with different combinations of suggested my.cnf &lt;a href="http://dev.mysql.com/doc/refman/5.0/en/innodb-parameters.html"&gt;innodb parameters&lt;/a&gt; and the effects of some of them are dramatic.&lt;br /&gt;&lt;br /&gt;The default parameters should definitely be avoided. Reloading a database with the default parameters takes several hours. 
Last time it took 17-18 hours, this time I interrupted it after 4.&lt;br /&gt;&lt;br /&gt;With a combination of the parameters suggested by Maarten the time is drastically reduced. In particular the most effective have been setting &lt;span style="font-style:italic;"&gt;innodb_buffer_pool_size&lt;/span&gt; and &lt;span style="font-style:italic;"&gt;innodb_log_file_size&lt;/span&gt;. Below are the results of the reload tests I made, in decreasing order of time. I then followed Jean-Philippe's suggestion to drop the requests tables. Dropping the tables took several minutes and it was slightly faster with a single db file. After I dropped the tables and the indexes, the ibdata1 size dropped to 1.2GB and using combination 4 below it took &lt;span style="font-weight:bold;"&gt;1m7s to dump and 5m7s to reload&lt;/span&gt;. With the one file per table configuration reloading was slightly faster, but after I dropped the requests tables there was no difference. It is also balanced by the fact that deletion seems slower, and the effects are probably more visible when the database is bigger, so these small tests don't give any compelling reason for or against it for now.&lt;br /&gt;&lt;br /&gt;These are steps that help reduce the time it takes to reload the database:&lt;br /&gt;&lt;br /&gt;1) Enable &lt;span style="font-style:italic;"&gt;REQCLEAN&lt;/span&gt; in shift.conf (I set it to 3 months to comply with security requirements.)&lt;br /&gt;2) set &lt;span style="font-style:italic;"&gt;innodb_buffer_pool_size&lt;/span&gt; in my.cnf (I set it to 10% of the machine memory and couldn't see much difference when I later set it to 22.5%, but in production it might be another story with repeated queries for the same input files)&lt;br /&gt;3) set &lt;span style="font-style:italic;"&gt;innodb_log_file_size&lt;/span&gt;  in my.cnf (I didn't give much thought to this, Maarten's value of 50MB seemed good enough. 
&lt;a href="http://dev.mysql.com/doc/refman/5.0/en/binary-log.html"&gt;Binary log files&lt;/a&gt; need to be removed to enable this and the database restarted, but check the docs: this might not be a valid strategy if you make heavier use of the binary logs.)&lt;br /&gt;4) set &lt;span style="font-style:italic;"&gt;innodb_flush_log_at_trx_commit = 2&lt;/span&gt; in my.cnf (although this parameter seems less effective during reload it might be useful in production; 2 is slightly safer than 0).&lt;br /&gt;5) Use the &lt;a href="http://www.sysadmin.hep.ac.uk/svn/fabric-management/dpm/dpm-drop-requests-tables.sql"&gt;script&lt;/a&gt; Jean-Philippe gave me to drop the requests tables before an upgrade.&lt;br /&gt;&lt;br /&gt;Hopefully they will also help stop the time outs.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;Tests:&lt;/span&gt;&lt;br /&gt;&lt;span style="font-style:italic;"&gt;&lt;br /&gt;COMBINATION 1&lt;br /&gt;&lt;br /&gt;innodb_buffer_pool_size = 400MB&lt;br /&gt;# innodb_log_file_size = 50MB&lt;br /&gt;innodb_flush_log_at_trx_commit = 2&lt;br /&gt;# innodb_file_per_table&lt;br /&gt;&lt;br /&gt;real    167m30.226s&lt;br /&gt;user    1m41.860s&lt;br /&gt;sys    0m9.987s&lt;br /&gt;&lt;br /&gt;============================&lt;br /&gt;COMBINATION 2&lt;br /&gt;innodb_buffer_pool_size = 900MB&lt;br /&gt;# innodb_log_file_size = 50MB&lt;br /&gt;# innodb_flush_log_at_trx_commit = 2&lt;br /&gt;# innodb_file_per_table&lt;br /&gt;   &lt;br /&gt;real    155m2.996s&lt;br /&gt;user    1m40.843s&lt;br /&gt;sys    0m9.935s&lt;br /&gt;&lt;br /&gt;===========================&lt;br /&gt;COMBINATION 3&lt;br /&gt;innodb_buffer_pool_size = 900MB&lt;br /&gt;innodb_log_file_size = 50MB&lt;br /&gt;# innodb_flush_log_at_trx_commit = 2&lt;br /&gt;# innodb_file_per_table&lt;br /&gt;&lt;br /&gt;real    49m2.683s&lt;br /&gt;user    1m39.137s&lt;br /&gt;sys    0m9.902s&lt;br /&gt;===========================&lt;br /&gt;COMBINATION 4&lt;br /&gt;innodb_buffer_pool_size = 
400MB&lt;br /&gt;innodb_log_file_size = 50MB&lt;br /&gt;innodb_flush_log_at_trx_commit = 2 &lt;-- test also with 0 instead of 2 but it didn't change the time it took and 2 is slightly safer&lt;br /&gt;# innodb_file_per_table&lt;br /&gt;&lt;br /&gt;real    48m32.398s&lt;br /&gt;user    1m40.638s&lt;br /&gt;sys    0m9.733s&lt;br /&gt;===========================&lt;br /&gt;COMBINATION 5&lt;br /&gt;innodb_buffer_pool_size = 900MB&lt;br /&gt;innodb_log_file_size = 50MB&lt;br /&gt;innodb_flush_log_at_trx_commit = 2&lt;br /&gt;innodb_file_per_table&lt;br /&gt;&lt;br /&gt;real    47m25.109s&lt;br /&gt;user    1m39.230s&lt;br /&gt;sys    0m9.985s&lt;br /&gt;===========================&lt;br /&gt;COMBINATION 6&lt;br /&gt;innodb_buffer_pool_size = 400MB&lt;br /&gt;innodb_log_file_size = 50MB&lt;br /&gt;innodb_flush_log_at_trx_commit = 2&lt;br /&gt;innodb_file_per_table&lt;br /&gt;&lt;br /&gt;real    46m46.850s&lt;br /&gt;user    1m40.378s&lt;br /&gt;sys    0m9.950s&lt;br /&gt;===========================&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-9170509917256537238?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/9170509917256537238/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=9170509917256537238' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/9170509917256537238'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/9170509917256537238'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2011/06/dpm-optimization.html' title='DPM Optimization'/><author><name>Alessandra 
Forti</name><uri>http://www.blogger.com/profile/11973932320387024088</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://4.bp.blogspot.com/_nB3QAxoLs5o/SUfJ_bcRvXI/AAAAAAAAARk/nlOg9eaXhaY/S220/patyten_seaOttersSwim.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-3355555367352141561</id><published>2011-05-20T07:02:00.014Z</published><updated>2011-05-20T08:05:20.684Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='manchester bdii'/><title type='text'>BDII again</title><content type='html'>A couple of weeks ago I upgraded the site BDII and top BDII from a very old version without reinstalling, as described &lt;a href="http://northgrid-tech.blogspot.com/2011/05/bdii-follow-up.html"&gt;in this post&lt;/a&gt;. A few days ago I noticed that not everything was working as well as I thought: the BDII was reporting stale numbers in the dynamic attributes, causing a few problems, among them biomed submitting an unhealthy 12k jobs. &lt;br /&gt;&lt;br /&gt;There were two reasons for this:&lt;br /&gt;&lt;br /&gt;1) the unprivileged user that runs the BDII is no longer edguser but ldap. Consequently there were some ownership issues in /opt/glite/var subdirectories and files. This was highlighted in &lt;span style="font-weight:bold;"&gt;/var/log/bdii/bdii-update.log&lt;/span&gt; by permission denied errors which I overlooked for a bit too long. Ownership should be as follows: &lt;span style="font-weight:bold;"&gt;/opt/glite/var, /opt/glite/var/lock, /opt/glite/var/tmp and /opt/glite/var/cache&lt;/span&gt; should belong to root and anything below them should belong to ldap. You can check if there is anything that doesn't belong to ldap by running &lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;find /opt/glite/var/ ! 
-user ldap -ls&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;this will include the top directories above, which you can ignore.&lt;br /&gt;&lt;br /&gt;2) bdii-update no longer uses glite-info-wrapper and glite-info-generic, which used to write the .ldif files in the same directory tree above. It now writes what it needs into the /var/run/bdii databases and a single new.ldif file, calling directly the scripts in &lt;span style="font-weight:bold;"&gt;/opt/glite/etc/gip/provider&lt;/span&gt; and &lt;span style="font-weight:bold;"&gt;/opt/glite/etc/gip/plugin&lt;/span&gt;. I upgraded from an older version and the old providers weren't deleted but continued to be executed by bdii-update. Some of them still read what are now obsolete .ldif.&lt;chksum&gt; files under the &lt;span style="font-weight:bold;"&gt;/opt/glite/var/cache&lt;/span&gt; tree. I deleted all the .ldif files with an additional numeric extension under /opt/glite/var.&lt;br /&gt;&lt;br /&gt;With these two changes, i.e. fixing the ownership of the directories and deleting the obsolete .ldif files (or the old providers if one is sure which ones they are), the site bdii started updating the dynamic attributes correctly again.&lt;br /&gt;&lt;br /&gt;Finally a note on making it easier to reinstall: in the previous post I suggested manually adding &lt;span style="font-weight:bold;"&gt;SLAPD=/usr/sbin/slapd2.4&lt;/span&gt; to the newly installed &lt;span style="font-weight:bold;"&gt;/opt/bdii/etc/bdii.conf&lt;/span&gt; to change the slapd version. 
However an easier way to maintain the service in case it needs reinstallation is to add &lt;span style="font-weight:bold;"&gt;SLAPD=/usr/sbin/slapd2.4&lt;/span&gt; to site-info.def so that when YAIM runs it gets added to /etc/sysconfig/bdii and no manual step is needed if the machine is reinstalled.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-3355555367352141561?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/3355555367352141561/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=3355555367352141561' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/3355555367352141561'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/3355555367352141561'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2011/05/bdii-again.html' title='BDII again'/><author><name>Alessandra Forti</name><uri>http://www.blogger.com/profile/11973932320387024088</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://4.bp.blogspot.com/_nB3QAxoLs5o/SUfJ_bcRvXI/AAAAAAAAARk/nlOg9eaXhaY/S220/patyten_seaOttersSwim.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-9203073398818776514</id><published>2011-05-04T10:11:00.021Z</published><updated>2011-05-20T07:01:54.616Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='manchester bdii'/><title type='text'>BDII follow up</title><content type='html'>To decrease the need of restarting the BDII and following &lt;a 
href="https://www.jiscmail.ac.uk/cgi-bin/webadmin?A1=ind1105&amp;L=TB-SUPPORT#3"&gt;the discussion on tb-support&lt;/a&gt; I decided to upgrade to openldap2.4. Since I was at it I also updated both glite-BDII_site and glite-BDII_top (the list of new rpms is below) to the latest split repositories, since we still had the older common glite-BDII repo. The newest version of BDII also has new paths for most things. For example some config files have been moved to /etc/bdii and /var/run/bdii is the new SLAPD_VAR_DIR. The setup of the repos is peculiar to Manchester, where we mirror the latest version every day but the machines pick up from a stable repository that is updated when needed.&lt;br /&gt;&lt;br /&gt;1) Rsynced glite-BDII_site and glite-BDII_top from Glite-3.2-latest to Glite-3.2 stable&lt;br /&gt;&lt;br /&gt;2) Added the rpm to the local external repository from the BDII_top RPMS.external dir so it can also be picked up by BDII_site and, if needed, by CEs and SE.&lt;br /&gt;&lt;br /&gt;3) Created new repo files and added them to cvs&lt;br /&gt;&lt;br /&gt;4) Edited cf.yaim-repos to copy them&lt;br /&gt;&lt;br /&gt;5) Installed manually (yum install) the rpms openldap2.4 openldap2.4-servers and their dependencies lib64ldap2.4_2 openldap2.4-extra-schemas on BDII_site. In the glite-BDII_top case they are pulled in as dependencies so there is no need for this. &lt;br /&gt;   &lt;span style="font-weight:bold;"&gt;# This step can be added in cfengine at a later stage if needed.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;6) mv /opt/bdii/etc/bdii.conf.rpmnew /opt/bdii/etc/bdii.conf  &lt;br /&gt;   &lt;span style="font-weight:bold;"&gt;# Contains the pointer to the new bdii-slapd.conf which contains the new paths. 
bdii/slapd won't restart with the old bdii.conf.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;7) Add SLAPD=/usr/sbin/slapd2.4 to the new /opt/bdii/etc/bdii.conf &lt;br /&gt;   &lt;span style="font-weight:bold;"&gt;# This can go in yaim post function if one really wants.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;8) Rerun YAIM&lt;br /&gt;&lt;br /&gt;9) Reduced the rate the cron job checks the bdii from 5 to 20 mins. Top bdii seemed to take longer to rebuild probably due to an expired cache causing a loop.&lt;br /&gt;&lt;br /&gt;Crossing fingers it will work and stop the BDII periodically hanging.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;New Site BDII RPMS&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;bdii-5.1.22-1&lt;br /&gt;bdii-config-site-0.9.1-1&lt;br /&gt;glite-BDII_site-3.2.11-1.sl5&lt;br /&gt;glite-yaim-bdii-4.1.12-1&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;New Top BDII RPMS&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;bdii-5.1.22-1&lt;br /&gt;bdii-config-top-0.0.9-1&lt;br /&gt;glite-BDII_top-3.2.11-1.sl5&lt;br /&gt;glite-yaim-bdii-4.1.12-1&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;Openldap2.4 RPMS&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;lib64ldap2.4_2-2.4.22-1.el5&lt;br /&gt;openldap2.4-2.4.22-1.el5&lt;br /&gt;openldap2.4-extra-schemas-1.3-10.el5&lt;br /&gt;openldap2.4-servers-2.4.22-1.el5&lt;br /&gt;&lt;br /&gt;UPDATE 20/&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-9203073398818776514?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/9203073398818776514/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=9203073398818776514' title='1 Comments'/><link rel='edit' type='application/atom+xml' 
href='http://www.blogger.com/feeds/4670756400590062347/posts/default/9203073398818776514'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/9203073398818776514'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2011/05/bdii-follow-up.html' title='BDII follow up'/><author><name>Alessandra Forti</name><uri>http://www.blogger.com/profile/11973932320387024088</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://4.bp.blogspot.com/_nB3QAxoLs5o/SUfJ_bcRvXI/AAAAAAAAARk/nlOg9eaXhaY/S220/patyten_seaOttersSwim.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-8950125233418266592</id><published>2011-04-07T07:31:00.007Z</published><updated>2011-05-04T10:41:51.069Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='manchester bdii'/><title type='text'>Check BDII script updated</title><content type='html'>Yesterday the top BDII stopped working rather than the site BDII. It crashed. The pid file was still there but the process was not running. &lt;br /&gt;&lt;br /&gt;So I adjusted the script to use a different query that works on all levels of bdii (resource, site, top) looking for o=infosys rather than o=grid and some specific attribute. 
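&lt;br /&gt;&lt;br /&gt;As an illustration of that query, here is a minimal sketch, not the published testbdii.sh. The function name bdii_responds, the default host, port 2170 and the 10 second limit are assumptions made for the example; only the o=infosys base comes from the change described above.

```shell
#!/bin/bash
# Sketch only: bdii_responds and its defaults are illustrative assumptions,
# not taken from testbdii.sh.
bdii_responds() {
    local host="${1:-localhost}" port="${2:-2170}"
    # -x: simple bind, -LLL: terse LDIF output, -l 10: search time limit of 10 seconds,
    # -s base: ask only for the o=infosys entry itself instead of walking the tree.
    ldapsearch -x -LLL -l 10 -H "ldap://${host}:${port}" -b "o=infosys" -s base dn > /dev/null
}
```

A wrapper would call bdii_responds with the host name of the node and trigger the restart and the alert when it returns non-zero.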
&lt;br /&gt;&lt;br /&gt;I also looked at the bdii startup script and it does a good job of cleaning up processes and lock/pid files in the stop function, so I just use service bdii restart whether the process is there or not; only the alert differs between the two cases.&lt;br /&gt;&lt;br /&gt;The new version is still in &lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.sysadmin.hep.ac.uk/svn/fabric-management/processes/monitoring/testbdii.sh"&gt;http://www.sysadmin.hep.ac.uk/svn/fabric-management/processes/monitoring/testbdii.sh&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-8950125233418266592?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/8950125233418266592/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=8950125233418266592' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/8950125233418266592'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/8950125233418266592'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2011/04/check-bdii-script-updated.html' title='Check BDII script updated'/><author><name>Alessandra Forti</name><uri>http://www.blogger.com/profile/11973932320387024088</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' 
src='http://4.bp.blogspot.com/_nB3QAxoLs5o/SUfJ_bcRvXI/AAAAAAAAARk/nlOg9eaXhaY/S220/patyten_seaOttersSwim.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-3111112947146631765</id><published>2011-04-04T15:11:00.008Z</published><updated>2011-04-06T10:43:55.260Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='manchester scripts system administration monitoring sharing'/><title type='text'>Sharing scripts</title><content type='html'>In my Northgrid talk at GridPP I pointed out that we all do the same things but in slightly different ways, so I thought it'd be good to resume the thread on sharing management/monitoring tools. I always thought building a repository was a good thing and I still do.&lt;br /&gt;&lt;br /&gt;I think the tools should be as generic as possible but do not need to be perfect. Of course if scripts work out of the box it's a bonus, but they might also be useful to improve local tools with additional checks one might not have thought about.&lt;br /&gt;&lt;br /&gt;I'll start with a couple of scripts I rewrote last Monday to make them more robust:&lt;br /&gt;&lt;br /&gt;-- Check the BDII:&lt;br /&gt;&lt;br /&gt;   &lt;a href="http://www.sysadmin.hep.ac.uk/svn/fabric-management/processes/monitoring/testbdii.sh"&gt;http://www.sysadmin.hep.ac.uk/svn/fabric-management/processes/monitoring/testbdii.sh&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;The original script checked that a network connection existed and, if it didn't, restarted the bdii service. 
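&lt;br /&gt;&lt;br /&gt;The decision the rewritten check makes, described in the next paragraph, can be sketched roughly as follows. This is an illustrative assumption rather than the content of testbdii.sh: decide_bdii_action is a hypothetical helper, and the caller is assumed to supply whether slapd answered a query and how many slapd processes it found.

```shell
#!/bin/bash
# Hypothetical helper, not part of testbdii.sh.
# $1: "yes" if slapd answered a query, anything else otherwise.
# $2: number of slapd processes currently running.
# Prints the action a wrapper script would take.
decide_bdii_action() {
    local responsive="$1" nprocs="$2"
    if [ "$responsive" = "yes" ]; then
        echo "none"
    elif [ "$nprocs" -gt 0 ]; then
        # slapd is there but does not answer: treat it as hung
        echo "kill-and-restart"
    else
        # no slapd process at all: a plain restart is enough
        echo "restart"
    fi
}
```

The wrapper would then kill the hung slapd when needed and run service bdii restart, sending a different alert in each case.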
&lt;br /&gt;&lt;br /&gt;The new version checks whether slapd is responsive. If it isn't, it checks for a hung process: if there is one it kills it and restarts the bdii, otherwise it just restarts the bdii.&lt;br /&gt;&lt;br /&gt;-- Check Host Certificate End Date:&lt;br /&gt;&lt;br /&gt;    &lt;a href="http://www.sysadmin.hep.ac.uk/svn/fabric-management/certificates/x509/check-host-cert-date.sh"&gt;http://www.sysadmin.hep.ac.uk/svn/fabric-management/certificates/x509/check-host-cert-date.sh&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;The old version just checked whether the certificate had expired and sent an alert. That is not very useful in itself, as it picks up the problem when the damage is already done. &lt;br /&gt;&lt;br /&gt;The new version still checks that, because it might be useful if machines have been down for a while, and it also starts to send alerts 30 days before the expiration date. Finally, if the certificate is not there it asks the obvious question: should you be running this script on this machine?&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-3111112947146631765?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/3111112947146631765/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=3111112947146631765' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/3111112947146631765'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/3111112947146631765'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2011/04/sharing-scripts.html' title='Sharing scripts'/><author><name>Alessandra 
Forti</name><uri>http://www.blogger.com/profile/11973932320387024088</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://4.bp.blogspot.com/_nB3QAxoLs5o/SUfJ_bcRvXI/AAAAAAAAARk/nlOg9eaXhaY/S220/patyten_seaOttersSwim.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-6659973328023758409</id><published>2010-09-09T12:42:00.015Z</published><updated>2010-09-09T14:20:26.679Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='manchester new hardware computing nodes storage'/><title type='text'>Manchester new hardware</title><content type='html'>We are in the process of installing the new hardware. I knew it was going to be compact, but it is one thing to read on paper that a 2U unit has 48 CPUs and can replace 24 of the old 1U machines, and another to see it. Half of the grandiosity of the old 900-machine cluster is gone: 20 of the new machines replace 450 of the old ones in terms of cores. Our new little jewels. :) &lt;br /&gt;&lt;br /&gt;The first 3 rows in each rack are the computing nodes, the machines at the bottom are the storage units. The storage has also become unbelievably compact and cheap. When we bought the DELL cluster, 500TB was an enormity and extremely expensive if organised in proper data servers, which is why we tried to use the WNs' disks. The new storage is 540TB of usable space, fits in 9 4U machines and is considered commodity computing nowadays. Well... almost. 
;)&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_nB3QAxoLs5o/TIjZ0vnpFkI/AAAAAAAAAWQ/s0JXadhyB9o/s1600/line-up.jpeg"&gt;&lt;img style="float:left; margin:0 10px 10px 0;cursor:pointer; cursor:hand;width: 320px; height: 213px;" src="http://2.bp.blogspot.com/_nB3QAxoLs5o/TIjZ0vnpFkI/AAAAAAAAAWQ/s0JXadhyB9o/s320/line-up.jpeg" border="0" alt=""id="BLOGGER_PHOTO_ID_5514897243874334274" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_nB3QAxoLs5o/TIjY81CAEPI/AAAAAAAAAV4/Us0ZvygRwt0/s1600/front-close.jpeg"&gt;&lt;img style="float:left; margin:0 10px 10px 0;cursor:pointer; cursor:hand;width: 320px; height: 213px;" src="http://2.bp.blogspot.com/_nB3QAxoLs5o/TIjY81CAEPI/AAAAAAAAAV4/Us0ZvygRwt0/s320/front-close.jpeg" border="0" alt=""id="BLOGGER_PHOTO_ID_5514896283254395122" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_nB3QAxoLs5o/TIjZiCUPAgI/AAAAAAAAAWA/tuKElBYlvjo/s1600/rear-close.jpeg"&gt;&lt;img style="float:left; margin:0 10px 10px 0;cursor:pointer; cursor:hand;width: 320px; height: 213px;" src="http://2.bp.blogspot.com/_nB3QAxoLs5o/TIjZiCUPAgI/AAAAAAAAAWA/tuKElBYlvjo/s320/rear-close.jpeg" border="0" alt=""id="BLOGGER_PHOTO_ID_5514896922475692546" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_nB3QAxoLs5o/TIjZsmeIfsI/AAAAAAAAAWI/pzb59yktokI/s1600/rear-far.jpeg"&gt;&lt;img style="float:left; margin:0 10px 10px 0;cursor:pointer; cursor:hand;width: 320px; height: 213px;" src="http://1.bp.blogspot.com/_nB3QAxoLs5o/TIjZsmeIfsI/AAAAAAAAAWI/pzb59yktokI/s320/rear-far.jpeg" border="0" alt=""id="BLOGGER_PHOTO_ID_5514897103979577026" /&gt;&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' 
height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-6659973328023758409?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/6659973328023758409/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=6659973328023758409' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/6659973328023758409'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/6659973328023758409'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2010/09/manchester-new-hardware.html' title='Manchester new hardware'/><author><name>Alessandra Forti</name><uri>http://www.blogger.com/profile/11973932320387024088</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://4.bp.blogspot.com/_nB3QAxoLs5o/SUfJ_bcRvXI/AAAAAAAAARk/nlOg9eaXhaY/S220/patyten_seaOttersSwim.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/_nB3QAxoLs5o/TIjZ0vnpFkI/AAAAAAAAAWQ/s0JXadhyB9o/s72-c/line-up.jpeg' height='72' width='72'/><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-6695444791729291835</id><published>2010-09-04T06:59:00.028Z</published><updated>2010-09-22T14:21:03.098Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='manchester squid mrtg snmp atlas monitoring'/><title type='text'>How to enable Atlas squid monitoring</title><content type='html'>Atlas has started to monitor squids with mrtg. 
&lt;br /&gt;&lt;br /&gt;&lt;a href="http://frontier.cern.ch/squidstats/indexatlas.html"&gt;http://frontier.cern.ch/squidstats/indexatlas.html&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;mrtg uses snmp, so to enable the monitoring you need a squid instance compiled with --enable-snmp. &lt;a href="https://twiki.cern.ch/twiki/bin/view/PDBService/SquidRPMsTier1andTier2"&gt;CERN binaries&lt;/a&gt; are already compiled with that option; the default squid that comes with SL5 might not be, and your site's centralised squid service might not be either. You don't need &lt;span style="font-style:italic;"&gt;snmpd&lt;/span&gt; or &lt;span style="font-style:italic;"&gt;snmptrapd&lt;/span&gt; (&lt;span style="font-style:italic;"&gt;net-snmp&lt;/span&gt; rpm) running to make it work.&lt;br /&gt;&lt;br /&gt;Once you are sure the binary is compiled with the right options and that port 3401 is not blocked by any firewall, you need to add these lines to &lt;span style="font-style:italic;"&gt;squid.conf&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style:italic;"&gt;acl SNMPHOSTS src 128.142.202.0/24 localhost&lt;br /&gt;acl SNMPMON snmp_community public&lt;br /&gt;snmp_access allow SNMPMON SNMPHOSTS&lt;br /&gt;snmp_access deny all&lt;br /&gt;snmp_port 3401&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Again, if you are using the CERN rpms and the default frontier configuration you might not need to do that, as there are already ACL lines for the monitoring. 
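&lt;br /&gt;&lt;br /&gt;A quick way to see whether those lines are already there is something like the sketch below. It is only an illustration: the function names and the default path /etc/squid/squid.conf are my assumptions (they are not part of squid or of the CERN rpms), so point it at wherever your squid.conf actually lives.&lt;br /&gt;
&lt;pre&gt;
```shell
# Sketch only: the function names and the default path are assumptions,
# not part of squid or of the CERN frontier-squid rpms.

# Succeeds if FILE ($1) has an uncommented line starting with REGEX ($2).
has_line() {
    grep -Eq "^[[:space:]]*$2" "$1"
}

# Reports which SNMP directives are missing from a squid.conf.
# Returns 0 if all three kinds of line are present, 1 otherwise.
check_squid_snmp() {
    local conf="${1:-/etc/squid/squid.conf}"
    local status=0

    has_line "$conf" 'acl[[:space:]]+[^[:space:]]+[[:space:]]+snmp_community[[:space:]]' \
        || { echo "missing: acl ... snmp_community"; status=1; }
    has_line "$conf" 'snmp_access[[:space:]]+allow[[:space:]]' \
        || { echo "missing: snmp_access allow"; status=1; }
    has_line "$conf" 'snmp_port[[:space:]]+[0-9]+' \
        || { echo "missing: snmp_port"; status=1; }

    if [ "$status" -eq 0 ]; then
        echo "SNMP directives found in $conf"
    fi
    return "$status"
}
```
&lt;/pre&gt;
Source the functions and run check_squid_snmp /path/to/squid.conf; a non-zero exit status means at least one of the directives is missing (or the file could not be read) and the lines above still need adding. It only looks at the directive names, not at the ACL names or the network range, so you should still read the file yourself.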
&lt;br /&gt;&lt;br /&gt;Reload the configuration&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style:italic;"&gt;service squid reload&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Test it &lt;br /&gt;&lt;br /&gt;&lt;span style="font-style:italic;"&gt;snmpwalk  -v2c -Cc -c public localhost:3401 .1.3.6.1.4.1.3495.1.1&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;you should get something similar to this:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style:italic;"&gt;SNMPv2-SMI::enterprises.3495.1.1.1.0 = INTEGER: 206648&lt;br /&gt;SNMPv2-SMI::enterprises.3495.1.1.2.0 = INTEGER: 500136&lt;br /&gt;SNMPv2-SMI::enterprises.3495.1.1.3.0 = Timeticks: (23672459) 2 days, 17:45:24.59&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style:italic;"&gt;snmpwalk&lt;/span&gt; is part of &lt;span style="font-style:italic;"&gt;net-snmp-utils&lt;/span&gt; rpm.&lt;br /&gt;&lt;br /&gt;It takes a while for the monitoring to catch up. Don't expect an immediate response. &lt;br /&gt;&lt;br /&gt;Additional information on squid/snmp can be found here &lt;a href="http://wiki.squid-cache.org/Features/Snmp"&gt;http://wiki.squid-cache.org/Features/Snmp&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;NOTE:&lt;/span&gt; If you are also upgrading pay attention to the fact that in the latest CERN rpms the init script fn-local-squid.sh might try to regenerate your squid.conf.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-6695444791729291835?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/6695444791729291835/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=6695444791729291835' title='0 Comments'/><link rel='edit' type='application/atom+xml' 
href='http://www.blogger.com/feeds/4670756400590062347/posts/default/6695444791729291835'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/6695444791729291835'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2010/09/how-to-enable-atlas-squid-monitoring.html' title='How to enable Atlas squid monitoring'/><author><name>Alessandra Forti</name><uri>http://www.blogger.com/profile/11973932320387024088</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://4.bp.blogspot.com/_nB3QAxoLs5o/SUfJ_bcRvXI/AAAAAAAAARk/nlOg9eaXhaY/S220/patyten_seaOttersSwim.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-7182955013993167236</id><published>2010-08-31T23:14:00.017Z</published><updated>2010-09-01T07:47:43.424Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='atlas file system tweak xfs sl5 jobs efficiency manchester'/><title type='text'>Atlas jobs in Manchester</title><content type='html'>August has seen a really notable increase in Atlas user pilot jobs: over 34000 jobs, of which more than 12000 in the last 4 days alone. 
Plotting the number of jobs since the beginning of the year shows an inversion between production and user pilots.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_nB3QAxoLs5o/TH2OPnPST4I/AAAAAAAAAVQ/wBgtvVkb5fo/s1600/num-of-atlas-jobs.png"&gt;&lt;img style="float:left; margin:0 10px 10px 0;cursor:pointer; cursor:hand;width: 500px; height: 375px;" src="http://4.bp.blogspot.com/_nB3QAxoLs5o/TH2OPnPST4I/AAAAAAAAAVQ/wBgtvVkb5fo/s320/num-of-atlas-jobs.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5511717917853634434" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;The trend in August was probably helped by moving all the space to the DATADISK space token and attracting more interesting data. LOCALGROUP is also heavily used in Manchester.&lt;br /&gt;&lt;br /&gt;In the past 4 days we have also applied the &lt;a href="http://northgrid-tech.blogspot.com/2010/08/tuning-areca-raid-controllers-for-xfs.html"&gt;XFS file system tuning&lt;/a&gt; suggested by John, which solves the load on the data servers experienced since upgrading to SL5. The tweak has notably increased the data throughput and reduced the load on the data servers practically to zero, allowing us to increase the number of concurrent jobs. 
This has allowed a bigger job throughput and has clearly improved the job efficiency. The most inefficient jobs are now the very short ones (&amp;lt;10 mins CPU time), and even for those the improvement is notable, as the plots below show.&lt;br /&gt;&lt;p&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;Before applying the tweak&lt;/span&gt;&lt;br /&gt;&lt;p&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_nB3QAxoLs5o/TH2VklhpJpI/AAAAAAAAAVY/UW5fOShD594/s1600/atlas-jobs-eff-vs-cput-noteff.png"&gt;&lt;img style="float:left; margin:0 10px 10px 0;cursor:pointer; cursor:hand;width: 500px; height: 375px;" src="http://3.bp.blogspot.com/_nB3QAxoLs5o/TH2VklhpJpI/AAAAAAAAAVY/UW5fOShD594/s320/atlas-jobs-eff-vs-cput-noteff.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5511725974752405138" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;p&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;After applying the tweak&lt;/span&gt;&lt;br /&gt;&lt;p&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_nB3QAxoLs5o/TH2WjadwuyI/AAAAAAAAAVg/Ju4Vqzgtzao/s1600/atlas-jobs-eff-vs-cput-eff.png"&gt;&lt;img style="float:left; margin:0 10px 10px 0;cursor:pointer; cursor:hand;width: 500px; height: 375px;" src="http://1.bp.blogspot.com/_nB3QAxoLs5o/TH2WjadwuyI/AAAAAAAAAVg/Ju4Vqzgtzao/s320/atlas-jobs-eff-vs-cput-eff.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5511727054115093282" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;p&gt;&lt;br /&gt;This also means we can keep on using XFS for the data servers, which currently has more flexibility as far as partition sizes are concerned.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-7182955013993167236?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' 
href='http://northgrid-tech.blogspot.com/feeds/7182955013993167236/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=7182955013993167236' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/7182955013993167236'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/7182955013993167236'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2010/08/number-of-atlas-job-in-manchester.html' title='Atlas jobs in Manchester'/><author><name>Alessandra Forti</name><uri>http://www.blogger.com/profile/11973932320387024088</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://4.bp.blogspot.com/_nB3QAxoLs5o/SUfJ_bcRvXI/AAAAAAAAARk/nlOg9eaXhaY/S220/patyten_seaOttersSwim.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/_nB3QAxoLs5o/TH2OPnPST4I/AAAAAAAAAVQ/wBgtvVkb5fo/s72-c/num-of-atlas-jobs.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-8343671891529650212</id><published>2010-08-31T12:48:00.001Z</published><updated>2010-08-31T17:24:12.004Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='file system xfs sl5 read ahead tweak kernel liverpool'/><title type='text'>Tuning Areca RAID controllers for XFS on SL5</title><content type='html'>&lt;pre wrap=""&gt;Sites (including Liverpool) running DPM on pool nodes running SL5 with XFS file systems have been experiencing very high (up to multiple 100s Load Average and close to 100% CPU IO WAIT) load when a number of analysis jobs were accessing data simultaneously with rfcp.

The exact same hardware and file systems under SL4 had shown no excessive 
load, and the SL5 systems had shown no problems under system stress testing/burn-in. Also, the problem was occurring from a relatively small number of parallel transfers (about 5 or more on Liverpool's systems were enough to show an increased load compared to SL4).

Some admins have found that using ext4 at least alleviates the problem although apparently it still occurs under enough load. Migrating production servers with TBs of live data from one FS to another isn't hard but would be a drawn out process for many sites.

The fundamental problem for either FS appears to be IOPS overload on the arrays rather than sheer throughput, although why this is occurring so much under SL5 and not under SL4 is still a bit of a mystery. There may be changes in controller drivers, XFS, kernel block access, DPM access patterns or default parameters.

When faced with an IOPS overload (one that leaves throughput well below the theoretical maximum of the array) one solution is to make each IO operation fetch more data from the storage device so that you need to make fewer but larger read requests.

This leads to the actual fix (we have been doing this by default on our 3ware systems but we just assumed the Areca defaults were already optimal).&lt;/pre&gt;&lt;pre wrap=""&gt;&lt;/pre&gt;&lt;pre wrap=""&gt;&lt;span style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;blockdev --setra 16384 &lt;/span&gt;&lt;i class="moz-txt-slash" style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;&lt;span class="moz-txt-tag"&gt;/&lt;/span&gt;dev&lt;span class="moz-txt-tag"&gt;/&lt;/span&gt;&lt;/i&gt;&lt;span style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;$RAIDDEVICE&lt;/span&gt;

This sets the block device read ahead to 16384 512-byte sectors, i.e. (16384/2)kB (8MB). We have previously (on 3ware controllers) had to do this to get the full throughput from the controller. The default on our Areca 1280MLs is 128 (64kB read ahead). 
So when lots of parallel transfers are occurring our arrays have been thrashing spindles pulling off small 64kB chunks from each different file. These files are usually many hundreds or thousands of MB where reading MBs at a time would be much more efficient.

The mystery for us is more why the SL4 systems &lt;b class="moz-txt-star"&gt;&lt;span class="moz-txt-tag"&gt;*&lt;/span&gt;don't&lt;span class="moz-txt-tag"&gt;*&lt;/span&gt;&lt;/b&gt; overload rather than why SL5 does, as the SL4 systems use the exact same default values.

Here is a ganglia plot of our pool nodes under about as much load as we can put on them at the moment. Note that previously our SL5 nodes would have LAs in the 10s or 100s under this load or less.

&lt;a class="moz-txt-link-freetext" href="http://hep.ph.liv.ac.uk/%7Ejbland/xfs-fix.html"&gt;http://hep.ph.liv.ac.uk/~jbland/xfs-fix.html&lt;/a&gt;

Any time the systems go above 1LA now is when they're also having data written at a high rate. On that note we also hadn't configured our Arecas to have their block max sector size aligned with the RAID chunk size with

&lt;span style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;echo "64" &amp;gt; &lt;/span&gt;&lt;i class="moz-txt-slash" style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;&lt;span class="moz-txt-tag"&gt;/&lt;/span&gt;sys/block&lt;span class="moz-txt-tag"&gt;/&lt;/span&gt;&lt;/i&gt;&lt;span style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;$RAIDDEVICE/queue/max_sectors_kb&lt;/span&gt;

although we don't think this had any bearing on the overloading and might not be necessary.&amp;nbsp;&lt;/pre&gt;&lt;pre wrap=""&gt;We expect the tweak to also work for systems running ext4 as the underlying hardware access would still be a bottle neck, just at a different level of access.

Note that this 'fix' doesn't fix the even more fundamental problem as pointed out by others that DPM doesn't rate limit connections to pool nodes. 
All this fix does is (hopefully) push the current limit where overload occurs above the point that our WNs can pull data.

There is also a concern that using a big read ahead may affect small random (RFIO) access although the sites can tune this parameter very quickly to get optimum access. 8MB is slightly arbitrary but 64kB is certainly too small for any sensible access I can envisage to LHC data. Most access is via full file copy (rfcp) reads at the moment.&lt;/pre&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-8343671891529650212?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/8343671891529650212/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=8343671891529650212' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/8343671891529650212'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/8343671891529650212'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2010/08/tuning-areca-raid-controllers-for-xfs.html' title='Tuning Areca RAID controllers for XFS on SL5'/><author><name>John Bland</name><uri>http://www.blogger.com/profile/16051241409269392358</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-909458889291272489</id><published>2010-08-18T13:35:00.013Z</published><updated>2010-08-18T17:57:16.621Z</updated><category scheme='http://www.blogger.com/atom/ns#' 
term='manchester file systems worker nodes evaluation'/><title type='text'>ext4 vs ext3  round(1)</title><content type='html'>I started yesterday to look at ext4 vs ext3 performance with iozone. I installed two old Dell WNs with the same file system layout and the same RAID level 0, but one with ext3 and one with ext4 on / and on /scratch, the directories used by the jobs. Both machines have the default mount values and the kernel is 2.6.18-194.8.1.el5.&lt;br /&gt;&lt;br /&gt;I performed the tests on the /scratch partition, writing the log in /. I did it twice: once mounting and unmounting the fs at each test, so as to delete any trace of information from the buffer cache, and once leaving the fs mounted between tests. Tests were automatically repeated for file sizes from 64kB to 4GB and record lengths from 4kB to 16384kB. Iozone automatically doubles the previous size at each test (4GB is the largest such size below the 5GB file size limit I set). &lt;br /&gt;&lt;br /&gt;From the numbers ext4 performs much better in writing, while reading is basically the same if not slightly worse for smaller files. There is a big drop in performance for both file systems at the 4GB size.&lt;br /&gt;&lt;br /&gt;What I find confusing, however, is that when I did the tests again, setting the max file size to 100M and doing only write tests, ext3 took less time (22 secs vs 44s in this case) despite the numbers saying that writing with ext4 is almost 40% faster. Something slows the tests down (deleting?). 
The speed of the tests becomes similar for sizes &gt;500MB: both decrease steadily until they finally drop at 4GB for any record length in both file systems.&lt;br /&gt;&lt;br /&gt;Below are some results, mostly with the buffer cache, because not having it mainly affects ext3 for small file sizes and record lengths, as shown in the first graph.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;EXT3: write (NO buffer cache)&lt;br&gt;==================&lt;/span&gt;&lt;br /&gt;&lt;p&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://www.hep.manchester.ac.uk/u/aforti/iozone/ext3/nocache/write/write.png"&gt;&lt;img style="float:left; margin:0 10px 10px 0;cursor:pointer; cursor:hand;width: 500px; height: 375px;" src="http://www.hep.manchester.ac.uk/u/aforti/iozone/ext3/nocache/write/write.png" border="0" alt="" /&gt;&lt;/a&gt; &lt;br /&gt;&lt;p&gt;&lt;span style="font-weight:bold;"&gt;EXT3: write (buffer cache)&lt;br&gt;==================&lt;/span&gt;&lt;br /&gt;&lt;p&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://www.hep.manchester.ac.uk/u/aforti/iozone/ext3/cache/write/write.png"&gt;&lt;img style="float:left; margin:0 10px 10px 0;cursor:pointer; cursor:hand;width: 500px; height: 375px;" src="http://www.hep.manchester.ac.uk/u/aforti/iozone/ext3/cache/write/write.png" border="0" alt="" /&gt;&lt;/a&gt; &lt;br /&gt;&lt;p&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;EXT4: write (buffer cache)&lt;br&gt;==================&lt;/span&gt;&lt;br /&gt;&lt;p&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://www.hep.manchester.ac.uk/u/aforti/iozone/ext4/cache/write/write.png"&gt;&lt;img style="float:left; margin:0 10px 10px 0;cursor:pointer; cursor:hand;width: 500px; height: 375px;" src="http://www.hep.manchester.ac.uk/u/aforti/iozone/ext4/cache/write/write.png" border="0" alt="" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;p&gt;&lt;br 
/&gt;&lt;span style="font-weight:bold;"&gt;EXT3: read (buffer cache)&lt;br&gt;==================&lt;/span&gt;&lt;br /&gt;&lt;p&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://www.hep.manchester.ac.uk/u/aforti/iozone/ext3/cache/read/read.png"&gt;&lt;img style="float:left; margin:0 10px 10px 0;cursor:pointer; cursor:hand;width: 500px; height: 375px;" src="http://www.hep.manchester.ac.uk/u/aforti/iozone/ext3/cache/read/read.png" border="0" alt="" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;p&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;EXT4: read (buffer cache)&lt;br&gt;==================&lt;/span&gt;&lt;br /&gt;&lt;p&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://www.hep.manchester.ac.uk/u/aforti/iozone/ext4/cache/read/read.png"&gt;&lt;img style="float:left; margin:0 10px 10px 0;cursor:pointer; cursor:hand;width: 500px; height: 375px;" src="http://www.hep.manchester.ac.uk/u/aforti/iozone/ext4/cache/read/read.png" border="0" alt="" /&gt;&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-909458889291272489?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/909458889291272489/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=909458889291272489' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/909458889291272489'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/909458889291272489'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2010/08/ext4-vs-ext3-round1.html' title='ext4 vs ext3  
round(1)'/><author><name>Alessandra Forti</name><uri>http://www.blogger.com/profile/11973932320387024088</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://4.bp.blogspot.com/_nB3QAxoLs5o/SUfJ_bcRvXI/AAAAAAAAARk/nlOg9eaXhaY/S220/patyten_seaOttersSwim.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-9061964453565071647</id><published>2010-08-04T10:19:00.008Z</published><updated>2010-08-04T10:39:06.140Z</updated><title type='text'>Biomed VOMS server CA DN has changed</title><content type='html'>Biomed is opening GGUS tickets for non-working sites. Apparently they are getting organised and now have someone to do some sort of support.&lt;br /&gt;&lt;br /&gt;They opened one for Manchester too. Actually we were slightly flooded with tickets, so we must have some decent data on the storage.&lt;br /&gt;&lt;br /&gt;The problem turned out to be that on 18/6/2010 the biomed VOMS server CA DN changed. 
If you find these messages (if you google for them you get only source code entries) in your DPM srmv2.2 log files: &lt;br /&gt;&lt;br /&gt;&lt;span style="font-style:italic;"&gt;08/03 11:14:30  4863,0 srmv2.2: SRM02 - soap_serve error : [::ffff:134.214.204.110] (kingkong.creatis.insa-lyon.fr) : CGSI-gSOAP running on bohr3226.tier2.hep.manchester.ac.uk reports Error retrieveing the VOMS credentials&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;then you know you must update the configuration on your system, replacing&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style:italic;"&gt;/C=FR/O=CNRS/CN=GRID-FR&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;with &lt;br /&gt;&lt;br /&gt;&lt;span style="font-style:italic;"&gt;/C=FR/O=CNRS/CN=GRID2-FR&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;in&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style:italic;"&gt;/etc/grid-security/vomsdir/biomed/cclcgvomsli01.in2p3.fr.lsc&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;Note:&lt;/span&gt; don't forget to update YAIM too if you don't want it to override your change. 
I updated the YAIM configuration on the GridPP wiki &lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.gridpp.ac.uk/wiki/GridPP_approved_VOs#IN2P3_VOMS_server_VOs"&gt;http://www.gridpp.ac.uk/wiki/GridPP_approved_VOs#IN2P3_VOMS_server_VOs&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-9061964453565071647?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/9061964453565071647/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=9061964453565071647' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/9061964453565071647'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/9061964453565071647'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2010/08/biomed-voms-server-ca-dn-has-changed.html' title='Biomed VOMS server CA DN has changed'/><author><name>Alessandra Forti</name><uri>http://www.blogger.com/profile/11973932320387024088</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://4.bp.blogspot.com/_nB3QAxoLs5o/SUfJ_bcRvXI/AAAAAAAAARk/nlOg9eaXhaY/S220/patyten_seaOttersSwim.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-5577247641368641174</id><published>2010-07-27T11:50:00.012Z</published><updated>2010-07-27T14:17:08.871Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='manchester apel sl5 glite installation'/><title type='text'>Moving APEL to SL5</title><content type='html'>We have moved APEL from the SL4 MON box to the SL5 version that 
works standalone without RGMA (finally!). The site BDII has also been transferred to this machine from the MON box. This is how it is set up. &lt;br /&gt;&lt;br /&gt;*) Request a certificate for the machine if you don't have one already.&lt;br /&gt;*) Kickstart a vanilla SL5 machine with two RAID1 disks.&lt;br /&gt;*) Install mysql-server-5.0.77-3.el5 (it's in the SL5 repository)&lt;br /&gt;*) Remove /var/lib/mysql and recreate it empty (you can skip this but I messed around with it earlier and needed a clean dir).&lt;br /&gt;*) Start mysqld&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style:italic;"&gt;service mysqld start&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;It will tell you at this point to create the root password.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style:italic;"&gt;/usr/bin/mysqladmin -u root password 'pwd-in-site-info.def'&lt;br /&gt;/usr/bin/mysqladmin -u root -h &amp;lt;machine-fqdn&amp;gt; password 'pwd-in-site-info.def'&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;*) Install the certificate (we have it directly in cfengine).&lt;br /&gt;*) Set up the yum repositories if your configuration tool doesn't do it already&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style:italic;"&gt;cd /etc/yum.repos.d/&lt;br /&gt;wget http://grid-deployment.web.cern.ch/grid-deployment/glite/repos/3.2/glite-APEL.repo&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;*) Install glite-APEL&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style:italic;"&gt;yum install glite-APEL&lt;/span&gt;&lt;br /&gt;  &lt;br /&gt;*) Run yaim: it sets up the database and, most importantly, the ACL. If you have more than one CE to publish, you need to run it once for each CE, changing the name of the CE in site-info.def each time; alternatively, if you are skilled with SQL, you can set the permissions yourself so that each CE has write access.&lt;br /&gt;&lt;br /&gt; &lt;span style="font-style:italic;"&gt;/opt/glite/yaim/bin/yaim -s /opt/glite/yaim/etc/site-info.def -c -n glite-APEL&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;*) &lt;span 
style="font-weight:bold;"&gt;BUG:&lt;/span&gt; APEL still uses JAVA. Every time it runs it creates a JAVA key store with all the CAs and the host certificate added to it. Your machine may end up with both the OS JAVA version and the one you install (normally 1.6). The tool used to create the keystore file is called by a script without setting the path, so if you have both versions of the command it is likely that the OS one is called because it resides in /usr/bin. Needless to say, the OS version is older and doesn't have all the options used in the APEL script. There are a number of ways to fix this: I modified the script to use the absolute path, but you can also change the link target in /usr/bin or add a modified path to the APEL cron job. The culprit script is this:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style:italic;"&gt;/opt/glite/share/glite-apel-publisher/scripts/key_trust_store_maker.sh&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;and belongs to &lt;br /&gt;&lt;br /&gt;&lt;span style="font-style:italic;"&gt;glite-apel-publisher-2.0.12-7.rpm&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The problem is known and apparently a fix is in certification. My ticket is here&lt;br /&gt;&lt;br /&gt;&lt;a href="https://gus.fzk.de/ws/ticket_info.php?ticket=60452"&gt;https://gus.fzk.de/ws/ticket_info.php?ticket=60452&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;*) Register the machine in GOCDB making sure you tick glite-APEL and not APEL to mark it as a service.&lt;br /&gt;&lt;br /&gt;*) &lt;span style="font-weight:bold;"&gt;BUG:&lt;/span&gt; UK host certificates have an email attribute. This email has a different format in the output of different clients. When you register the machine put the host DN as it is. Then open a GGUS ticket for APEL so they can change it internally. This is also known and tracked in a savannah bug, but at the moment they have to change it manually. 
Below is the savannah bug.&lt;br /&gt;&lt;br /&gt;&lt;a href="https://savannah.cern.ch/bugs/?70628"&gt;https://savannah.cern.ch/bugs/?70628&lt;br /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;*) Dump the DB on the old MON box with mysqldump. I thought I could tar it up but it didn't like it, so I used this instead.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style:italic;"&gt;mysqldump -C -Q -u root -p accounting | gzip -c &gt; accounting.sql.gz&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;*) Copy the dump to the new machine and reload it&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style:italic;"&gt;zcat accounting.sql.gz | mysql -u root -p accounting&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;*) Run APEL manually and see how it goes (command is in the cron job).&lt;br /&gt;&lt;br /&gt;If you are happy with it go on with the last two steps, otherwise you have hit an additional problem that I haven't come across.&lt;br /&gt;&lt;br /&gt;*) Disable the publisher on the old machine, i.e. remove the cron job.&lt;br /&gt;&lt;br /&gt;*) Modify parser-config-yaim.xml for all the CEs so they point to the new machine. The line to modify is &lt;br /&gt;&lt;br /&gt;&lt;span style="font-style:italic;"&gt;&amp;lt;DBURL&amp;gt;jdbc:mysql://&amp;lt;new-machine-fqdn&amp;gt;:3306/accounting&amp;lt;/DBURL&amp;gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;SWITCHING OFF RGMA&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;When I was happy with the new APEL machine I turned off RGMA and removed it from the services published by the BDII and the GOCDB. This caused the GlueSite object to disappear from our site BDII. 
You need to have the site BDII in the list of services published before you remove RGMA.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-5577247641368641174?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/5577247641368641174/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=5577247641368641174' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/5577247641368641174'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/5577247641368641174'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2010/07/moving-apel-to-sl5.html' title='Moving APEL to SL5'/><author><name>Alessandra Forti</name><uri>http://www.blogger.com/profile/11973932320387024088</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://4.bp.blogspot.com/_nB3QAxoLs5o/SUfJ_bcRvXI/AAAAAAAAARk/nlOg9eaXhaY/S220/patyten_seaOttersSwim.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-5418726371193140819</id><published>2010-06-29T09:20:00.006Z</published><updated>2010-06-29T13:34:39.931Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='Lancaster'/><category scheme='http://www.blogger.com/atom/ns#' term='atlas'/><category scheme='http://www.blogger.com/atom/ns#' term='fabric'/><title type='text'>Occupying Lancaster's new data centre</title><content type='html'>&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" 
href="http://3.bp.blogspot.com/_mw1Yh1GNWVc/TCm_c-Y4T4I/AAAAAAAABXo/5vXIdl2ECOQ/s1600/DSC_0089.jpg"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 213px; height: 320px;" src="http://3.bp.blogspot.com/_mw1Yh1GNWVc/TCm_c-Y4T4I/AAAAAAAABXo/5vXIdl2ECOQ/s320/DSC_0089.jpg" border="0" alt=""id="BLOGGER_PHOTO_ID_5488128125432254338" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Lancaster had a scheduled downtime yesterday to relocate half our storage and CPU to the new High End Computing data centre. The move went very smoothly and both storage and CPU are back online running ATLAS (and other) jobs. The new HEC facility is a significant investment from Lancaster University with the multi million pound building housing central IT facilities, co-location, and HEC data centres.&lt;br /&gt;&lt;br /&gt;Lancaster's GridPP operations have a large stake in HEC with Roger Jones (GridPP/ATLAS) taking directorship of the facility. Our future hardware procurement will be located in this room which has a 35-rack capacity using water-cooled Rittel racks. Below are some photographs of the room as it looks today. 
Ten racks are in place with two being occupied by GridPP hardware and the remaining eight to be populated in July with £800k worth of compute and storage resource.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_mw1Yh1GNWVc/TCnBRz4zLXI/AAAAAAAABXw/GeEdxWqeyF4/s1600/DSC_0092.jpg"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 320px; height: 213px;" src="http://2.bp.blogspot.com/_mw1Yh1GNWVc/TCnBRz4zLXI/AAAAAAAABXw/GeEdxWqeyF4/s320/DSC_0092.jpg" border="0" alt=""id="BLOGGER_PHOTO_ID_5488130132658040178" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_mw1Yh1GNWVc/TCnBbqq6WNI/AAAAAAAABX4/riytwcEU6gU/s1600/DSC_0096.jpg"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 320px; height: 213px;" src="http://4.bp.blogspot.com/_mw1Yh1GNWVc/TCnBbqq6WNI/AAAAAAAABX4/riytwcEU6gU/s320/DSC_0096.jpg" border="0" alt=""id="BLOGGER_PHOTO_ID_5488130301982562514" /&gt;&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-5418726371193140819?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/5418726371193140819/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=5418726371193140819' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/5418726371193140819'/><link rel='self' type='application/atom+xml' 
href='http://www.blogger.com/feeds/4670756400590062347/posts/default/5418726371193140819'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2010/06/occupying-lancasters-new-data-centre.html' title='Occupying Lancaster&apos;s new data centre'/><author><name>Peter</name><uri>http://www.blogger.com/profile/05855046025692405834</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/_mw1Yh1GNWVc/TCm_c-Y4T4I/AAAAAAAABXo/5vXIdl2ECOQ/s72-c/DSC_0089.jpg' height='72' width='72'/><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-4312673632861897113</id><published>2010-04-20T13:35:00.003Z</published><updated>2010-04-20T13:50:00.990Z</updated><title type='text'>Scaling, capacity publishing and accounting</title><content type='html'>&lt;b&gt;Introduction&lt;/b&gt;&lt;/p&gt;  &lt;p style="margin-bottom: 0in;"&gt;&lt;span style=""&gt;Our main cluster at Liverpool is homogeneous at&lt;/span&gt; present, but that will change shortly. This document is part of our preparation for supporting &lt;span style=""&gt;a heterogeneous cluster. It will help us to find the scaling values and HEPSPEC06 power settings for correct publishing and accounting in a heterogeneous cluster that uses TORQUE/Maui without sub-clustering. I haven't fully tested the work, but I think it's correct and &lt;/span&gt;I hope it's useful for sysadmins at other sites.  &lt;/p&gt;  &lt;p style="margin-bottom: 0in;"&gt;&lt;b&gt;Clock limits&lt;/b&gt;&lt;/p&gt;  &lt;p style="margin-bottom: 0in;"&gt;There are two types of time limit operating on our cluster; a wall clock limit and a CPU  time limit. The wall clock limit is the duration in real time that a job can last before it gets killed by the system. 
The CPU time limit is the amount of time that the job can run on a CPU before it gets killed. Example: Say you have a single-CPU system, with two slots on it. When busy, it would be running two jobs. Each job would get about half the CPU. Therefore, it would make good sense to give it a wall clock limit of around twice the CPU limit, because the CPU is divided between two tasks. In systems where you have one job slot per CPU, wall and CPU limits are around the same value.&lt;/p&gt;  &lt;p style="margin-bottom: 0in;"&gt;Notes:  &lt;/p&gt;  &lt;ol&gt;&lt;li&gt;&lt;p style="margin-bottom: 0in;"&gt;In practice, a job may not make  efficient use of a CPU – it may wait for input or output to occur.  A factor is sometimes applied to the wall clock limit to try to  account for CPU inefficiencies. E.g. on our single-CPU system, with  two job slots on it, we might choose a wall clock limit of twice the  CPU limit, then add some margin to the wall clock limit to account  for CPU inefficiency.&lt;/p&gt;  &lt;p style="margin-bottom: 0in;"&gt;&lt;/p&gt;  &lt;/li&gt;&lt;li&gt;&lt;p style="margin-bottom: 0in;"&gt;From now on, I assume that  multi-core nodes have a full complement of slots, and are not  over-subscribed, i.e. a system with N cores has N slots.&lt;/p&gt;  &lt;p style="margin-bottom: 0in;"&gt;&lt;/p&gt; &lt;/li&gt;&lt;/ol&gt; &lt;p style="margin-bottom: 0in;"&gt;&lt;b&gt;Scaling factors&lt;/b&gt;&lt;/p&gt;  &lt;p style="margin-bottom: 0in;"&gt;A scaling factor is a value that makes systems comparable. Say you have systemA, which executes 10 billion instructions per second. Let’s say you have a time limit of 10 hours. We assume one job slot per CPU. Now let us say that, after a year, we add a new system to our cluster, say systemB, which runs 20 billion instructions per second.&lt;/p&gt;  &lt;p style="margin-bottom: 0in;"&gt;What should we do about the time limits to account for the different CPU speeds? 
We could have separate limits for each system, or we could use a scaling factor. The scaling factor could be equal to 1 on systemA and 2 on systemB. When deciding whether to kill a job, we take the time limit, and divide it by the scaling factor. This would be 10/1 for systemA, and 10/2 for systemB. Thus, if the job has been running on systemB for more than 5 hours, it gets killed. We have normalised the clock limits and made them comparable.&lt;/p&gt;  &lt;p style="margin-bottom: 0in;"&gt;This can be used to expand an existing cluster with faster nodes, without requiring different clock limits for each node type. You have one limit, with different scaling factors.&lt;/p&gt;  &lt;p style="margin-bottom: 0in;"&gt;&lt;b&gt;Another use for the scaling factor is for accounting&lt;/b&gt;&lt;/p&gt;  &lt;p style="margin-bottom: 0in;"&gt;The scaling factor is applied on the nodes themselves. The head node, which dispatches jobs to the worker nodes, doesn’t know about the normalised time limits. It treats all the nodes as having the same time limits. The same applies to accounting. The accounting system doesn’t know the real length of time spent by a particular system, even though 2 hours on systemB is worth as much work as 4 hours on systemA. It is transparent.&lt;/p&gt;  &lt;p style="margin-bottom: 0in;"&gt;Example: The worker nodes also normalise the accounting figures. Let’s assume a job exists that takes four hours of CPU on systemA. The calculation for the accounting would be: usage = time * scaling factor, yielding 4 * 1 = 4 hours of CPU time (I’ll explain the units used for CPU usage shortly). The same job would take 2 hours on systemB. The calculation for the accounting would be: usage = time * scaling factor, yielding 2 * 2 = 4 hours, i.e. 
though the systemB machine is twice as fast, the figures for accounting still show that the same amount of work is done to complete the job.&lt;/p&gt;  &lt;p style="margin-bottom: 0in;"&gt;&lt;b&gt;The CPUs at a site&lt;/b&gt;&lt;/p&gt;  &lt;p style="margin-bottom: 0in;"&gt;The configuration at our site contains these two definitions:&lt;/p&gt; &lt;p style="margin-bottom: 0in;"&gt;CE_PHYSCPU=776&lt;/p&gt; &lt;p style="margin-bottom: 0in;"&gt;CE_LOGCPU=776&lt;/p&gt;  &lt;p style="margin-bottom: 0in;"&gt;They describe the number of CPU chips in our cluster and the total number of logical CPUs. The intent here is to model the fact that, very often, each CPU chip has multiple logical “computing units” within it (although not on our main cluster, yet). This may be implemented as multiple-cores or hyperthreading or other things. But the idea is that we have silicon chips with other units inside that do data processing. Anyway, in our case, we have 776 logical CPUs. And we have the same number of physical CPUs because we have 1 core per CPU. We are in the process of moving our cluster to a mixed system, at which time the values for these variables will need to be worked out anew.&lt;/p&gt;  &lt;p style="margin-bottom: 0in;"&gt;&lt;b&gt;Actual values for scaling factors&lt;/b&gt;&lt;/p&gt;  &lt;p style="margin-bottom: 0in;"&gt;Now that the difference between physical and logical CPUs is apparent, I'll show ways to work out actual scaling values. The simplest example consists of a homogeneous site without scaling, where some new nodes of a different power need to be added.  
&lt;/p&gt;  &lt;p style="margin-bottom: 0in;"&gt;Example: Let's &lt;span lang="en-GB"&gt;imagine &lt;/span&gt;a &lt;span lang="en-GB"&gt;notional &lt;/span&gt; cluster of 10 systems with 2 physical CPUs with two cores in each (call this typeC) to &lt;span lang="en-GB"&gt;which&lt;/span&gt; we wish to add 5 &lt;span lang="en-GB"&gt;systems &lt;/span&gt; with 4 &lt;span lang="en-GB"&gt;physical&lt;/span&gt; CPUs with 1 core in each (call &lt;span lang="en-GB"&gt;these&lt;/span&gt; &lt;span lang="en-GB"&gt;typeD&lt;/span&gt;). To work out the &lt;span lang="en-GB"&gt;scaling&lt;/span&gt; factor to use in the new nodes, to make them &lt;span lang="en-GB"&gt;equivalent&lt;/span&gt; to the existing ones, we would use some benchmarking program, say HEPSPEC06, to obtain a value of the power for each system type. We then divide these values by the number of &lt;span lang="en-GB"&gt;logical&lt;/span&gt; CPUs in each system, &lt;span lang="en-GB"&gt;yielding&lt;/span&gt; a “per core” HEPSPEC06 value for each type of system. We can then work out the &lt;span lang="en-GB"&gt;scaling&lt;/span&gt; factor for the new system:&lt;br /&gt;&lt;/p&gt;  &lt;p style="margin-bottom: 0in; font-family: courier new;"&gt;scaling_factor=type_d_per_core_hs06/ type_c_per_core_hs06&lt;/p&gt;  &lt;p style="margin-bottom: 0in;"&gt;The &lt;span lang="en-GB"&gt;resulting&lt;/span&gt; value is then used in the /var&lt;span lang="en-GB"&gt;/spool/&lt;/span&gt;pbs/mom_priv/config file of the new worker nodes, e.g.&lt;/p&gt;  &lt;p style="margin-bottom: 0in;"&gt;&lt;span style="font-family:Courier,monospace;"&gt;$cpumult 1.86&lt;/span&gt;&lt;/p&gt; &lt;p style="margin-bottom: 0in;"&gt;&lt;span style="font-family:Courier,monospace;"&gt;$wallmult 1.86&lt;/span&gt;&lt;/p&gt;  &lt;p style="margin-bottom: 0in;"&gt;&lt;span style=""&gt;These values are taken from an &lt;span lang="en-GB"&gt;earlier&lt;/span&gt; cluster &lt;span lang="en-GB"&gt;set-up&lt;/span&gt; at our site that used scaling. 
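&lt;/span&gt;&lt;/p&gt;  &lt;p style="margin-bottom: 0in;"&gt;As a rough sketch of the calculation above, the following Python snippet derives the scaling factor from per-core HEPSPEC06 figures and shows how long a job can really run before it reaches a scaled limit. The benchmark numbers, the 10 hour limit and the function names are invented for illustration (they are chosen so the factor comes out at 1.86); they are not measurements from our nodes.&lt;/p&gt;  &lt;pre&gt;
```python
def per_core_hs06(system_hs06, logical_cpus):
    """Per-core HEPSPEC06: whole-system benchmark divided by its logical CPUs."""
    return system_hs06 / logical_cpus


def scaling_factor(new_per_core, reference_per_core):
    """Factor to put in $cpumult and $wallmult on the new node type."""
    return new_per_core / reference_per_core


def real_hours_allowed(limit_hours, factor):
    """Real time a job can run before its scaled time reaches the limit."""
    return limit_hours / factor


# Hypothetical whole-system benchmark results (not real measurements).
type_c = per_core_hs06(system_hs06=20.0, logical_cpus=4)  # 2 CPUs x 2 cores
type_d = per_core_hs06(system_hs06=37.2, logical_cpus=4)  # 4 CPUs x 1 core

factor = scaling_factor(type_d, type_c)
print("$cpumult %.2f" % factor)
print("$wallmult %.2f" % factor)
print("real hours allowed on typeD: %.2f" % real_hours_allowed(10.0, factor))
```
&lt;/pre&gt;  &lt;p style="margin-bottom: 0in;"&gt;&lt;span style=""&gt;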
As a more complex &lt;span lang="en-GB"&gt;example&lt;/span&gt;, it would &lt;span lang="en-GB"&gt;be possible &lt;/span&gt; to define some &lt;span lang="en-GB"&gt;notional&lt;/span&gt; &lt;span lang="en-GB"&gt;reference&lt;/span&gt; &lt;span lang="en-GB"&gt;strength&lt;/span&gt; to a round number and scale all CPUs to that value, using a similar procedure as &lt;span lang="en-GB"&gt;above&lt;/span&gt;, but including all the nodes in the cluster, i.e. all nodes would have scaling values. The reference strength would be &lt;/span&gt;abstract.&lt;/p&gt;  &lt;p style="margin-bottom: 0in;"&gt;&lt;b&gt;The power at a site&lt;/b&gt;&lt;/p&gt;  &lt;p style="margin-bottom: 0in;"&gt;The following information is used in conjunction with the number of CPUs at a site,  to calculate the full power.&lt;/p&gt;  &lt;p style="margin-bottom: 0in;"&gt;CE_OTHERDESCR=Cores=1,Benchmark=5.32-HEP-SPEC06&lt;/p&gt;  &lt;p style="margin-bottom: 0in;"&gt;This has two parts. The first, Cores=, is the number of cores in each physical CPU in our system. In our case, it’s exactly 1. But if you have, say, 10 systems with 2 physical CPUs with 2 cores in each, and 5 systems with 4 physical CPUs with 1 core in each, the values would be as follows:&lt;/p&gt;  &lt;p style="margin-bottom: 0in;"&gt;CE_PHYSCPU=(10 x 2) + (5 x 4) = 40&lt;/p&gt; &lt;p style="margin-bottom: 0in;"&gt;CE_LOGCPU=(10 x 2 x 2) + (5 x 4 x 1) = 60&lt;/p&gt;  &lt;p style="margin-bottom: 0in;"&gt;Therefore, at this site, Cores = CE_LOGCPU/CE_PHYSCPU = 60/40 = 1.5&lt;/p&gt;  &lt;p style="margin-bottom: 0in;"&gt;The second part is Benchmark=. In our case, this is the HEP-SPEC06 value from one of our worker nodes. The HEP-SPEC06 value is an estimate of the CPU strength that comes out of a benchmarking program. In our case, it works out at 5.32, and it is an estimate of the power of one logical CPU. This was easy to work out in a cluster with only one type of hardware. 
But if you have the notional cluster described above (10 systems with 2 physical CPUs with 2 cores in each, and 5 systems with 4 physical CPUs with 1 core in each) you’d have to compute it more generically, like this:&lt;/p&gt;  &lt;p style="margin-bottom: 0in;"&gt;Using the benchmarking program, find the value for the total HEP-SPEC06 for each type of system in your cluster. Call these the SYSTEM-HEP-SPEC06 for each system type (alternatively, if the benchmark script allows it, compute the PER-CORE-HEP-SPEC06 value for each type of system, and compute the SYSTEM-HEP-SPEC06 value for that system type by multiplying the PER-CORE-HEP-SPEC06 value by the number of cores in that system).&lt;/p&gt;  &lt;p style="margin-bottom: 0in;"&gt;For each type of system, multiply the SYSTEM-HEP-SPEC06 value by the number of systems of that type that you have, yielding the SYSTEM-TYPE-HEP-SPEC06 value – the total power of all of the systems of that type that you have in your cluster.&lt;/p&gt;  &lt;p style="margin-bottom: 0in;"&gt;Add all the different SYSTEM-TYPE-HEP-SPEC06 values together, giving the TOTAL-HEP-SPEC06 value for your entire cluster.&lt;/p&gt;  &lt;p style="margin-bottom: 0in;"&gt;Divide the TOTAL-HEP-SPEC06 by the CE_LOGCPU value, giving an average strength of a single core. This is the value that goes in the Benchmark= variable, mentioned above. Rationale: this value could be multiplied by the CE_LOGCPU to give the full strength, independently of the types of node.&lt;/p&gt;  &lt;p style="margin-bottom: 0in;"&gt;&lt;i&gt;&lt;b&gt;Stephen Burke explains&lt;/b&gt;&lt;/i&gt;&lt;/p&gt;  &lt;p style="margin-bottom: 0in;"&gt;Stephen explained how you can calculate your full power, like this:  &lt;/p&gt;  &lt;p style="margin-bottom: 0in;"&gt;&lt;i&gt;The installed capacity accounting will calculate your total CPU capacity as LogicalCPUs*Benchmark, so you should make sure that both of those values are right, i.e. 
LogicalCPUs should be the total number of cores in the sub cluster, and Benchmark should be the *average* HEPSPEC for those cores. (And hence Cores*Benchmark is the average power per physical CPU, although the accounting doesn’t use that directly.) That should be the real power of your system, regardless of any scaling in the batch system. &lt;/i&gt; &lt;/p&gt;  &lt;p style="margin-bottom: 0in;"&gt;This means that, if we have the logical CPUs figure right, and the benchmark figure right, then the installed capacity will be published correctly. Basically, Cores * PhysicalCPUs must equal LogicalCPUs, and Cores * Benchmark gives the power per physical CPU.  &lt;/p&gt;  &lt;p style="margin-bottom: 0in;"&gt;&lt;b&gt;Configuration variables for power publishing and accounting&lt;/b&gt;&lt;/p&gt;  &lt;p style="margin-bottom: 0in;"&gt;I described above how the full strength of the cluster can be calculated. We also want to make sure we can calculate the right amount of CPU used by any particular job, via the logs and the scaled times. The relevant configuration variables in site-info.def are:&lt;/p&gt;  &lt;p style="margin-bottom: 0in;"&gt;CE_SI00=1330&lt;/p&gt; &lt;p style="margin-bottom: 0in;"&gt;CE_CAPABILITY=CPUScalingReferenceSI00=1330&lt;/p&gt;  &lt;p style="margin-bottom: 0in;"&gt;On our site, they are both the same (1330). I will discuss that in a moment. But first, read what Stephen Burke said about the matter:&lt;/p&gt;  &lt;p style="margin-bottom: 0in;"&gt;&lt;i&gt;If you want the accounting to be correct you .. have to .. scale the times within the batch system to a common reference. If you don’t … what you are publishing is right, but the accounting for some jobs will be under-reported. Any other scheme would result in some jobs being over-reported, i.e. 
you would be claiming that they got more CPU power than they really did, which is unfair to the users/VOs who submit those jobs.&lt;/i&gt;&lt;/p&gt;  &lt;p style="margin-bottom: 0in;"&gt;In this extract, he is talking about making the accounting right in a heterogeneous cluster. We don’t have one yet, but we’ll cover the issues. He also wrote this:  &lt;/p&gt;  &lt;p style="margin-bottom: 0in;"&gt;&lt;i&gt;You can separately publish the physical power in SI00 and the scaled power used in the batch system in ScalingReferenceSI00.&lt;/i&gt;&lt;/p&gt;  &lt;p style="margin-bottom: 0in;"&gt;Those are the variables used to transmit the physical power and the scaled power.  &lt;/p&gt;&lt;b&gt;&lt;br /&gt;Getting the power publishing right without sub-clustering&lt;/b&gt;  &lt;p style="margin-bottom: 0in;"&gt;First, I’ll discuss CE_SI00 (SI00). This is used to publish the physical computing power. I’ll show how to work out the value for our site.  &lt;/p&gt;  &lt;blockquote style="font-style: italic;"&gt;Note: Someone has decreed that (for the purposes of accounting) 1 x HEPSPEC06 is equal to 250 x SI00 (SpecInt2k). SpecInt2k is another benchmark program, so I call this accounting unit a bogoSI00, because it is distinct from the real, measured SpecInt2k and it is directly related to the real, HEPSPEC06 value.&lt;/blockquote&gt;  &lt;p style="margin-bottom: 0in;"&gt;I want to convert the average, benchmarked HEPSPEC06 strength of one logical CPU (5.32) into bogoSI00s by multiplying it by 250, giving 1330 on our cluster. As shown above, this value is a physical measure of the strength of one logical CPU, and it can be used to get a physical measure of the strength of our entire cluster by multiplying it by the total number of logical CPUs. 
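&lt;/p&gt;  &lt;p style="margin-bottom: 0in;"&gt;The averaging procedure and the conversion to bogoSI00 described above can be sketched in Python as follows. This is only an illustration: the function names are my own, our cluster is modelled as 776 single-core units, and the system benchmark values for the notional mixed cluster are invented.&lt;/p&gt;  &lt;pre&gt;
```python
HS06_TO_SI00 = 250  # accounting convention: 1 HEPSPEC06 = 250 bogoSI00


def average_benchmark(node_types):
    """Average per-core HEPSPEC06 over the whole cluster.

    node_types: list of (number_of_systems, cores_per_system, system_hs06).
    """
    total_hs06 = sum(count * hs06 for count, _cores, hs06 in node_types)
    logical_cpus = sum(count * cores for count, cores, _hs06 in node_types)
    return total_hs06 / logical_cpus


def to_si00(hs06):
    """Convert a HEPSPEC06 value to bogoSI00."""
    return hs06 * HS06_TO_SI00


# Our homogeneous cluster: 776 logical CPUs at 5.32 HEPSPEC06 each.
benchmark = average_benchmark([(776, 1, 5.32)])
ce_si00 = round(to_si00(benchmark))
print("Benchmark=%.2f CE_SI00=%d" % (benchmark, ce_si00))
print("total capacity: %d bogoSI00" % (ce_si00 * 776))

# Notional mixed cluster; the system benchmark values are invented.
mixed = average_benchmark([(10, 4, 20.0), (5, 4, 37.2)])
print("mixed cluster Benchmark=%.2f" % mixed)
```
&lt;/pre&gt;  &lt;p style="margin-bottom: 0in;"&gt;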
The power publishing will be correct when I update the CE_SI00 value in site-info.def, and run YAIM etc.&lt;/p&gt;  &lt;p style="margin-bottom: 0in;"&gt;&lt;b&gt;Getting the accounting right without sub-clustering&lt;/b&gt;&lt;/p&gt;  &lt;p style="margin-bottom: 0in;"&gt;Next, I’ll discuss getting the accounting right, which involves the CE_CAPABILITY=CPUScalingReferenceSI00 variable. This value will be used by the APEL accounting system to work out how much CPU has been provided to a given job. APEL will take the duration of the job, and multiply it by this figure to yield the actual CPU provided. As I made clear above, some worker nodes use a scaling factor to compensate for differences between worker nodes, i.e. a node may report adjusted CPU times, such that all the nodes are comparable. By scaling the times, the logs are adjusted to show the extra work that has been done. If the head node queries the worker node to see if a job has exceeded its wall clock limit, the scaled times are used, to make things consistent. This activity is unnoticeable in the head node and the accounting system.  &lt;/p&gt;  &lt;p style="margin-bottom: 0in;"&gt;There are several possible ways to decide the CPUScalingReferenceSI00 value, and you must choose one of them. I’ll go through them one at a time.&lt;/p&gt;  &lt;ul&gt;&lt;li&gt;&lt;p style="margin-bottom: 0in;"&gt;Case 1: First, imagine a  homogeneous cluster, where all the nodes are the same and no scaling  on the worker nodes takes place at all. In this case, the  CPUScalingReferenceSI00 is the same as the value of one logical CPU,  measured in HEPSPEC06 and expressed as bogoSI00, i.e. 1330 in our  case. 
The accounting is unscaled, all the logical CPUs/slots/cores  are the same, so the reference value is the same as a physical one.&lt;/p&gt; &lt;/li&gt;&lt;/ul&gt; &lt;p style="margin-left: 0.49in; margin-bottom: 0in;"&gt;&lt;i&gt;Example: 1 hour of CPU (scaled by 1.0) multiplied by 1330 = 1330 bogoSI00_hours delivered.&lt;/i&gt;&lt;/p&gt;  &lt;ul&gt;&lt;li&gt;&lt;p style="margin-bottom: 0in;"&gt;Case 2: Next, imagine the same  cluster, with one difference – you use scaling on the worker nodes  to give the node a notional reference strength (e.g. to get round  numbers). I might want all my nodes to have their strength  normalised to 1000 bogoSI00s (4 x HEPSPEC06). I would use a scaling  factor on the worker nodes of 1.33. The time reported for a job  would be real_job_duration * 1.33. Thus, I could then set the  CPUScalingReferenceSI00=1000 to still get accurate accounting  figures.&lt;/p&gt; &lt;/li&gt;&lt;/ul&gt; &lt;p style="margin-left: 0.49in; margin-bottom: 0in;"&gt;&lt;i&gt;Example: 1 hour of CPU (scaled by 1.33) multiplied by 1000 = 1330 bogoSI00_hours delivered, i.e. no change from case 1.&lt;/i&gt;&lt;/p&gt;  &lt;ul&gt;&lt;li&gt;&lt;p style="margin-bottom: 0in;"&gt;Case 3: Next, imagine the cluster  in case 1 (homogeneous cluster, no scaling), where I then add some  new, faster nodes, making a heterogeneous cluster. This happened at  Liverpool, before we split the clusters up. We dealt with this by  using a scaling factor on the new worker nodes, so that they  appeared equivalent in terms of CPU delivery to the existing nodes.  Thus, in that case, no change was required to the  CPUScalingReferenceSI00 value – it remained at 1330.&lt;/p&gt;  &lt;p style="margin-bottom: 0in;"&gt;&lt;i&gt;Example: 1 hour of CPU (scaled  locally by a node-dependent value to make each node equivalent to 1330  bogoSI00) multiplied by 1330 = 1330 bogoSI00_hours delivered. No  change from case 1. 
The scaling is done transparently on the new  nodes.&lt;/i&gt;&lt;/p&gt; &lt;/li&gt;&lt;/ul&gt;  &lt;ul&gt;&lt;li&gt;&lt;p style="margin-bottom: 0in; font-style: normal;"&gt;Case 4:  Another line of attack in a heterogeneous cluster is to use a  different scaling factor on all the different node types, to  normalise the different nodes to the same notional reference  strength. As the examples above show, whatever reference value is  selected, the same number of  bogoSI00_hours is delivered in all  cases, if the scaling values are applied appropriately on the worker  nodes to make them equivalent to the reference strength chosen.&lt;/p&gt; &lt;/li&gt;&lt;/ul&gt;  &lt;p style="margin-bottom: 0in;"&gt;&lt;b&gt;APEL changes to make this possible&lt;/b&gt;&lt;/p&gt;  &lt;p style="margin-bottom: 0in;"&gt;Formerly, APEL looked only at the SI00 value. If you scaled in a heterogeneous cluster, you could choose to have your strength correct or your accounting correct but not both.&lt;/p&gt;  &lt;p style="margin-bottom: 0in;"&gt;So APEL has been changed to alter the way the data is parsed out. This allows us to pass the scaling reference value in a variable called CPUScalingReferenceSI00. 
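&lt;/p&gt;  &lt;p style="margin-bottom: 0in;"&gt;To see why the choice of reference value does not change the accounting, the four cases discussed earlier can be checked with a small Python sketch. It assumes, as described above, that the worker node reports real time multiplied by its scaling factor and that the accounting multiplies the reported time by the reference value. The function name and the strength of the faster node (2000 bogoSI00) are invented for illustration.&lt;/p&gt;  &lt;pre&gt;
```python
def delivered_si00_hours(real_hours, node_si00, reference_si00):
    """bogoSI00 hours that the accounting records for one job.

    The worker node reports real time multiplied by its scaling factor
    (node strength / reference strength); the accounting then multiplies
    the reported time by the reference value.
    """
    scaling = node_si00 / reference_si00
    reported_hours = real_hours * scaling
    return reported_hours * reference_si00


# Case 1: no scaling, the reference equals the node strength.
print(delivered_si00_hours(1.0, node_si00=1330, reference_si00=1330))
# Case 2: same node, normalised to a round reference of 1000.
print(delivered_si00_hours(1.0, node_si00=1330, reference_si00=1000))
# Cases 3 and 4: a faster node (invented strength) against either reference.
print(delivered_si00_hours(1.0, node_si00=2000, reference_si00=1330))
print(delivered_si00_hours(1.0, node_si00=2000, reference_si00=1000))
```
&lt;/pre&gt;  &lt;p style="margin-bottom: 0in;"&gt;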
You may need to update your MON box to make use of this feature.&lt;/p&gt;  &lt;p style="margin-bottom: 0in;"&gt;Written by Steve with help from Rob and John, Liverpool&lt;/p&gt; &lt;p style="margin-bottom: 0in;"&gt;19 April 2010&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-4312673632861897113?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/4312673632861897113/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=4312673632861897113' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/4312673632861897113'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/4312673632861897113'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2010/04/scaling-capacity-publishing-and.html' title='Scaling, capacity publishing and accounting'/><author><name>Steve Jones</name><uri>http://www.blogger.com/profile/01633352566579646751</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-136606633161875470</id><published>2010-03-19T09:02:00.007Z</published><updated>2010-03-19T16:47:19.674Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='Manchester HammerCloud results'/><title type='text'>Latest HC tests in Manchester</title><content type='html'>While waiting for the storage that we are buying with the current tender that will bring us to 320 TB of usable space we are 
fixing the configuration to optimise access to the current 80TB.&lt;br /&gt;So we have cabled the 4 data servers in the configuration they and their peers will eventually have. The last test&lt;br /&gt;&lt;br /&gt;&lt;a href="http://gangarobot.cern.ch/hc/1203/test"&gt;http://gangarobot.cern.ch/hc/1203/test&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;showed some progress.&lt;br /&gt;&lt;br /&gt;We ran it for 12 hours and had 99% overall efficiency. Compared to test &lt;br /&gt;&lt;br /&gt;&lt;a href="http://gangarobot.cern.ch/hc/991/test"&gt;http://gangarobot.cern.ch/hc/991/test&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;the other metrics also look slightly better. The most noticeable thing, rather than the plain mean values, is the shape of the cpu/wall clock time and events/wall clock histograms. They are much healthier, with a bell shape instead of a U shape (i.e. the cpu/wall clock behaviour in particular is more predictable, and in this test the tail of jobs towards zero is drastically reduced). This is only one test and we are still affected by a bad distribution of data in DPM, which is still mostly concentrated on 2 servers out of 4. There are also other things we can tweak to optimize access. 
The next steps with the same test (for comparison) are:&lt;br /&gt;&lt;br /&gt;1) Spread the data more evenly across the data servers if we can; se04 was hammered for a good while and had a load of 80-100 for a few hours according to nagios.&lt;br /&gt;&lt;br /&gt;2) Increase the number of jobs that can run at the same time.&lt;br /&gt;&lt;br /&gt;3) Look at the distribution of jobs on the WNs. This might be useful to know when we have 8 cores rather than two.&lt;br /&gt;&lt;br /&gt;4) Look at the job distribution in time.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-136606633161875470?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/136606633161875470/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=136606633161875470' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/136606633161875470'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/136606633161875470'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2010/03/latest-hc-tests-in-manchester.html' title='Latest HC tests in Manchester'/><author><name>Alessandra Forti</name><uri>http://www.blogger.com/profile/11973932320387024088</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' 
src='http://4.bp.blogspot.com/_nB3QAxoLs5o/SUfJ_bcRvXI/AAAAAAAAARk/nlOg9eaXhaY/S220/patyten_seaOttersSwim.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-4308118186354703337</id><published>2009-10-15T11:42:00.014Z</published><updated>2009-10-15T17:03:33.604Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='Manchester'/><title type='text'>Squid cache for atlas on a 32bit machine</title><content type='html'>I installed the squid cache for atlas on an SL5 32bit machine. There are no 32bit rpms from the project. There is a default OS squid rpm but it is apparently buggy and the request is to install a 2.7-STABLE7 version. So I got the source rpm from here&lt;br /&gt;&lt;br /&gt;&lt;a href="https://twiki.cern.ch/twiki/bin/view/PDBService/SquidRPMsTier1andTier2"&gt;https://twiki.cern.ch/twiki/bin/view/PDBService/SquidRPMsTier1andTier2&lt;br /&gt;&lt;/a&gt;&lt;br /&gt;rpmbuild --rebuild &lt;a href="http://grid-deployment.web.cern.ch/grid-deployment/flavia/frontier-squid-2.7.STABLE7-4.sl5.src.rpm" target="_top"&gt;frontier-squid-2.7.STABLE7-4.sl5.src.rpm&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;This will compile squid for your system and create a binary rpm in&lt;br /&gt;&lt;br /&gt;/usr/src/redhat/RPMS/i386/frontier-squid-2.7.STABLE7-4.sl5.i386.rpm&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:85%;"&gt;&lt;span style="font-weight: bold;"&gt;rpm -ihv /usr/src/redhat/RPMS/i386/frontier-squid-2.7.STABLE7-4.sl5.i386.rpm&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;It will install everything in /home/squid; apparently it is relocatable, but I don't mind the location so I left it.&lt;br /&gt;&lt;br /&gt;Edit &lt;span style="font-weight: bold;"&gt;/home/squid/etc/squid.conf&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Not everything you find in the BNL instructions is necessary. 
Here is my list of changes:&lt;br /&gt;&lt;span style="font-size:85%;"&gt; &lt;span style="font-weight: bold;"&gt;&lt;br /&gt;acl SUBNET-NAME src SUBNET-IPS &lt;/span&gt;&lt;span&gt;&amp;lt;---- there are different ways to express this&lt;/span&gt;&lt;span style="font-weight: bold;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="font-weight: bold;"&gt;http_access allow SUBNET-NAME&lt;br /&gt;hierarchy_stoplist cgi-bin ?&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size:85%;"&gt;&lt;span style="font-weight: bold;"&gt;cache_mem 256 MB&lt;/span&gt; &lt;span style="font-weight: bold;"&gt;&lt;br /&gt;maximum_object_size_in_memory 128 KB&lt;/span&gt; &lt;span style="font-weight: bold;"&gt;&lt;br /&gt;cache_dir ufs /home/squid/var/cache 100000 16 256&lt;/span&gt; &lt;span style="font-weight: bold;"&gt;&lt;br /&gt;maximum_object_size 1048576 KB&lt;/span&gt; &lt;span style="font-weight: bold;"&gt;&lt;br /&gt;update_headers off&lt;/span&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;cache_log /home/squid/var/logs/cache.log&lt;/span&gt; &lt;span style="font-weight: bold;"&gt;&lt;br /&gt;cache_store_log none&lt;/span&gt; &lt;span style="font-weight: bold;"&gt;&lt;br /&gt;strip_query_terms off&lt;/span&gt; &lt;span style="font-weight: bold;"&gt;&lt;br /&gt;refresh_pattern -i /cgi-bin/ 0        0%      0&lt;/span&gt; &lt;span style="font-weight: bold;"&gt;&lt;br /&gt;cache_effective_user squid &lt;/span&gt;&lt;span&gt;&amp;lt;--- the default is nobody, which doesn't have access to /home/squid&lt;/span&gt;&lt;span style="font-weight: bold;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="font-weight: bold;"&gt;icp_port 0&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Edit &lt;span style="font-weight: bold;"&gt;/home/squid/sbin/fn-local-squid.sh&lt;/span&gt;&lt;br /&gt;and add these two lines&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:85%;"&gt;&lt;span style="font-weight: bold;"&gt;# chkconfig: - 99 21&lt;/span&gt; &lt;span style="font-weight: bold;"&gt;&lt;br /&gt;# description: 
Squid cache startup script&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;then&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;&lt;span style="font-size:85%;"&gt;ln -s /home/squid/sbin/fn-local-squid.sh /etc/init.d/squid&lt;br /&gt;chkconfig --add squid&lt;br /&gt;chkconfig squid on&lt;/span&gt;&lt;span style="font-weight: bold;"&gt;&lt;span style="font-weight: bold;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;Write to Rod Walker to authorize your machine on the gridka Frontier server (until RAL is up, that's the server for Europe). If you can set up an alias for the machine, do it before writing to him.&lt;br /&gt;&lt;br /&gt;To test the setup:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:85%;"&gt;&lt;span style="font-weight: bold;"&gt;wget http://frontier.cern.ch/dist/fnget.py&lt;br /&gt;&lt;/span&gt;&lt;span style="font-weight: bold;"&gt;&lt;/span&gt;&lt;span style="font-weight: bold;"&gt;export http_proxy=http://YOUR-SQUID-CACHE:3128&lt;/span&gt; &lt;span style="font-weight: bold;"&gt;&lt;br /&gt;python fnget.py --url=http://atlassq1-fzk.gridka.de:8021/fzk/Frontier --sql="SELECT TABLE_NAME FROM ALL_TABLES"&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;If you get many lines similar to those below&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:85%;"&gt;&lt;span style="font-weight: bold;"&gt;COMP200_F0027_TAGS_SEQ&lt;/span&gt;&lt;span style="font-weight: bold;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="font-weight: bold;"&gt;COMP200_F0037_IOVS_SEQ&lt;/span&gt;&lt;span style="font-weight: bold;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="font-weight: bold;"&gt;COMP200_F0020_IOVS_SEQ&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;your cache is working.&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-4308118186354703337?l=northgrid-tech.blogspot.com' alt='' 
/&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/4308118186354703337/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=4308118186354703337' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/4308118186354703337'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/4308118186354703337'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2009/10/squid-cache-32bit.html' title='Squid cache for atlas on a 32bit machine'/><author><name>Alessandra Forti</name><uri>http://www.blogger.com/profile/11973932320387024088</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://4.bp.blogspot.com/_nB3QAxoLs5o/SUfJ_bcRvXI/AAAAAAAAARk/nlOg9eaXhaY/S220/patyten_seaOttersSwim.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-6175977911467682290</id><published>2009-08-19T11:23:00.003Z</published><updated>2009-08-19T11:38:36.842Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='Manchester BDII CE'/><title type='text'>Manchester update</title><content type='html'>I fixed the site bdii problem, i.e. the site static information had 'disappeared'. It didn't actually disappear; it was just declared under mds-vo-name=resource instead of mds-vo-name=UKI-NORTHGRID-MAN-HEP and therefore gstat couldn't find it. This was due to a conflict between the rgma and site bdii. The rgma bdii (that didn't exist in very old versions) needs to be declared in the BDII_REGIONS in YAIM. 
I knew this but completely forgot that I had already fixed it when I reinstalled the machine a few months ago, so I spent a delightful afternoon parsing ldif files and ldap output, hacked the ldif, sort of fixed it and then asked for a proper solution. So... here we go, I'm writing it down this time so I can google it for myself. On the positive side I have now upgraded the site and top bdii and the resource bdii on the CEs to the latest version. So we now have shiny new attributes like Spec2006 &amp;amp; Co.&lt;br /&gt;&lt;br /&gt;I also upgraded the CEs to try to fix the random instability problem which afflicts us. However, I upgraded online without reinstalling everything, and it makes me a bit nervous to think that some files that needed changing might not have been edited because they already existed. So I will completely reinstall the CEs starting with ce01 today.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-6175977911467682290?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/6175977911467682290/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=6175977911467682290' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/6175977911467682290'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/6175977911467682290'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2009/08/manchester-update.html' title='Manchester update'/><author><name>Alessandra Forti</name><uri>http://www.blogger.com/profile/11973932320387024088</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' 
width='32' height='24' src='http://4.bp.blogspot.com/_nB3QAxoLs5o/SUfJ_bcRvXI/AAAAAAAAARk/nlOg9eaXhaY/S220/patyten_seaOttersSwim.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-3501646247591867142</id><published>2009-05-05T15:42:00.006Z</published><updated>2009-05-05T16:05:08.362Z</updated><title type='text'>Howto publish Users DNs accounting records</title><content type='html'>To publish the User DN records in the accounting you should add to your site-info.def the following line&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;APEL_PUBLISH_USER_DN="yes"&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;and reconfigure your MON box. This will change the parser configuration file&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;/opt/glite/etc/glite-apel-publisher/publisher-config-yaim.xml&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;replacing this line&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;&amp;lt;JoinProcessor publishGlobalUserName="no"&amp;gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;with this one&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;&amp;lt;JoinProcessor publishGlobalUserName="yes"&amp;gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;this will 
affect only the new records.&lt;br /&gt;&lt;br /&gt;If you want to republish everything you need to replace in the same file&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt; &amp;lt;Republish&amp;gt;missing&amp;lt;/Republish&amp;gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;with this line using the appropriate dates&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;&amp;lt;Republish recordStart="2006-02-01" recordEnd="2006-04-25"&amp;gt;gap&amp;lt;/Republish&amp;gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;and publish a chunk of data at a time. The documentation suggests one month at a time to avoid running out of memory. When you have finished, put back the line&lt;br /&gt;&lt;span style="font-weight: bold;"&gt; &amp;lt;Republish&amp;gt;missing&amp;lt;/Republish&amp;gt;&lt;br /&gt;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-3501646247591867142?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/3501646247591867142/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=3501646247591867142' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/3501646247591867142'/><link rel='self' type='application/atom+xml' 
href='http://www.blogger.com/feeds/4670756400590062347/posts/default/3501646247591867142'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2009/05/howto-publish-users-dns-accounting.html' title='Howto publish Users DNs accounting records'/><author><name>Alessandra Forti</name><uri>http://www.blogger.com/profile/11973932320387024088</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://4.bp.blogspot.com/_nB3QAxoLs5o/SUfJ_bcRvXI/AAAAAAAAARk/nlOg9eaXhaY/S220/patyten_seaOttersSwim.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-906325256879717498</id><published>2009-04-29T15:56:00.003Z</published><updated>2009-04-29T16:14:41.493Z</updated><title type='text'>NFS bug in older SL5 kernel</title><content type='html'>As mentioned previously ( http://northgrid-tech.blogspot.com/2009/03/replaced-nfs-servers.html ) we have recently upgraded our NFS servers and they now run on SL5. Shortly after going into production all LHCb jobs stalled at Manchester and we were blacklisted by the VO.&lt;br /&gt;&lt;br /&gt;We were advised that it might be a lockd error, and asked to use the following python code to diagnose this:&lt;br /&gt;&lt;br /&gt;&lt;span class="solution"&gt;-------------------------------------------------------------------&lt;br /&gt;import fcntl&lt;br /&gt;fp = open("lock-test.txt", "a")&lt;br /&gt;fcntl.lockf(fp.fileno(), fcntl.LOCK_EX|fcntl.LOCK_NB)&lt;br /&gt;-------------------------------------------------------------------&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The code did not give any errors and we therefore discounted this as the problem. Wind the clock on a fortnight (including a week's holiday over Easter) and we still had not found the problem, so I tried the above code again and, bingo, lockd was the problem. 
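&lt;br /&gt;&lt;br /&gt;Since the same test passed once and failed later, a variant that reports the outcome explicitly may be easier to run repeatedly from a script. The following is only a sketch: the try_lock helper and its return strings are assumptions for illustration, not part of the code we were sent.&lt;br /&gt;&lt;br /&gt;

```python
import errno
import fcntl


def try_lock(path):
    """Try a non-blocking exclusive lock on path and describe the outcome."""
    try:
        with open(path, "a") as fp:
            fcntl.lockf(fp.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)
            fcntl.lockf(fp.fileno(), fcntl.LOCK_UN)
        return "ok"
    except (IOError, OSError) as exc:
        if exc.errno in (errno.EACCES, errno.EAGAIN):
            return "busy"  # another process already holds the lock
        return "error: %s" % exc  # any other failure, for instance ENOLCK


if __name__ == "__main__":
    # Same file name as in the original test; run it from the NFS-mounted area.
    print(try_lock("lock-test.txt"))
```

Running it from a directory on the NFS mount, on the worker node where jobs stall, exercises the same lockf call as the original snippet.&lt;br /&gt;&lt;br /&gt;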
A quick search of the SL mailing list pointed me to this kernel bug:&lt;br /&gt;https://bugzilla.redhat.com/show_bug.cgi?id=459083&lt;br /&gt;&lt;br /&gt;A quick kernel update and a reboot later, the problem was fixed.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-906325256879717498?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/906325256879717498/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=906325256879717498' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/906325256879717498'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/906325256879717498'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2009/04/nfs-bug-in-older-sl5-kkernel.html' title='NFS bug in older SL5 kernel'/><author><name>Jimmy Cullen</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://lh3.google.com/image/jimmy.cullen/RVb4_taaABI/AAAAAAAAAe8/tQ6Xol7KyQ0/SA400028.JPG?imgmax=512'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-4393511879205941284</id><published>2009-04-03T12:50:00.008Z</published><updated>2009-04-03T13:15:07.145Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='Manchester'/><title type='text'>Fixed MPI installation</title><content type='html'>A few months ago we installed MPI using glite packages and YAIM.&lt;br /&gt;&lt;br /&gt;&lt;a 
href="http://northgrid-tech.blogspot.com/2008/11/mpi-enabled.html"&gt;http://northgrid-tech.blogspot.com/2008/11/mpi-enabled.html&lt;br /&gt;&lt;/a&gt;&lt;br /&gt;We never really tested it until now, though. We have found a few problems with YAIM:&lt;br /&gt;&lt;br /&gt;YAIM creates an mpirun script that assumes ./ is in the path, so the job was landing on the WN but mpirun couldn't find the user script/executable. I corrected it by prepending `pwd`/ to the script arguments at the end of the script, so it runs `pwd`/$@ instead of $@. I added this using the yaim post functionality.&lt;br /&gt;&lt;br /&gt;The if else statement that is used to build MPIEXEC_PATH is written in a contorted way and needs to be corrected. For example:&lt;br /&gt;&lt;br /&gt;1) MPI_MPIEXEC_PATH is used in the if, but YAIM doesn't write it in any system file that sets env variables, like grid-env.sh, where the other MPI_* variables are set.&lt;br /&gt;&lt;br /&gt;2) In the else statement there is a hardcoded path, which is actually obtained by splitting the mpiexec executable that MPI_MPICH_MPIEXEC points to from its directory.&lt;br /&gt;&lt;br /&gt;3) YAIM doesn't rewrite mpirun once it's written, so the hardcoded path can't be changed by reconfiguring the node without manually deleting mpirun first. This makes it difficult to update or correct mistakes.&lt;br /&gt;&lt;br /&gt;4) The existence of MPIEXEC_PATH is not checked and it should be.&lt;br /&gt;&lt;br /&gt;Anyway, eventually we managed to run mpi jobs and we reported what we had done to the new TMB MPI working group, because another site was experiencing the same problems. Hopefully they will correct these problems. 
Special thanks go to Chris Glasman who hunted down the initial problem with the path and patiently tested the changes we applied.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-4393511879205941284?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/4393511879205941284/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=4393511879205941284' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/4393511879205941284'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/4393511879205941284'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2009/04/fixed-mpi-installation.html' title='Fixed MPI installation'/><author><name>Alessandra Forti</name><uri>http://www.blogger.com/profile/11973932320387024088</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://4.bp.blogspot.com/_nB3QAxoLs5o/SUfJ_bcRvXI/AAAAAAAAARk/nlOg9eaXhaY/S220/patyten_seaOttersSwim.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-2234922245811652604</id><published>2009-03-25T14:53:00.003Z</published><updated>2009-03-25T15:02:53.500Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='Manchester'/><title type='text'>New Storage and atlas space tokens</title><content type='html'>We have finally installed all the units. They provide ~84TB of usable space. 
42TB are dedicated to atlas space tokens, the other 42TB are shared for now but will be moved into atlas space tokens when we see more usage.&lt;br /&gt;&lt;br /&gt;We also have finally enabled all the space tokens requested by atlas. They are waiting to be inserted in Tier Of Atlas but below I report what we publish in the BDII.&lt;br /&gt;&lt;br /&gt;ldapsearch -x -H ldap://site-bdii.tier2.hep.manchester.ac.uk:2170 -b o=grid '(GlueSALocalID=atlas*)' GlueSAStateAvailableSpace GlueSAStateUsedSpace| grep Glu&lt;br /&gt;dn: GlueSALocalID=atlas,GlueSEUniqueID=dcache01.tier2.hep.manchester.ac.uk,mds&lt;br /&gt;GlueSAStateAvailableSpace: 33411318000&lt;br /&gt;GlueSAStateUsedSpace: 21533521683&lt;br /&gt;dn: GlueSALocalID=atlas,GlueSEUniqueID=dcache02.tier2.hep.manchester.ac.uk,mds&lt;br /&gt;GlueSAStateAvailableSpace: 48171274000&lt;br /&gt;GlueSAStateUsedSpace: 4168774302&lt;br /&gt;dn: GlueSALocalID=atlas:ATLASGROUPDISK:online,GlueSEUniqueID=bohr3223.tier2.he&lt;br /&gt;GlueSAStateAvailableSpace: 1610612610&lt;br /&gt;GlueSAStateUsedSpace: 125&lt;br /&gt;dn: GlueSALocalID=atlas:ATLASPRODDISK:online,GlueSEUniqueID=bohr3223.tier2.hep&lt;br /&gt;GlueSAStateAvailableSpace: 2154863252&lt;br /&gt;GlueSAStateUsedSpace: 44160003&lt;br /&gt;dn: GlueSALocalID=atlas:ATLASSCRATCHDISK:online,GlueSEUniqueID=bohr3223.tier2.&lt;br /&gt;GlueSAStateAvailableSpace: 3298534820&lt;br /&gt;GlueSAStateUsedSpace: 62&lt;br /&gt;dn: GlueSALocalID=atlas:ATLASLOCALGROUPDISK:online,GlueSEUniqueID=bohr3223.tie&lt;br /&gt;GlueSAStateAvailableSpace: 3298534883&lt;br /&gt;GlueSAStateUsedSpace: 0&lt;br /&gt;dn: GlueSALocalID=atlas:ATLASDATADISK:online,GlueSEUniqueID=bohr3223.tier2.hep&lt;br /&gt;GlueSAStateAvailableSpace: 8052760941&lt;br /&gt;GlueSAStateUsedSpace: 302738&lt;br /&gt;dn: GlueSALocalID=atlas,GlueSEUniqueID=bohr3223.tier2.hep.manchester.ac.uk,mds&lt;br /&gt;GlueSAStateAvailableSpace: 28580000000&lt;br /&gt;GlueSAStateUsedSpace: 709076288&lt;br /&gt;dn: 
GlueSALocalID=atlas:ATLASMCDISK:online,GlueSEUniqueID=bohr3223.tier2.hep.m&lt;br /&gt;GlueSAStateAvailableSpace: 3298534758&lt;br /&gt;GlueSAStateUsedSpace: 125&lt;br /&gt;dn: GlueSALocalID=atlas:ATLASPRODDISK:online,GlueSEUniqueID=bohr3226.tier2.hep&lt;br /&gt;GlueSAStateAvailableSpace: 2199023130&lt;br /&gt;GlueSAStateUsedSpace: 125&lt;br /&gt;dn: GlueSALocalID=atlas:ATLASLOCALGROUPDISK:online,GlueSEUniqueID=bohr3226.tie&lt;br /&gt;GlueSAStateAvailableSpace: 3298534883&lt;br /&gt;GlueSAStateUsedSpace: 0&lt;br /&gt;dn: GlueSALocalID=atlas:ATLASGROUPDISK:online,GlueSEUniqueID=bohr3226.tier2.he&lt;br /&gt;GlueSAStateAvailableSpace: 1610612610&lt;br /&gt;GlueSAStateUsedSpace: 125&lt;br /&gt;dn: GlueSALocalID=atlas:ATLASMCDISK:online,GlueSEUniqueID=bohr3226.tier2.hep.m&lt;br /&gt;GlueSAStateAvailableSpace: 3298534758&lt;br /&gt;GlueSAStateUsedSpace: 125&lt;br /&gt;dn: GlueSALocalID=atlas:ATLASDATADISK:online,GlueSEUniqueID=bohr3226.tier2.hep&lt;br /&gt;GlueSAStateAvailableSpace: 8053063554&lt;br /&gt;GlueSAStateUsedSpace: 125&lt;br /&gt;dn: GlueSALocalID=atlas:ATLASSCRATCHDISK:online,GlueSEUniqueID=bohr3226.tier2.&lt;br /&gt;GlueSAStateAvailableSpace: 3298534820&lt;br /&gt;GlueSAStateUsedSpace: 62&lt;br /&gt;dn: GlueSALocalID=atlas,GlueSEUniqueID=bohr3226.tier2.hep.manchester.ac.uk,mds&lt;br /&gt;GlueSAStateAvailableSpace: 35730390000&lt;br /&gt;GlueSAStateUsedSpace: 1172758&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-2234922245811652604?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/2234922245811652604/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=2234922245811652604' title='0 Comments'/><link rel='edit' type='application/atom+xml' 
href='http://www.blogger.com/feeds/4670756400590062347/posts/default/2234922245811652604'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/2234922245811652604'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2009/03/new-storage-and-atlas-space-tokens.html' title='New Storage and atlas space tokens'/><author><name>Alessandra Forti</name><uri>http://www.blogger.com/profile/11973932320387024088</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://4.bp.blogspot.com/_nB3QAxoLs5o/SUfJ_bcRvXI/AAAAAAAAARk/nlOg9eaXhaY/S220/patyten_seaOttersSwim.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-7036529689190416924</id><published>2009-03-24T14:41:00.004Z</published><updated>2009-03-24T16:14:59.465Z</updated><title type='text'>Replaced NFS servers</title><content type='html'>The NFS servers have been replaced in Manchester with two more powerful machines and two 1TB raided SATA disks. 
This should hopefully put a stop to the space problems we have suffered in the past few months with both atlas and lhcb, and should also allow us to keep a few more releases than before.&lt;br /&gt;&lt;br /&gt;We also have nice nagios graphs to monitor the space now, as well as cfengine alerts.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://tinyurl.com/d5n7eo"&gt;http://tinyurl.com/d5n7eo&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-7036529689190416924?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/7036529689190416924/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=7036529689190416924' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/7036529689190416924'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/7036529689190416924'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2009/03/replaced-nfs-servers.html' title='Replaced NFS servers'/><author><name>Alessandra Forti</name><uri>http://www.blogger.com/profile/11973932320387024088</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://4.bp.blogspot.com/_nB3QAxoLs5o/SUfJ_bcRvXI/AAAAAAAAARk/nlOg9eaXhaY/S220/patyten_seaOttersSwim.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-4064628514156094700</id><published>2009-03-05T08:40:00.006Z</published><updated>2009-03-05T08:59:56.591Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='Lancaster'/><category 
scheme='http://www.blogger.com/atom/ns#' term='machine room'/><title type='text'>Machine room update</title><content type='html'>After some sweet-talking we managed to get two extra air-con units installed in our old machine room. This room houses our 2005 cluster and our more recent CPU and storage purchased last year. The extra cooling was noticeable and allowed us to switch on a couple of racks which were otherwise offline.&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_mw1Yh1GNWVc/Sa-QVOlqWsI/AAAAAAAAAYo/ywd7R7Vp2QY/s1600-h/temp.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 381px; height: 154px;" src="http://4.bp.blogspot.com/_mw1Yh1GNWVc/Sa-QVOlqWsI/AAAAAAAAAYo/ywd7R7Vp2QY/s400/temp.png" alt="" id="BLOGGER_PHOTO_ID_5309621180060818114" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;In other news, the new data centre is coming along nicely and will be ready for handover in 3/4 months from now. 
If you're ever racing past Lancaster on the M6 you'll get a good view of the Borg mothership on the hill; the sleek black cladding is going up now...&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_mw1Yh1GNWVc/Sa-T-u6jgFI/AAAAAAAAAY4/239E75Ug3ig/s1600-h/lancs-iss.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 320px; height: 245px;" src="http://3.bp.blogspot.com/_mw1Yh1GNWVc/Sa-T-u6jgFI/AAAAAAAAAY4/239E75Ug3ig/s320/lancs-iss.png" alt="" id="BLOGGER_PHOTO_ID_5309625191647903826" border="0" /&gt;&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-4064628514156094700?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/4064628514156094700/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=4064628514156094700' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/4064628514156094700'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/4064628514156094700'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2009/03/machine-room-update.html' title='Machine room update'/><author><name>Peter</name><uri>http://www.blogger.com/profile/05855046025692405834</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' 
url='http://4.bp.blogspot.com/_mw1Yh1GNWVc/Sa-QVOlqWsI/AAAAAAAAAYo/ywd7R7Vp2QY/s72-c/temp.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-1997546283098976296</id><published>2009-02-13T14:15:00.002Z</published><updated>2009-02-13T15:05:43.852Z</updated><title type='text'>This week's DPM troubles at Lancaster.</title><content type='html'>We've had an interesting time this week in Lancaster, a tale of Gremlins, Greedy Daemons and Magical Faeries who come in the night and fix your DPM problems. &lt;br /&gt;&lt;br /&gt;On Tuesday evening, when we'd all gone home for the night, the DPM srmv1 daemon (and to a lesser extent the srmv2.2 and dpm daemons) started gobbling up system resources, sending our headnode into a swapping frenzy. There are known memory leak problems in the DPM code, and we've been victims of them before, but in those instances we've always been saved by a swift restart of the affected services and the worst that happened was a sluggish DPM. This time the DPM services completely froze up, and around 7 pm we started failing tests.&lt;br /&gt;&lt;br /&gt;So coming into this disaster on Wednesday morning we leaped into action. Restarting the services fixed the load on the headnode, but the DPM still wouldn't work. Checking the logs showed that all requests were being queued, apparently forever. The trail led to some error messages in the mysqld.log:&lt;br /&gt;&lt;br /&gt;090211 12:05:37 [ERROR] /usr/libexec/mysqld: Lock wait timeout exceeded;&lt;br /&gt;try restarting transaction&lt;br /&gt;090211 12:05:37 [ERROR] /usr/libexec/mysqld: Sort aborted&lt;br /&gt;&lt;br /&gt;The oracle Google suggested that this kind of error is indicative of a mysql server in a bad state after suddenly losing its connection to a client but not accounting for this. 
Various restarts, reboots and threats were used, but nothing would get the dpm working and we had to go into downtime.&lt;br /&gt;&lt;br /&gt;Rather than dive blindly into the bowels of the DPM backend mysql we got in contact with the DPM developers on the DPM support list. They were really quick to respond, and after receiving 40MB of (zipped!) log files from us set to work developing a strategy to fix us. It appears that our mysql had grown much larger than it should have, "bloating" with historical data, which contributed to it getting into a bad state and made the task of repairing the database harder, partly as we simply couldn't restore from backups as these too would be "bloated".&lt;br /&gt;&lt;br /&gt;After a while of bashing our heads, scouring logs and waiting for news from the DPM chaps we decided to make use of the downtime and upgrade the RAM on our headnode to 4 GB (from 2), a task we had been saving for the scheduled downtime when we finally upgrade to the Holy Grail that is DPM 1.7.X. So we slapped in the RAM, brought the machine up cleanly, and left it.&lt;br /&gt;&lt;br /&gt;A bit over an hour after it came back up following the upgrade, the headnode started working again. As if by magic. Nothing notable in the logs, it just started working again. The theory is that the added RAM allowed the mysql to chug through a backlog of requests and start working again. But that's just speculation. 
The dpm chaps are still puzzling over what happened, and our databases are still bloated, but the crisis is over (for now).&lt;br /&gt;&lt;br /&gt;So there are 2 morals to this tale:&lt;br /&gt;1) I wouldn't advise running a busy DPM headnode with less than 4GB of RAM; it leads to unpredictable behaviour.&lt;br /&gt;2) If you get stuck in an Unscheduled Downtime you might as well make use of it to do any outstanding work; you never know when something magical might happen!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-1997546283098976296?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/1997546283098976296/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=1997546283098976296' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/1997546283098976296'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/1997546283098976296'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2009/02/this-weeks-dpm-troubles-at-lancaster.html' title='This week&apos;s DPM troubles at Lancaster.'/><author><name>Matt</name><uri>http://www.blogger.com/profile/16514316941712112895</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-1063735652524383073</id><published>2009-02-09T07:38:00.003Z</published><updated>2009-02-09T08:04:44.704Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='Lancaster'/><category 
scheme='http://www.blogger.com/atom/ns#' term='jobmanager'/><title type='text'>Jobmanager pbsqueue cache locked</title><content type='html'>Spent last week tracking down a problem where jobs were finishing in the batch system but the jobmanager wasn't recognizing this. This meant that jobs never 'completed', which had two major impacts: 1. &lt;a href="http://pprc.qmul.ac.uk/%7Elloyd/gridpp/atest.html"&gt;Steve's test jobs&lt;/a&gt; all failed through timeouts and 2. Atlas production stopped because it looked like the pilots never completed and no further pilots were sent.&lt;br /&gt;&lt;br /&gt;Some serious detective work was undertaken by Maarten and Andrey and it turned out the pbscache wasn't being updated due to a stale lock file in the ~/.lcgjm/ directory. The lock files can be found with this on the CE:&lt;br /&gt;&lt;br /&gt;find /home/*/.lcgjm/pbsqueue.cache.proc.{hold.localhost.*,locked} -mtime +7 -ls&lt;br /&gt;&lt;br /&gt;We had 6 users affected (alas, our important ones!), all with lock files dated Dec 22. Apparently the lcgpbs script Helper.pm would produce these whenever hostname returned 'localhost'. Yes, on December 22 we had &lt;a href="https://goc.gridops.org/downtime/list?id=15505351"&gt;maintenance work&lt;/a&gt; with DHCP unavailable, and for some brief period the CE hostname was 'localhost'. Note this is lcg-CE under glite-3.1. 
Happy days are here again!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-1063735652524383073?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/1063735652524383073/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=1063735652524383073' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/1063735652524383073'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/1063735652524383073'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2009/02/jobmanager-pbsqueue-cache-locked.html' title='Jobmanager pbsqueue cache locked'/><author><name>Peter</name><uri>http://www.blogger.com/profile/05855046025692405834</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-1293508126716009727</id><published>2008-12-17T22:35:00.012Z</published><updated>2009-02-03T16:20:17.332Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='Manchester'/><title type='text'>Manchester General Changes</title><content type='html'>* Enabled users pilots for Atlas and Lhcb. Currently Lhcb is running a lot of jobs and although most is production many are from their generic users. Atlas instead seems to have almost disappeared.&lt;br /&gt;&lt;br /&gt;* Enabled NGS VO and passed the first tests. 
Currently in the conformance test week.&lt;br /&gt;&lt;br /&gt;* Enabled one shared queue and completely phased out the VO queues. This has required a transition period for some VOs to give them time to clear their jobs from the old queues and/or to reconfigure their tools. This has greatly simplified the maintenance.&lt;br /&gt;&lt;br /&gt;* Installed a top-level BDII and reconfigured the nodes to query the local top-level BDII instead of the RAL one. This was actually quite easy and we should have done it earlier.&lt;br /&gt;&lt;br /&gt;* Cleaned up old parts of cfengine that were causing the servers to be overloaded, not serve the nodes correctly and fire off thousands of emails a day. Mostly this was due to an overlap in the way cfexecd was run both as a cron job and as a daemon. However we also increased the TimeOut and SplayTime values and explicitly introduced the schedule parameter in cfagent.conf. Since then cfengine hasn't had any more problems.&lt;br /&gt;&lt;br /&gt;* Increased usage of YAIM local/post functions to apply local overrides or minor corrections to the yaim defaults. Compared to inserting the changes in cfengine this method has the benefit of being integrated and predictable. When we run yaim the changes are applied immediately and don't get overridden.&lt;br /&gt;&lt;br /&gt;* New storage: our room is full and when the clusters are loaded we hit the power/cooling limit and risk drawing power from other rooms. Because of this the CC people don't want us to switch on new equipment without switching off some old kit to maintain the balance in the power consumption. So eventually we have bought 96 TB of raw space to get us going. The kit arrived yesterday and needs to be installed in the rack we have, and power measurements need to be taken to avoid switching off more than the necessary number of nodes. 
Luckily it will not be many anyway (even taking the nominal value on the back of the new machines it would be 8 nodes, but with better power measurements it could be as few as 4) because the new machines consume much less than the DELL nodes that are now 4 years old. However buying new CPUs/storage cannot be done without switching off a significant fraction of the current CPUs before switching on the new kit, and it requires working in tight cooperation with the CCS people, which has now been agreed after a meeting I had last week with them and their management.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-1293508126716009727?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/1293508126716009727/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=1293508126716009727' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/1293508126716009727'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/1293508126716009727'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2008/12/manchester-general-changes.html' title='Manchester General Changes'/><author><name>Alessandra Forti</name><uri>http://www.blogger.com/profile/11973932320387024088</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' 
src='http://4.bp.blogspot.com/_nB3QAxoLs5o/SUfJ_bcRvXI/AAAAAAAAARk/nlOg9eaXhaY/S220/patyten_seaOttersSwim.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-834097829133213801</id><published>2008-12-16T20:49:00.009Z</published><updated>2008-12-16T21:36:28.349Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='Manchester'/><title type='text'>Phasing out VO queues</title><content type='html'>I've started to phase out VO queues and to create VO shared queues. The plan is eventually to have 4 queues called, with a leap of imagination, short, medium, long and test, with the following characteristics:&lt;br /&gt;&lt;br /&gt;test: 3h/4h; all local VOs&lt;br /&gt;short: 6h/12h; ops,lhcbsgm,atlasgm&lt;br /&gt;medium: 12h/24h; all VOs and roles but those in short queue and production&lt;br /&gt;long: 24h/48h; all VOs and roles but those that can access the short queue&lt;br /&gt;&lt;br /&gt;Installing the queues, adding the groups ACLs and publishing them is not difficult. YAIM (glite-yaim-core-4.0.4-1, glite-yaim-lcg-ce-4.0.4-2 or higher) can do it for you. Otherwise it can be done by hand, which is still easy but is more difficult to maintain (the risk of changes being overridden is always high and files need to be maintained in cfengine, cvs or similar).&lt;br /&gt;&lt;br /&gt;The problem for me is that this scheme works only if the users select the correct ACLs and a suitable queue with the right length for their jobs in their JDL. If they don't, the queue chosen by the WMS is random, with a high probability of jobs failing because they end up in a queue that is too short or in a queue that doesn't have the right ACLs. 
So I'm not sure if it's really a good idea, even if it is much easier to maintain and allows slightly more sophisticated setups.&lt;br /&gt;&lt;br /&gt;Anyway, if you do it with YAIM all you have to do is add the queue to&lt;br /&gt;&lt;br /&gt;QUEUES="my-new-queue other-queues"&lt;br /&gt;&lt;br /&gt;add the right VO/FQAN to the new queue's _GROUP_ENABLE variable (remember to convert . and - into _)&lt;br /&gt;&lt;br /&gt;MY_NEW_QUEUE_GROUP_ENABLE="atlas /atlas/ROLE=pilot other-vos-or-fqans"&lt;br /&gt;&lt;br /&gt;the syntax of GROUP_ENABLE has to be the same as the one you have used in groups.conf (see previous post &lt;a href="http://northgrid-tech.blogspot.com/2008/12/groupsconf-syntax.html"&gt;http://northgrid-tech.blogspot.com/2008/12/groupsconf-syntax.html&lt;/a&gt;)&lt;br /&gt;&lt;br /&gt;And finally add to site-info.def&lt;br /&gt;&lt;br /&gt;FQANVOVIEWS=yes&lt;br /&gt;&lt;br /&gt;to enable publishing of the ACL in the GIP.&lt;br /&gt;&lt;br /&gt;Rerun YAIM on the CE as normal.&lt;br /&gt;&lt;br /&gt;To check everything is ok on the CE:&lt;br /&gt;&lt;br /&gt;qmgr -c 'p q my-new-queue'&lt;br /&gt;&lt;br /&gt;ldapsearch -x -H ldap://MY-CE.MY-DOMAIN:2170 -b GlueCEUniqueID=MY-CE.MY-DOMAIN:2119/jobmanager-lcgpbs-my-new-queue,Mds-Vo-name=resource,o=grid&lt;br /&gt;&lt;br /&gt;Among other things, if correctly configured it should list the GlueCEAccessControlBaseRules for each VO and FQAN you have listed in _GROUP_ENABLE.&lt;br /&gt;&lt;br /&gt;If a&lt;br /&gt;GlueCEAccessControlBaseRule: DENY:FQAN field appears, that's the ACL for VOViews, not the access to the queue.&lt;br /&gt;&lt;br /&gt;Thanks to Steve and Maria for pointing me to the right combination of YAIM packages and confirming the randomness of WMS matchmaking.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-834097829133213801?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' 
type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/834097829133213801/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=834097829133213801' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/834097829133213801'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/834097829133213801'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2008/12/phasing-out-vo-queues.html' title='Phasing out VO queues'/><author><name>Alessandra Forti</name><uri>http://www.blogger.com/profile/11973932320387024088</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://4.bp.blogspot.com/_nB3QAxoLs5o/SUfJ_bcRvXI/AAAAAAAAARk/nlOg9eaXhaY/S220/patyten_seaOttersSwim.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-541565218566980458</id><published>2008-12-15T16:29:00.011Z</published><updated>2008-12-15T17:45:03.978Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='NorthGrid'/><title type='text'>groups.conf syntax</title><content type='html'>Elena asked about it a few days ago on TB-SUPPORT. Today I investigated a bit further and the result is that for glite-yaim-core versions &gt;&lt;span id="hidsubpartcontentdiscussion"&gt;4.0.4-1:&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;* Even if it still works, the syntax with VO= and GROUP= is obsolete. 
The new syntax is much simpler as it directly uses the FQANs as reported in the VO cards (if they are maintained).&lt;br /&gt;&lt;br /&gt;* The syntax in /opt/glite/yaim/examples/groups.conf.example is correct and the files in the directory are kept up to date with the correct syntax, although the examples might not be valid.&lt;br /&gt;&lt;br /&gt;* Further information can be found either in&lt;br /&gt;&lt;br /&gt;/opt/glite/examples/groups.conf.README&lt;br /&gt;&lt;br /&gt;or&lt;br /&gt;&lt;a href="https://twiki.cern.ch/twiki/bin/view/LCG/YaimGuide400#Group_configuration_in_YAIM"&gt;&lt;br /&gt;&lt;/a&gt;&lt;span style="text-decoration: underline;"&gt;&lt;a href="https://twiki.cern.ch/twiki/bin/view/LCG/YaimGuide400#Group_configuration_in_YAIM"&gt;https://twiki.cern.ch/twiki/bin/view/LCG/YaimGuide400#Group_configuration_in_YAIM&lt;/a&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;which is worth reviewing periodically for changes.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-541565218566980458?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/541565218566980458/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=541565218566980458' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/541565218566980458'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/541565218566980458'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2008/12/groupsconf-syntax.html' title='groups.conf syntax'/><author><name>Alessandra 
Forti</name><uri>http://www.blogger.com/profile/11973932320387024088</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://4.bp.blogspot.com/_nB3QAxoLs5o/SUfJ_bcRvXI/AAAAAAAAARk/nlOg9eaXhaY/S220/patyten_seaOttersSwim.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-5897519016438381877</id><published>2008-12-01T15:52:00.002Z</published><updated>2008-12-01T16:25:31.493Z</updated><title type='text'>RFIO tuning for Atlas analysis jobs</title><content type='html'>A little info about the RFIO settings we've tested at Liverpool.&lt;br /&gt;&lt;br /&gt;Atlas analysis jobs running on a site using DPM use POSIX access through the RFIO interface. ROOT (since v5.16 IIRC) has support for RFIO access and uses the buffered access mode READBUF. This allocates a static buffer for files read via RFIO on the client. By default this buffer is 128kB.&lt;br /&gt;&lt;br /&gt;Initial tests with this default buffer size showed a low cpu efficiency and a high rate of bandwidth usage, far more than the size of the files being accessed. The buffer size can be altered by including a file on the client called /etc/shift.conf containing&lt;br /&gt;&lt;br /&gt;RFIO IOBUFSIZE XXX&lt;br /&gt;&lt;br /&gt;where XXX is the size in bytes. Altering this setting gave the following results&lt;br /&gt;&lt;br /&gt;Buffer (MB), CPU (%), Data transferred (GB)&lt;br /&gt;0.125, 60.0, 16.5&lt;br /&gt;1.000, 23.0, 65.5&lt;br /&gt;10.00, 13.5, 174.0&lt;br /&gt;64.00, 62.1, 11.5&lt;br /&gt;128.0, 74.7, 7.5&lt;br /&gt;&lt;br /&gt;This was on a test data set with file sizes of ~1.5GB and using athena 14.2.10.&lt;br /&gt;&lt;br /&gt;Using buffer sizes of 64MB+ gives gains in efficiency and required bandwidth. 
A 128MB buffer is a significant chunk of a worker node's RAM, but as the files are not being cached in the linux file cache the RAM usage is likely similar to accessing the file from local disk, and the gains are large.&lt;br /&gt;&lt;br /&gt;For comparison the same test was run from a copy of the files on local disk. This gave a cpu efficiency of ~50% but the event rate was ~8 times slower than when using RFIO.&lt;br /&gt;&lt;br /&gt;My conclusions are that RFIO buffering is significantly more efficient than standard linux file caching. The default buffer size is insufficient and increasing it by small amounts greatly reduces efficiency. Increasing the buffer to 64-128MB gives big gains without impacting available RAM too much.&lt;br /&gt;&lt;br /&gt;My guess is that only a big buffer gives gains because of the random access on the file by the analysis job. Reading in a small chunk, eg 1MB, may buffer a whole event but the next event is unlikely to be in that buffered 1MB, so another 1MB has to be read in for the next event. Similarly for 10MB, although this time the amount read in each time is 10x as much but with a less than 10x increase in probability of the event being in the buffer. When the buffer reaches 64MB the probability of an event being in the buffered area is high enough to offset the extra data being read in.&lt;br /&gt;&lt;br /&gt;Another possibility is that the buffering only buffers the first xMB of the file, hence a bigger buffer means more of the file is in RAM and there's a higher probability of the event being in the buffer. Neither of these hypotheses has been investigated further yet.&lt;br /&gt;&lt;br /&gt;Large block reads are also more efficient when reading in the data than lots of small random reads. 
The efficiency effectively becomes 100% if the buffer size is &gt;= the dataset file size; the first reads pull in all of the file and all reads from then are from local RAM.&lt;br /&gt;&lt;br /&gt;This makes no difference to the impact on the head node for eg SURL/TURL requests, only the efficiency of the analysis job accessing the data from the pool nodes and the required bandwidth (our local tests simply used the rfio:///dpm/... path directly). If there are enough jobs there will still be bottlenecks on the network, either at switch or pool node. We have given all our pool nodes at least 3Gb/s connectivity to the LAN backbone.&lt;br /&gt;&lt;br /&gt;The buffer size setting will give different efficiency gains for different file sizes (i.e. the smaller the file size, the better the efficiency), eg the first atlas analysis test had smaller file sizes than our tests and showed much higher efficiencies. The impact of the BUFSIZE setting on other VOs' analysis jobs that use RFIO hasn't been tested.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-5897519016438381877?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/5897519016438381877/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=5897519016438381877' title='6 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/5897519016438381877'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/5897519016438381877'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2008/12/rfio-tuning-for-atlas-analysis-jobs.html' title='RFIO tuning for Atlas analysis 
jobs'/><author><name>John Bland</name><uri>http://www.blogger.com/profile/16051241409269392358</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>6</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-5465862416993566456</id><published>2008-11-28T17:07:00.003Z</published><updated>2008-11-28T17:15:27.683Z</updated><title type='text'>Nagios Checker</title><content type='html'>Nagios checker is a firefox plugin that can replace nagios email alerts with colourful, blinking and possibly noisy icons at the bottom of your firefox window.&lt;br /&gt;&lt;br /&gt;The icon expands into lines listing the problematic hosts when the cursor hovers over it and, with the right permissions, clicking on a host takes you to that host's nagios page.&lt;br /&gt;&lt;br /&gt;&lt;a href="https://addons.mozilla.org/en-US/firefox/addon/3607"&gt;https://addons.mozilla.org/en-US/firefox/addon/3607&lt;br /&gt;&lt;/a&gt;&lt;br /&gt;To configure it to read NorthGrid nagios:&lt;br /&gt;&lt;br /&gt;* Go to settings (right-click on the nagios icon on firefox)&lt;br /&gt;&lt;br /&gt;* Click Add new&lt;br /&gt;&lt;br /&gt;* In the General tab:&lt;br /&gt;** Name: whatever you like&lt;br /&gt;** WEB URL: https://niels004.tier2.hep.manchester.ac.uk/nagios&lt;br /&gt;** Tick the nagios older than 2.0 box&lt;br /&gt;** User name: your DN&lt;br /&gt;** Status script URL: https://niels004.tier2.hep.manchester.ac.uk/nagios/cgi-bin/status.cgi&lt;br /&gt;** Click Ok&lt;br /&gt;&lt;br /&gt;If you want only your site machines&lt;br /&gt;&lt;br /&gt;* Go to the filters tab and&lt;br /&gt;** Tick the 'Hosts matching regular expressions' box&lt;br /&gt;** Insert your domain name in the text box&lt;br /&gt;** Tick the reverse expression box.&lt;br /&gt;** Click ok&lt;br /&gt;&lt;br /&gt;For the rest you can adjust it as you please; I removed the noises and 
inserted a 3600 sec refresh interval.&lt;br /&gt;&lt;br /&gt;Drawbacks if you add other nagios end-points:&lt;br /&gt;&lt;br /&gt;* Settings are applied to all of them&lt;br /&gt;* If the host names in your other nagios have a different domain name (or don't have it at all) they don't get filtered.&lt;br /&gt;&lt;br /&gt;Perhaps another method might be needed. Investigating.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-5465862416993566456?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/5465862416993566456/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=5465862416993566456' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/5465862416993566456'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/5465862416993566456'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2008/11/nagios-checker.html' title='Nagios Checker'/><author><name>Alessandra Forti</name><uri>http://www.blogger.com/profile/11973932320387024088</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://4.bp.blogspot.com/_nB3QAxoLs5o/SUfJ_bcRvXI/AAAAAAAAARk/nlOg9eaXhaY/S220/patyten_seaOttersSwim.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-4029168553035797789</id><published>2008-11-28T10:12:00.002Z</published><updated>2008-11-28T10:14:55.236Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='Manchester'/><title type='text'>Manchester and Lhcb</title><content 
type='html'>Manchester is now officially part of Lhcb and all the CPU hours will have full weight!! Yuppieee!! :)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-4029168553035797789?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/4029168553035797789/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=4029168553035797789' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/4029168553035797789'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/4029168553035797789'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2008/11/manchester-and-lhcb.html' title='Manchester and Lhcb'/><author><name>Alessandra Forti</name><uri>http://www.blogger.com/profile/11973932320387024088</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://4.bp.blogspot.com/_nB3QAxoLs5o/SUfJ_bcRvXI/AAAAAAAAARk/nlOg9eaXhaY/S220/patyten_seaOttersSwim.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-6806814383678824476</id><published>2008-11-27T13:07:00.006Z</published><updated>2008-11-27T13:31:21.915Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='Manchester'/><title type='text'>Black holes detection</title><content type='html'>Finding black holes is always a pain... However the pbs accounting records can be of help. 
A simple script that counts the number of jobs a node swallows makes some difference:&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.sysadmin.hep.ac.uk/svn/fabric-management/torque/jobs/black-holes-finder.sh"&gt;http://www.sysadmin.hep.ac.uk/svn/fabric-management/torque/jobs/black-holes-finder.sh&lt;br /&gt;&lt;/a&gt;&lt;br /&gt;I post it just in case other people are interested.&lt;br /&gt;&lt;br /&gt;An example of the output:&lt;br /&gt;&lt;span style="color: rgb(0, 0, 0);"&gt;&lt;br /&gt;# black-holes-finder.sh&lt;br /&gt;&lt;br /&gt;Using accounting file 20081127&lt;br /&gt;[...]&lt;br /&gt;bohr5029: 1330&lt;br /&gt;bohr5030: 1803&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;clearly the two nodes above have a problem.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-6806814383678824476?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/6806814383678824476/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=6806814383678824476' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/6806814383678824476'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/6806814383678824476'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2008/11/black-holes-detection.html' title='Black holes detection'/><author><name>Alessandra Forti</name><uri>http://www.blogger.com/profile/11973932320387024088</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' 
src='http://4.bp.blogspot.com/_nB3QAxoLs5o/SUfJ_bcRvXI/AAAAAAAAARk/nlOg9eaXhaY/S220/patyten_seaOttersSwim.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-1534310708324538788</id><published>2008-11-27T13:01:00.005Z</published><updated>2008-11-27T13:28:50.499Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='Manchester'/><title type='text'>MPI enabled</title><content type='html'>I enabled MPI in Manchester using YAIM and Stephen Childs' recipe, which I found at the links below:&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.grid.ie/mpi/wiki/YaimConfig"&gt;http://www.grid.ie/mpi/wiki/YaimConfig&lt;br /&gt;&lt;/a&gt;&lt;br /&gt;&lt;a href="http://www.grid.ie/mpi/wiki/SiteConfig"&gt;http://www.grid.ie/mpi/wiki/SiteConfig&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Caveats:&lt;br /&gt;&lt;br /&gt;1) The documentation will probably move to the official YAIM pages&lt;br /&gt;&lt;br /&gt;2) The location of the gip files is now under /opt/glite, not /opt/lcg&lt;br /&gt;&lt;br /&gt;3) The scripts DO interfere with the current setup on the WNs if run on their own, so you need to reconfigure the whole node (I made the mistake of running only MPI_WN). On the CE, instead, it's enough to run MPI_CE.&lt;br /&gt;&lt;br /&gt;4) The MPI_SUBMIT_FILTER variable in site-info.def is not documented (yet). It enables the section of the scripts that rewrites the torque submit_filter script, which allocates the correct number of CPUs&lt;br /&gt;&lt;br /&gt;5) Yaim doesn't publish MPICH (yet?) 
so I had to add the following lines&lt;br /&gt;&lt;br /&gt;GlueHostApplicationSoftwareRunTimeEnvironment: MPICH&lt;br /&gt;GlueHostApplicationSoftwareRunTimeEnvironment: MPICH-1.2.7&lt;br /&gt;&lt;br /&gt;to /opt/glite/etc/gip/ldif/static-file-Cluster.ldif manually.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-1534310708324538788?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/1534310708324538788/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=1534310708324538788' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/1534310708324538788'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/1534310708324538788'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2008/11/mpi-enabled.html' title='MPI enabled'/><author><name>Alessandra Forti</name><uri>http://www.blogger.com/profile/11973932320387024088</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://4.bp.blogspot.com/_nB3QAxoLs5o/SUfJ_bcRvXI/AAAAAAAAARk/nlOg9eaXhaY/S220/patyten_seaOttersSwim.jpg'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-5356362286566004875</id><published>2008-11-25T18:04:00.006Z</published><updated>2008-11-25T19:00:21.717Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='NorthGrid'/><title type='text'>Regional nagios update</title><content type='html'>I reinstalled the regional nagios with Nagios3 and it works now.&lt;br /&gt;&lt;br /&gt;&lt;a 
href="https://niels004.tier2.hep.manchester.ac.uk/nagios"&gt;https://niels004.tier2.hep.manchester.ac.uk/nagios&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;As suggested by Steve I'm also trying the nagios checker plugin&lt;br /&gt;&lt;br /&gt;&lt;a href="https://addons.mozilla.org/en-US/firefox/addon/3607"&gt;https://addons.mozilla.org/en-US/firefox/addon/3607&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;instead of the email notification but I still have to configure things properly. At the moment firefox makes some noise every ~30 seconds and there is also a visual alert on the bottom right corner of the firefox window with the number of services in critical state which expands to show the services when the cursor points at it. Really nice. :)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-5356362286566004875?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/5356362286566004875/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=5356362286566004875' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/5356362286566004875'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/5356362286566004875'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2008/11/regional-nagios-update.html' title='Regional nagios update'/><author><name>Alessandra Forti</name><uri>http://www.blogger.com/profile/11973932320387024088</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' 
src='http://4.bp.blogspot.com/_nB3QAxoLs5o/SUfJ_bcRvXI/AAAAAAAAARk/nlOg9eaXhaY/S220/patyten_seaOttersSwim.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-2383155290235476842</id><published>2008-11-20T16:58:00.004Z</published><updated>2008-11-20T17:36:32.653Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='NorthGrid'/><title type='text'>WMS talk</title><content type='html'>I gave a talk about the WMS for the benefit of the Manchester users. It might be of interest to other people.&lt;br /&gt;&lt;br /&gt;The talk can be found here:&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.hep.manchester.ac.uk/computing/tier2/talks/wms-overview-20081120.pdf"&gt;WMS Overview&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-2383155290235476842?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/2383155290235476842/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=2383155290235476842' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/2383155290235476842'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/2383155290235476842'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2008/11/wms-talk.html' title='WMS talk'/><author><name>Alessandra Forti</name><uri>http://www.blogger.com/profile/11973932320387024088</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' 
src='http://4.bp.blogspot.com/_nB3QAxoLs5o/SUfJ_bcRvXI/AAAAAAAAARk/nlOg9eaXhaY/S220/patyten_seaOttersSwim.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-9207868015674713538</id><published>2008-11-10T12:28:00.003Z</published><updated>2008-11-10T12:50:50.833Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='Lancaster'/><title type='text'>DPM "File Not Found"- but it's right there!</title><content type='html'>Lancaster's been having a bad run with atlas jobs the last few weeks. We've been failing jobs with error messages like:&lt;br /&gt;"/dpm/lancs.ac.uk/home/atlas/atlasproddisk/panda/panda.60928294-7fd4-4007-af5e-4cdc3c8934d3_dis18552080/HITS.024853._00326.pool.root.2 : No such file or directory (error 2 on fal-pygrid-35.lancs.ac.uk)"&lt;br /&gt;&lt;br /&gt;However when we break out the great dpm tools and track down this file to disk, it's right where it should be, with correct permissions, size and age- not even attempting to hide. The log files show nothing directly exciting, although there are a lot of deletes going on. Ganglia comes up with something a little more interesting on the dpm head node- heavy use of swap memory and erratic high CPU use. A restart of the dpm and dpns services seems to have calmed things somewhat this morning, but mem usage is climbing pretty fast:&lt;br /&gt;&lt;br /&gt;http://fal-pygrid-17.lancs.ac.uk:8123/ganglia/?r=day&amp;sg=&amp;c=LCG-ServiceNodes&amp;h=fal-pygrid-30.lancs.ac.uk&lt;br /&gt;&lt;br /&gt;The next step is for me to go shopping for RAM, the headnode is a sturdy box but only has 2GB, an upgrade to 4 should give us more breathing space. 
But the real badness comes from the fact that all this swapping should decrease performance, but not lead to the situation we have, where the dpm databases, under high load, seem to return false negatives to queries about files- telling us they don't exist when they're right there on disk where they should be.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-9207868015674713538?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/9207868015674713538/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=9207868015674713538' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/9207868015674713538'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/9207868015674713538'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2008/11/dpm-file-not-found-but-its-right-there.html' title='DPM &quot;File Not Found&quot;- but it&apos;s right there!'/><author><name>Matt</name><uri>http://www.blogger.com/profile/16514316941712112895</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-3626799036014692836</id><published>2008-11-07T13:44:00.009Z</published><updated>2008-11-27T11:17:14.994Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='NorthGrid'/><title type='text'>Regional nagios</title><content type='html'>I installed a regional nagios yesterday; it turned out to be actually quite 
easy, and the nagios group was quite helpful. I followed the tutorial given by Steve at EGEE08&lt;br /&gt;&lt;br /&gt;&lt;a href="https://twiki.cern.ch/twiki/bin/view/EGEE/GridMonitoringNcgYaimTutorial"&gt;https://twiki.cern.ch/twiki/bin/view/EGEE/GridMonitoringNcgYaimTutorial&lt;/a&gt;&lt;br /&gt;&lt;a href="https://twiki.cern.ch/twiki/bin/view/LCG/GridMonitoringNcgYaimTutorial"&gt;&lt;/a&gt;&lt;br /&gt;I updated it while I was going along instead of writing a parallel document.&lt;br /&gt;&lt;br /&gt;Below is the URL of the test installation. It might get reinstalled a few times to test other features in the next few days.&lt;br /&gt;&lt;br /&gt;&lt;a class="moz-txt-link-freetext" href="https://niels004.tier2.hep.manchester.ac.uk/nagios/"&gt;https://niels004.tier2.hep.manchester.ac.uk/nagios&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-3626799036014692836?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/3626799036014692836/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=3626799036014692836' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/3626799036014692836'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/3626799036014692836'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2008/11/regional-nagios.html' title='Regional nagios'/><author><name>Alessandra Forti</name><uri>http://www.blogger.com/profile/11973932320387024088</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' 
src='http://4.bp.blogspot.com/_nB3QAxoLs5o/SUfJ_bcRvXI/AAAAAAAAARk/nlOg9eaXhaY/S220/patyten_seaOttersSwim.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-3375850374037627173</id><published>2008-08-27T15:39:00.003Z</published><updated>2008-08-27T15:43:34.966Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='Manchester'/><title type='text'>DPM in Manchester</title><content type='html'>Manchester now has a fully working DPM with 6 TB. There are 2 space tokens: ATLASPRODDISK and ATLASDATADISK. The service has been added to the GOC and the Information System, and the space tokens are published. The errors have been corrected and the system has been passing the SAM tests continuously since yesterday.&lt;br /&gt;&lt;br /&gt;I added some information to the wiki&lt;br /&gt;&lt;br /&gt;&lt;a href="https://www.gridpp.ac.uk/wiki/Manchester_DPM#Atlas_Space_tokens"&gt;https://www.gridpp.ac.uk/wiki/Manchester_DPM#Atlas_Space_tokens&lt;/a&gt;&lt;br /&gt;&lt;a href="https://www.gridpp.ac.uk/wiki/Manchester_DPM#Errors_along_the_path"&gt;https://www.gridpp.ac.uk/wiki/Manchester_DPM#Errors_along_the_path&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-3375850374037627173?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/3375850374037627173/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=3375850374037627173' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/3375850374037627173'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/3375850374037627173'/><link 
rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2008/08/dpm-in-manchester.html' title='DPM in Manchester'/><author><name>Alessandra Forti</name><uri>http://www.blogger.com/profile/11973932320387024088</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://4.bp.blogspot.com/_nB3QAxoLs5o/SUfJ_bcRvXI/AAAAAAAAARk/nlOg9eaXhaY/S220/patyten_seaOttersSwim.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-4667223938916961785</id><published>2008-07-21T18:02:00.003Z</published><updated>2008-11-27T11:17:36.259Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='Manchester'/><title type='text'>DPM space tokens</title><content type='html'>Below are the first tests of setting up space tokens on the DPM testbed:&lt;br /&gt;&lt;br /&gt;&lt;a href="https://www.gridpp.ac.uk/wiki/Manchester_DPM"&gt;https://www.gridpp.ac.uk/wiki/Manchester_DPM&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-4667223938916961785?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/4667223938916961785/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=4667223938916961785' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/4667223938916961785'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/4667223938916961785'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2008/07/dpm-space-tokens.html' title='DPM space tokens'/><author><name>Alessandra 
Forti</name><uri>http://www.blogger.com/profile/11973932320387024088</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://4.bp.blogspot.com/_nB3QAxoLs5o/SUfJ_bcRvXI/AAAAAAAAARk/nlOg9eaXhaY/S220/patyten_seaOttersSwim.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-7457485549298120436</id><published>2008-06-03T13:53:00.002Z</published><updated>2008-06-03T14:04:01.382Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='atlas'/><title type='text'>Slaughtering ATLAS jobs?</title><content type='html'>These heavy ion jobs have generated lots of discussion in various forums. They are ATLAS heavy ion simulations (Pb-Pb) which are being killed in two ways: (1) by the site's batch system if the queue walltime limit is reached; (2) by the atlas pilot because the log file modification time hasn't changed in 24 hrs.&lt;br /&gt;&lt;br /&gt;Either way, sites shouldn't worry if they see these; ATLAS production is aware. They're only single event and you might see massive mem usage too &gt; 2G/core. 
:-)&lt;br /&gt;&lt;br /&gt;&lt;a href="http://steveatcern.blogspot.com/2007/10/wn-meeting-kickoff-last-week.html"&gt;According to Steve&lt;/a&gt;, the new WN should allow jobs to gracefully handle batch kills with a suitable delay between SIGTERM and SIGKILL.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-7457485549298120436?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/7457485549298120436/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=7457485549298120436' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/7457485549298120436'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/7457485549298120436'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2008/06/slaughtering-atlas-jobs.html' title='Slaughtering ATLAS jobs?'/><author><name>Peter</name><uri>http://www.blogger.com/profile/05855046025692405834</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-849603627196253240</id><published>2008-05-23T11:07:00.005Z</published><updated>2008-05-23T12:43:16.524Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='Manchester'/><title type='text'>buggy glite-yaim-core</title><content type='html'>glite-yaim-core versions &gt;4.0.4-1 no longer recognise VO_$vo_VOMSES, even if it is set correctly in the vo.d dir. I'm still wondering how the testing is performed. 
Primary functionality, like completion without self-evident errors, seems to be overlooked. Anyway, looking on the bright side... the bug will be fixed in yaim-core 4.0.5-something. In the meantime I had to downgrade&lt;br /&gt;&lt;br /&gt;glite-yaim-core to 4.0.3-13&lt;br /&gt;glite-yaim-lcg-ce to 4.0.2-1&lt;br /&gt;lcg-CE to 3.1.5-0&lt;br /&gt;&lt;br /&gt;which was the combination previously working.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-849603627196253240?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/849603627196253240/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=849603627196253240' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/849603627196253240'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/849603627196253240'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2008/05/buggy-glite-yaim-core.html' title='buggy glite-yaim-core'/><author><name>Alessandra Forti</name><uri>http://www.blogger.com/profile/11973932320387024088</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://4.bp.blogspot.com/_nB3QAxoLs5o/SUfJ_bcRvXI/AAAAAAAAARk/nlOg9eaXhaY/S220/patyten_seaOttersSwim.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-248331557944311130</id><published>2008-05-23T11:05:00.003Z</published><updated>2008-05-23T12:43:55.035Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='Manchester'/><title type='text'>CE problems 
with lcas an update</title><content type='html'>lcas/lcmaps have the debug level set to 5 by default. Apparently it can be changed by setting appropriate env variables for the globus-lcas-lcmaps interface. The variables are actually foreseen in yaim. The errors can be generated by a mistyped DN when a non-VOMS proxy is used. This is a very easy way to generate a DoS attack.&lt;br /&gt;&lt;br /&gt;After downgrading yaim/lcg-CE I've reconfigured the CE and it seems to be working now. I haven't seen any of the debug messages for now.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-248331557944311130?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/248331557944311130/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=248331557944311130' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/248331557944311130'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/248331557944311130'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2008/05/ce-problems-with-lcas-update.html' title='CE problems with lcas an update'/><author><name>Alessandra Forti</name><uri>http://www.blogger.com/profile/11973932320387024088</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' 
src='http://4.bp.blogspot.com/_nB3QAxoLs5o/SUfJ_bcRvXI/AAAAAAAAARk/nlOg9eaXhaY/S220/patyten_seaOttersSwim.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-1346251213822602221</id><published>2008-05-22T16:43:00.003Z</published><updated>2008-05-22T16:51:06.215Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='Manchester'/><title type='text'>globus-gatekeeper weirdness</title><content type='html'>globus-gatekeeper has started to spit out level 4 lcas/lcmaps messages from nowhere at 4 o'clock in the morning. The log file reaches a few GB in size within a few hours and fills /var/log, screwing the CE. I contacted the nikhef people for help but haven't received an answer yet. The documentation is not helpful.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-1346251213822602221?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/1346251213822602221/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=1346251213822602221' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/1346251213822602221'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/1346251213822602221'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2008/05/globus-gatekeeper-weirdness.html' title='globus-gatekeeper weirdness'/><author><name>Alessandra Forti</name><uri>http://www.blogger.com/profile/11973932320387024088</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' 
src='http://4.bp.blogspot.com/_nB3QAxoLs5o/SUfJ_bcRvXI/AAAAAAAAARk/nlOg9eaXhaY/S220/patyten_seaOttersSwim.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-2667008113233412599</id><published>2008-05-19T17:41:00.002Z</published><updated>2008-05-19T17:43:39.931Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='Manchester'/><title type='text'>and again</title><content type='html'>we have upgraded to 1.8. At least we can look at the space manager while waiting for a solution to the number of jobs cap and the replica manager not replicating.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-2667008113233412599?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/2667008113233412599/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=2667008113233412599' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/2667008113233412599'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/2667008113233412599'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2008/05/and-again.html' title='and again'/><author><name>Alessandra Forti</name><uri>http://www.blogger.com/profile/11973932320387024088</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' 
src='http://4.bp.blogspot.com/_nB3QAxoLs5o/SUfJ_bcRvXI/AAAAAAAAARk/nlOg9eaXhaY/S220/patyten_seaOttersSwim.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-7336199332334205901</id><published>2008-05-15T16:58:00.004Z</published><updated>2008-05-16T11:05:21.444Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='Manchester'/><title type='text'>Still dcache</title><content type='html'>With Vladimir's help Sergey managed to start the replica manager by changing a java option in replica.batch. This is as far as it goes because it still doesn't work, i.e. it doesn't produce replicas. We have just given Vladimir access to the testbed.&lt;br /&gt;&lt;br /&gt;It seems Chris Brew has been having the same 'cannot run more than 200 jobs' problem since he upgraded. He sent an email to the dcache user-forum. This makes me think that even if the replica manager might help, it will not cure the problem.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-7336199332334205901?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/7336199332334205901/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=7336199332334205901' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/7336199332334205901'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/7336199332334205901'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2008/05/with-vladimir-help-sergey-managed-to.html' title='Still dcache'/><author><name>Alessandra 
Forti</name><uri>http://www.blogger.com/profile/11973932320387024088</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://4.bp.blogspot.com/_nB3QAxoLs5o/SUfJ_bcRvXI/AAAAAAAAARk/nlOg9eaXhaY/S220/patyten_seaOttersSwim.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-1395303838817846880</id><published>2008-05-13T14:54:00.002Z</published><updated>2008-05-13T14:57:39.542Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='Manchester'/><title type='text'>Manchester dcache troubles (2)</title><content type='html'>fnal developers are looking at the replica manager issue. The error lines found in the admindomain log appear also at fnal and don't seem to be a problem there. The search continues...&lt;br /&gt;&lt;br /&gt;In the meantime we have doubled the memory of all dcache head nodes.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-1395303838817846880?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/1395303838817846880/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=1395303838817846880' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/1395303838817846880'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/1395303838817846880'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2008/05/manchester-dcache-troubles-2.html' title='Manchester dcache troubles (2)'/><author><name>Alessandra 
Forti</name><uri>http://www.blogger.com/profile/11973932320387024088</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://4.bp.blogspot.com/_nB3QAxoLs5o/SUfJ_bcRvXI/AAAAAAAAARk/nlOg9eaXhaY/S220/patyten_seaOttersSwim.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-5753927362927708498</id><published>2008-05-12T09:08:00.005Z</published><updated>2008-05-12T09:29:40.595Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='Lancaster'/><title type='text'>The data centre in the sky</title><content type='html'>Sent adverts out this week, trying to shift our ancient DZero farm which was de-commissioned a few years ago. It had a glorious past with many production records set whilst generating DZero's RunII simulation data. Its dual 700MHz PentiumIII CPUs with 1G RAM can't cope with much these days, and it's certainly not worth the manpower to keep them online. 
&lt;a href="http://www.hep.lancs.ac.uk/~love/advert.html"&gt;Here is the advert&lt;/a&gt; if you're interested.&lt;br /&gt;&lt;br /&gt;In other news, our MON box system disk spat its dummy over the weekend. This was one of the three GridPP machines; not bad going after 4 years.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-5753927362927708498?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/5753927362927708498/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=5753927362927708498' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/5753927362927708498'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/5753927362927708498'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2008/05/data-centre-in-sky.html' title='The data centre in the sky'/><author><name>Peter</name><uri>http://www.blogger.com/profile/05855046025692405834</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-546069534507042086</id><published>2008-05-06T14:15:00.012Z</published><updated>2008-05-13T10:32:23.907Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='Manchester'/><title type='text'>Manchester SL4 dcache troubles</title><content type='html'>Since the upgrade to SL4, Manchester has been experiencing problems with dcache.&lt;br /&gt;&lt;br /&gt;1) pnfs doesn't seem to take a load beyond 200 
atlas jobs (it times out). Alessandra has been unable to replicate the problem production is seeing. Even when starting 200 clients at the same time on the same file production is using, all she could see was the transfer time increasing from 2 seconds to ~190 seconds, but no timeout. On Saturday, when she looked at the dashboard, she found that 99.8% of ~1790 jobs had completed successfully in the last 24 hours, which also seems to contradict the 200-jobs-at-a-time statistic and needs to be explained.&lt;br /&gt;&lt;br /&gt;2) The replica manager doesn't work anymore, i.e. it doesn't even start, so no resilience is active. The error is a java InitSQL that, according to the dcache developers, should have been caused by a missing parameter. We sent them the requested configuration files and they couldn't find anything wrong with them. We gave Greig access to dcache and he couldn't see anything wrong either. A developer suggested moving to a newer version of dcache to solve the problem, which we had already tried, but the new version has a new problem. From the errors it seems that the schema has changed, but we didn't get a reply about this. In this instance the replica manager starts but cannot insert data into the database. The replica manager obviously helps to cut transfer times in half because there is more than one pool node serving the files (I tested this on the SL3 dcache: 60 concurrent clients take at most 35 sec each instead of 70). If the number of clients increases the effect is smaller but still in the range of 30%. In any case we are talking about a fistful of seconds, not the timeout range seen in production.&lt;br /&gt;&lt;br /&gt;3) Finally, even if all these problems were solved, Space Manager isn't compatible with Resilience, so pools with space tokens will not have the benefit of duplicates. Alessandra asked 2 months ago what the policy was in case she had to choose. 
It was agreed that for these initial tests it wasn't a problem.&lt;br /&gt;&lt;br /&gt;4) Another problem specific to Atlas is that although Manchester has 2 dcache instances, they have insisted on using only 1 for quite some time. This has obviously affected production heavily. After a discussion at CERN they finally agreed to split and use both instances, but that hasn't happened yet.&lt;br /&gt;&lt;br /&gt;5) This is minor but equally important for Manchester: VOs with DNS-style names are mishandled by dcache YAIM. We will open a ticket.&lt;br /&gt;&lt;br /&gt;We have applied all the optimizations suggested by the developers, even those that were not necessary, and nothing has changed. The old dcache instance, without optimizations and with the replica manager working, is taking a load of 300-400 atlas user jobs. According to local users, who are using it for their local production, both reading from it and writing into it, they have an almost 100% success rate (last week 7 job failures out of 2000 jobs submitted).&lt;br /&gt;&lt;br /&gt;Applied optimizations:&lt;br /&gt;&lt;br /&gt;1) Split pnfs from the dcache head node: we can now run 200 production jobs (but then again, as already said, the old dcache can take 400 jobs and its head node isn't split).&lt;br /&gt;2) Apply postgres optimizations: no results.&lt;br /&gt;3) Apply kernel optimizations for networking from CERN: transfers of small files are 30% faster, but this could also be due to a less loaded cluster.&lt;br /&gt;&lt;br /&gt;Most of the problems might come from the attempt to maintain the old data, so we will try to install a new dcache instance without it. 
Although it is not a very sustainable choice, it might help to understand what is the problem.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-546069534507042086?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/546069534507042086/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=546069534507042086' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/546069534507042086'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/546069534507042086'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2008/05/manchester-sl4-dcache-troubles.html' title='Manchester SL4 dcache troubles'/><author><name>Alessandra Forti</name><uri>http://www.blogger.com/profile/11973932320387024088</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://4.bp.blogspot.com/_nB3QAxoLs5o/SUfJ_bcRvXI/AAAAAAAAARk/nlOg9eaXhaY/S220/patyten_seaOttersSwim.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-272158383722538513</id><published>2008-04-09T07:53:00.003Z</published><updated>2008-04-10T09:41:46.348Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='atlas athena'/><title type='text'>Athena release 14 - new dependency</title><content type='html'>Athena release 14 has a new dependency on package 'libgfortran'. Sites with local Atlas users may want to check they have this. The runtime error message is rather difficult to decipher, however the buildtime error is explicit. 
I've added the package to the &lt;a href="https://twiki.cern.ch/twiki/bin/view/Atlas/DAonSLC4"&gt;required packages twiki page&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-272158383722538513?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/272158383722538513/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=272158383722538513' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/272158383722538513'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/272158383722538513'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2008/04/athena-release-14-new-dependancy.html' title='Athena release 14 - new dependency'/><author><name>Peter</name><uri>http://www.blogger.com/profile/05855046025692405834</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-1477109283241269426</id><published>2008-02-04T23:12:00.001Z</published><updated>2008-03-11T16:40:00.061Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='Lancaster'/><title type='text'>Power outage</title><content type='html'>A site-wide power outage occurred at Lancaster Uni this evening. 
The juice is now flowing but some intervention is required tomorrow morning before we're back to normal operations.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-1477109283241269426?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/1477109283241269426/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=1477109283241269426' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/1477109283241269426'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/1477109283241269426'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2008/02/power-outage.html' title='Power outage'/><author><name>Peter</name><uri>http://www.blogger.com/profile/05855046025692405834</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-5139682285849884083</id><published>2008-02-04T12:51:00.001Z</published><updated>2008-03-11T16:39:40.612Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='NorthGrid'/><title type='text'>I hate java!</title><content type='html'>#$%^&amp;amp;*@!!!!!!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-5139682285849884083?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' 
href='http://northgrid-tech.blogspot.com/feeds/5139682285849884083/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=5139682285849884083' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/5139682285849884083'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/5139682285849884083'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2008/02/i-hate-java.html' title='I hate java!'/><author><name>Alessandra Forti</name><uri>http://www.blogger.com/profile/11973932320387024088</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://4.bp.blogspot.com/_nB3QAxoLs5o/SUfJ_bcRvXI/AAAAAAAAARk/nlOg9eaXhaY/S220/patyten_seaOttersSwim.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-8103751531712692341</id><published>2008-01-30T12:02:00.000Z</published><updated>2008-01-30T12:11:32.244Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='Liverpool'/><title type='text'>Liverpool update</title><content type='html'>&lt;div style="text-align: justify;"&gt;&lt;span style=";font-family:webdings;font-size:100%;"  &gt;From Mike's reply:&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style=";font-family:webdings;font-size:100%;"  &gt;* We'll stay with dcache and are about to rebuild the whole SE (and the whole cluster including a new multi-core CE) when we shut down for a week soon to install SL4. 
Everything is under test at present and we are upgrading the rack software servers to 250GB RAID1 to cope with the&lt;/span&gt;&lt;span style=";font-family:webdings;font-size:100%;" class="moz-txt-citetags"  &gt; &lt;/span&gt;&lt;span style=";font-family:webdings;font-size:100%;"  &gt;100GB size of the ATLAS code.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style=";font-family:webdings;font-size:100%;"  &gt;* We are still testing Puppet (on our non-LCG cluster) as our preferred solution. It looks fine but we are not yet sure it will scale to many 100s of nodes.&lt;/span&gt;&lt;br /&gt;&lt;/div&gt;&lt;pre wrap=""&gt;&lt;br /&gt;&lt;/pre&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-8103751531712692341?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/8103751531712692341/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=8103751531712692341' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/8103751531712692341'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/8103751531712692341'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2008/01/liverpool-update.html' title='Liverpool update'/><author><name>Alessandra Forti</name><uri>http://www.blogger.com/profile/11973932320387024088</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' 
src='http://4.bp.blogspot.com/_nB3QAxoLs5o/SUfJ_bcRvXI/AAAAAAAAARk/nlOg9eaXhaY/S220/patyten_seaOttersSwim.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-3334257625993525859</id><published>2007-12-20T15:53:00.000Z</published><updated>2008-01-30T12:10:28.479Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='Lancaster'/><title type='text'>Lancaster's Winter Dcache Dramas</title><content type='html'>It's been a tough couple of months for Lancaster, with our SE giving us a number of problems.&lt;br /&gt;&lt;br /&gt;Our first drama, at the start of the month, was caused by unforeseen complications with our upgrade to dcache 1.8. Knowing that we were low on the support list due to being only a Tier 2, but feeling emboldened by the highly useful srm 2.2 workshop in Edinburgh and the good few years we've spent in the dcache trenches, we decided to take the plunge. We then faced a good few days of downtime beyond the one we had scheduled, first working through a number of bugs in the early versions of dcache 1.8 (fixed by upgrading to higher patch levels), then hitting problems when changes in the gridftp protocol highlighted inconsistencies between the users on our pnfs node and gridftp door node. Due to a hack long ago, several VOs had different user.conf entries, and therefore UIDs, on our door nodes and pnfs node. This never caused problems before, but after the upgrade the doors were passing the UIDs to the pnfs node, so new files and directories were created with the correct group (as the GIDs were consistent) but the wrong UID, causing permission troubles whenever a delete was called. This was a classic case of a problem that was hell to find the cause of but, once figured out, was thankfully easy to solve. Once we fixed that one it was green tests for a while.&lt;br /&gt;&lt;br /&gt;Then dcache drama number two came along a week later: a massive postgres db failure on our pnfs node. 
The postgres database contains all the information that dcache uses to match the fairly anonymously named files on the pool nodes to entries in the pnfs namespace. Without it dcache has no idea which files are which, so with it bust the files are almost as good as lost, which is why it should be backed up regularly. We did this twice daily, or at least we thought we did: a cron problem meant that our backups hadn't been made for a while, and a rollback to the last good one would mean a fair amount of data might be lost. So we spent 3 days doing arcane sql rituals to try to bring back the database, but it had corrupted itself too heavily and we had to roll back.&lt;br /&gt;&lt;br /&gt;The cause of the database crash and corruption was a "wraparound" error. Postgres requires regular "vacuuming" to clean up after itself; otherwise its transaction ID counter eventually wraps around and old rows effectively disappear. This crash took us by surprise: not only do we have postgres looking after itself with optimised auto-vacuuming occurring regularly, but during the 1.8 upgrade I took the time to do a manual full vacuum, only a week before the crash. Also, postgres is designed to freeze when it is at risk of a wraparound error rather than overwrite itself, and this didn't happen. The first we heard of it, pnfs and postgres had stopped responding and there were wraparound error messages in the logs, with no warning of the impending disaster.&lt;br /&gt;&lt;br /&gt;Luckily the data rollback seems not to have affected the VOs too much. We had one ticket from Atlas, who, after we explained our situation to them, handily cleaned up their file catalogues. 
The guys over at dcache hinted at a possible way of rebuilding the lost databases from the pnfs logs, although sadly this isn't simply a case of recreating pnfs-related sql entries, and they've been too busy with Tier 1 support to look into this further.&lt;br /&gt;&lt;br /&gt;Since then we've fixed our backups and added a nagios test to ensure the backups are less than a day old. The biggest trouble here was that the reluctance to use an old backup meant we wasted over 3 days banging our heads trying to bring back a dead database, rather than the few hours it would have taken to restore from backup and verify things were working. And it appears the experiments were more affected by us being in downtime than by the loss of easily replicable data. In the end I think I caused more trouble by going over the top on my data recovery attempts than if I had been gung ho and used the old backup once things looked a bit bleak for the remains of the postgres database. At least we've now set things up so the likelihood of it happening again is slim, but the circumstances behind the original database errors are still unknown, which leaves me a little worried.&lt;br /&gt;&lt;br /&gt;Have a good Winter Festival and Holiday Season everyone, but before you head off to your warm fires and cold beers check the age of your backups just in case...&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-3334257625993525859?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/3334257625993525859/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=3334257625993525859' title='0 Comments'/><link rel='edit' type='application/atom+xml' 
href='http://www.blogger.com/feeds/4670756400590062347/posts/default/3334257625993525859'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/3334257625993525859'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2007/12/lancasters-winter-dcache-dramas.html' title='Lancaster&apos;s Winter Dcache Dramas'/><author><name>Matt</name><uri>http://www.blogger.com/profile/16514316941712112895</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-5135587615688705041</id><published>2007-12-06T17:30:00.000Z</published><updated>2007-12-07T10:39:53.152Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='Manchester'/><title type='text'>Manchester various</title><content type='html'>- core path was set to /tmp/core-various-param in sysctl.conf and was creating a lot of problems to dzero jobs. It was also creating problems to others as they were filling /tmp and consequently maradona errors were looming around. The path has been changed back to the default and also I set core size 0 in limits.conf to prevent any other problem repeating itself with a lesser degree in /scratch.&lt;br /&gt;&lt;br /&gt;- dcache doors were open on the wrong nodes. node_config is the correct one but it was copied before stopping dcache-core service and now /etc/init.d/dcache-core stop doesn't have any effect. 
The doors also have a keep-alive script, so it is not enough to kill the java processes: one also has to kill the parents.&lt;br /&gt;&lt;br /&gt;- cfengine config files are being rewritten to make them less cryptic.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-5135587615688705041?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/5135587615688705041/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=5135587615688705041' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/5135587615688705041'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/5135587615688705041'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2007/12/manchester-various.html' title='Manchester various'/><author><name>Alessandra Forti</name><uri>http://www.blogger.com/profile/11973932320387024088</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://4.bp.blogspot.com/_nB3QAxoLs5o/SUfJ_bcRvXI/AAAAAAAAARk/nlOg9eaXhaY/S220/patyten_seaOttersSwim.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-7791291352454894228</id><published>2007-11-19T16:05:00.000Z</published><updated>2007-11-19T16:09:38.525Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='Manchester'/><title type='text'>Manchester black holes for atlas</title><content type='html'>Atlas jobs are failing because of the following errors:&lt;br /&gt;====================================================&lt;br /&gt;All jobs fail 
because 2 bad nodes fail like&lt;span style="font-family:monospace;"&gt;&lt;br /&gt;&lt;/span&gt;/opt/globus/bin/globus-gass-cache: line 4: globus_source: command not found&lt;span style="font-family:monospace;"&gt;&lt;br /&gt;&lt;/span&gt;/opt/globus/bin/globus-gass-cache: line 6: /globus-gass-cache-util.pl: No such file or directory&lt;span style="font-family:monospace;"&gt;&lt;br /&gt;&lt;/span&gt;/opt/globus/bin/globus-gass-cache: line 6: exec: /globus-gass-cache-util.pl: cannot execute: No such file or directory&lt;span style="font-family:monospace;"&gt;&lt;br /&gt;&lt;/span&gt;/opt/globus/bin/globus-gass-cache: line 4: globus_source: command not found&lt;span style="font-family:monospace;"&gt;&lt;br /&gt;&lt;/span&gt;/opt/globus/bin/globus-gass-cache: line 6: /globus-gass-cache-util.pl: No such file or directory&lt;span style="font-family:monospace;"&gt;&lt;br /&gt;&lt;/span&gt;/opt/globus/bin/globus-gass-cache: line 6: exec: /globus-gass-cache-util.pl: cannot execute: No such file or directory&lt;span style="font-family:monospace;"&gt;&lt;br /&gt;&lt;/span&gt;submit-helper script running on host bohr1428 gave error: could not add entry in the local gass cache for stdout&lt;br /&gt;===================================================&lt;br /&gt;Problem caused by&lt;br /&gt;&lt;br /&gt;${GLOBUS_LOCATION}/libexec/globus-script-initializer&lt;br /&gt;&lt;br /&gt;being empty.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-7791291352454894228?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/7791291352454894228/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=7791291352454894228' title='0 Comments'/><link rel='edit' type='application/atom+xml' 
href='http://www.blogger.com/feeds/4670756400590062347/posts/default/7791291352454894228'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/7791291352454894228'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2007/11/manchester-black-holes-for-atlas.html' title='Manchester black holes for atlas'/><author><name>Alessandra Forti</name><uri>http://www.blogger.com/profile/11973932320387024088</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://4.bp.blogspot.com/_nB3QAxoLs5o/SUfJ_bcRvXI/AAAAAAAAARk/nlOg9eaXhaY/S220/patyten_seaOttersSwim.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-3341343307446379908</id><published>2007-11-13T10:21:00.002Z</published><updated>2007-11-13T10:35:12.441Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='NorthGrid'/><title type='text'>Some SAM tests don't respect downtime</title><content type='html'>Sheffield is shown to fail the CE-host-cert-valid test while in downtime. SAM tests should all behave the same. This is on top of the very confusing display of the results in alternate lines. 
I opened a ticket.&lt;br /&gt;&lt;a href="https://gus.fzk.de/ws/ticket_info.php?ticket=28983"&gt;&lt;br /&gt;https://gus.fzk.de/ws/ticket_info.php?ticket=28983&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-3341343307446379908?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/3341343307446379908/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=3341343307446379908' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/3341343307446379908'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/3341343307446379908'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2007/11/some-sam-tests-dont-respect-downtime_1403.html' title='Some SAM tests don&apos;t respect downtime'/><author><name>Alessandra Forti</name><uri>http://www.blogger.com/profile/11973932320387024088</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://4.bp.blogspot.com/_nB3QAxoLs5o/SUfJ_bcRvXI/AAAAAAAAARk/nlOg9eaXhaY/S220/patyten_seaOttersSwim.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-8156510378529761080</id><published>2007-11-11T17:50:00.000Z</published><updated>2007-11-11T18:02:30.519Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='Manchester'/><title type='text'>Manchester sw repository reorganised</title><content type='html'>To simplify the maintenance of multiple releases and architectures, I reorganised the software (yum) repository in Manchester.&lt;br 
/&gt;&lt;br /&gt;While before we had to maintain a yum.conf for each release and architecture now we just need to add links to the right place. I wrote a recipe  on my favourite site:&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.sysadmin.hep.ac.uk/wiki/Yum"&gt;http://www.sysadmin.hep.ac.uk/wiki/Yum&lt;br /&gt;&lt;/a&gt;&lt;br /&gt;This will allow to remove also the complications introduced in cfengine conf files to maintain multiple yum.conf versions.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-8156510378529761080?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/8156510378529761080/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=8156510378529761080' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/8156510378529761080'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/8156510378529761080'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2007/11/manchester-sw-repository-reorganised.html' title='Manchester sw repository reorganised'/><author><name>Alessandra Forti</name><uri>http://www.blogger.com/profile/11973932320387024088</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://4.bp.blogspot.com/_nB3QAxoLs5o/SUfJ_bcRvXI/AAAAAAAAARk/nlOg9eaXhaY/S220/patyten_seaOttersSwim.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-596342704813078083</id><published>2007-11-09T17:02:00.000Z</published><updated>2007-11-09T17:19:03.955Z</updated><category 
scheme='http://www.blogger.com/atom/ns#' term='Lancaster'/><category scheme='http://www.blogger.com/atom/ns#' term='firewall'/><title type='text'>1,000,000th job has passed by</title><content type='html'>This week the batch system ticked over its millionth job. The lucky user was biomed005, and no, it was nothing to do with rsa768. The 0th job was way back in August 2005 when we replaced the old batch system with torque. How many of these million were successful? I shudder to think but I'm sure it's improving :-)&lt;br /&gt;&lt;br /&gt;In other news, we're having big problems with our campus firewall: it blocks outgoing ports 80 and 443 to ensure that traffic passes through the university proxy server. Unfortunately some web clients such as wget and curl make it impossible to use the proxy for these ports whilst bypassing the proxy for all other ports. Atlas need this with the new PANDA pilot job framework. We installed a squid proxy of our own (good idea Graeme), which allows for greater control. No luck with handling https traffic, so we really need to get a hole punched in the campus firewall. 
I'm confident the uni systems guys will oblige ;-)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-596342704813078083?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/596342704813078083/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=596342704813078083' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/596342704813078083'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/596342704813078083'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2007/11/1000000th-job-has-passed-by.html' title='1,000,000th job has passed by'/><author><name>Peter</name><uri>http://www.blogger.com/profile/05855046025692405834</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-8448111696634223523</id><published>2007-11-09T10:40:00.000Z</published><updated>2007-11-09T10:43:31.951Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='Sheffield'/><title type='text'>Sheffield in downtime</title><content type='html'>Sheffield has been put in downtime until Monday 12/11/2007 at 5 pm.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Reason: Power cut affecting much of central Sheffield.  Substation exploded.  
Not even allowed inside the physics building.&lt;br /&gt;&lt;br /&gt;Matt is also back in the GOCDB now as site admin.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-8448111696634223523?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/8448111696634223523/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=8448111696634223523' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/8448111696634223523'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/8448111696634223523'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2007/11/sheffield-in-downtime.html' title='Sheffield in downtime'/><author><name>Alessandra Forti</name><uri>http://www.blogger.com/profile/11973932320387024088</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://4.bp.blogspot.com/_nB3QAxoLs5o/SUfJ_bcRvXI/AAAAAAAAARk/nlOg9eaXhaY/S220/patyten_seaOttersSwim.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-3417042113115481284</id><published>2007-11-03T12:08:00.000Z</published><updated>2007-11-03T13:22:54.248Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='Manchester'/><title type='text'>Manchester CEs and RGMA problems</title><content type='html'>I still don't know what happened to ce02, why the ops tests didn't work, or why my jobs hung forever while everybody else's could run (atlas claims 88 percent efficiency in the last 24 hours). 
Anyway I updated ce02 manually (rpm -ihv) to the same set of rpms that are on ce01 and now the problem I had, globus hanging, has disappeared. The ops tests are successful again and we got out of the atlas blacklisting. I also fixed ce01, which yesterday picked up the wrong java version. I need to change a couple of things on the kickstart server so that these incidents don't happen again.&lt;br /&gt;&lt;br /&gt;Also I had to manually kill tomcat, which was not responding, and restart it on the MON box. Accounting published successfully after this.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-3417042113115481284?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/3417042113115481284/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=3417042113115481284' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/3417042113115481284'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/3417042113115481284'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2007/11/manchester-ces-problems.html' title='Manchester CEs and RGMA problems'/><author><name>Alessandra Forti</name><uri>http://www.blogger.com/profile/11973932320387024088</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' 
src='http://4.bp.blogspot.com/_nB3QAxoLs5o/SUfJ_bcRvXI/AAAAAAAAARk/nlOg9eaXhaY/S220/patyten_seaOttersSwim.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-1292127778341535523</id><published>2007-11-02T13:59:00.000Z</published><updated>2007-11-02T14:00:40.012Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='Sheffield'/><title type='text'>Sheffield accounting</title><content type='html'>From Matt:&lt;br /&gt;&lt;br /&gt;/opt/glite/bin/apel-pbs-log-parser&lt;br /&gt;is trying to contact the ce on 2170, I think expecting the site bdii to be there.&lt;br /&gt;I changed &amp;lt;GII&amp;gt;ce_node&amp;lt;/GII&amp;gt; to &amp;lt;GII&amp;gt;mon_node&amp;lt;/GII&amp;gt; in /opt/glite/etc/glite-apel-pbs/parser-config-yaim.xml&lt;br /&gt;and now things seem much improved.&lt;br /&gt;&lt;br /&gt;However, I am getting this&lt;br /&gt;&lt;br /&gt;Fri Nov  2 13:48:39 UTC 2007: apel-publisher - Record/s found: 8539&lt;br /&gt;Fri Nov  2 13:48:39 UTC 2007: apel-publisher - Checking Archiver is Online&lt;br /&gt;Fri Nov  2 13:49:40 UTC 2007: apel-publisher - Unable to retrieve any response while querying the GOC&lt;br /&gt;Fri Nov  2 13:49:40 UTC 2007: apel-publisher - Archiver Not Responding: Please inform &lt;a class="moz-txt-link-abbreviated" href="mailto:apel-support@listserv.cclrc.ac.uk"&gt;apel-support@listserv.cclrc.ac.uk&lt;/a&gt;&lt;br /&gt;Fri Nov  2 13:49:40 UTC 2007: apel-publisher - WARNING - Received a 'null' result set while querying the 'LcgRecords' table using rgma, this probably means the GOC is currently off-line, will therefore cancel attempt to re-publish&lt;br /&gt;&lt;br /&gt;running /opt/glite/bin/apel-publisher on the mon box.&lt;br /&gt;&lt;br /&gt;If the GOC machine is really off-line, I'll have to wait to publish the missing data for Sheffield.&lt;div 
class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-1292127778341535523?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/1292127778341535523/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=1292127778341535523' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/1292127778341535523'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/1292127778341535523'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2007/11/sheffield-accounting.html' title='Sheffield accounting'/><author><name>Alessandra Forti</name><uri>http://www.blogger.com/profile/11973932320387024088</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://4.bp.blogspot.com/_nB3QAxoLs5o/SUfJ_bcRvXI/AAAAAAAAARk/nlOg9eaXhaY/S220/patyten_seaOttersSwim.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-8937973457897217175</id><published>2007-11-02T12:27:00.000Z</published><updated>2007-11-02T17:01:48.818Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='Manchester'/><category scheme='http://www.blogger.com/atom/ns#' term='SL4'/><title type='text'>Manchester SL4</title><content type='html'>Not big news for other sites but I have installed an SL4 UI in Manchester. Still 32bit because the UIs at the Tier2 are old machines. 
However I'd like to express my relief that once the missing 3rd party rpms were in place the installation went smoothly.&lt;br /&gt;&lt;br /&gt;After some struggling with cfengine keys, which I was despairing of solving by the end of the evening, I also managed to install a 32bit WN. At least cfengine doesn't give any more errors and runs happily.&lt;br /&gt;&lt;br /&gt;Tackling dcache now and the new yaim structure.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-8937973457897217175?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/8937973457897217175/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=8937973457897217175' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/8937973457897217175'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/8937973457897217175'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2007/11/manchester-sl4.html' title='Manchester SL4'/><author><name>Alessandra Forti</name><uri>http://www.blogger.com/profile/11973932320387024088</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://4.bp.blogspot.com/_nB3QAxoLs5o/SUfJ_bcRvXI/AAAAAAAAARk/nlOg9eaXhaY/S220/patyten_seaOttersSwim.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-6204767751270509433</id><published>2007-11-01T15:16:00.000Z</published><updated>2007-11-02T12:27:14.784Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='Sheffield'/><title 
type='text'>Sheffield</title><content type='html'>Quiet night for Sheffield after reimaged nodes were taken offline in PBS. Matt also increased the number of ssh connections allowed on the CE from 10 to 100 to reduce the timeouts between the WN and CE and reduce the incidence of Maradona errors.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-6204767751270509433?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/6204767751270509433/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=6204767751270509433' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/6204767751270509433'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/6204767751270509433'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2007/11/sheffield.html' title='Sheffield'/><author><name>Alessandra Forti</name><uri>http://www.blogger.com/profile/11973932320387024088</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://4.bp.blogspot.com/_nB3QAxoLs5o/SUfJ_bcRvXI/AAAAAAAAARk/nlOg9eaXhaY/S220/patyten_seaOttersSwim.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-4266876643742276821</id><published>2007-10-31T17:43:00.000Z</published><updated>2007-11-01T15:16:17.492Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='Manchester'/><title type='text'>Manchester hat tricks</title><content type='html'>Manchester CE ce02 has been blacklisted by atlas since yesterday because it fails&lt;br 
/&gt;the ops tests and therefore it is also failing Steve Lloyd's tests and has availability 0. However there is no apparent reason why these tests should fail. Besides, ce02 is doing some magic: there were 576 jobs running from 5 different VOs when I started writing this, among which atlas production jobs, and now, 12 hours later, there are 1128. I'm baffled.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-4266876643742276821?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/4266876643742276821/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=4266876643742276821' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/4266876643742276821'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/4266876643742276821'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2007/10/manchester-hat-tricks.html' title='Manchester hat tricks'/><author><name>Alessandra Forti</name><uri>http://www.blogger.com/profile/11973932320387024088</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://4.bp.blogspot.com/_nB3QAxoLs5o/SUfJ_bcRvXI/AAAAAAAAARk/nlOg9eaXhaY/S220/patyten_seaOttersSwim.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-7129135390011762535</id><published>2007-10-30T17:00:00.001Z</published><updated>2007-10-31T14:18:34.794Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='VOMS'/><title type='text'>Regional VOs</title><content 
type='html'>vo.northgrid.ac.uk&lt;br /&gt;vo.southgrid.ac.uk&lt;br /&gt;&lt;br /&gt;Both have been created, with no users in them yet. We probably need to enable them at sites to make more progress.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-7129135390011762535?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/7129135390011762535/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=7129135390011762535' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/7129135390011762535'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/7129135390011762535'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2007/10/regional-vos.html' title='Regional VOs'/><author><name>Alessandra Forti</name><uri>http://www.blogger.com/profile/11973932320387024088</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://4.bp.blogspot.com/_nB3QAxoLs5o/SUfJ_bcRvXI/AAAAAAAAARk/nlOg9eaXhaY/S220/patyten_seaOttersSwim.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-4753186457597735267</id><published>2007-10-30T16:40:00.000Z</published><updated>2007-10-30T16:47:10.289Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='NorthGrid'/><title type='text'>user certificates: p12 to pem</title><content type='html'>Since I was renewing my certificate I added a small script (&lt;a href="http://www.sysadmin.hep.ac.uk/svn/fabric-management/certificates/x509/p12topem.sh"&gt;p12topem.sh&lt;/a&gt;) to the 
subversion repository to convert users' p12 certificates into pem format and set their unix permissions correctly. I linked it from here:&lt;br /&gt;&lt;br /&gt;&lt;a href="https://www.sysadmin.hep.ac.uk/wiki/CA_Certificates_Maintenance"&gt;https://www.sysadmin.hep.ac.uk/wiki/CA_Certificates_Maintenance&lt;br /&gt;&lt;/a&gt;&lt;br /&gt;It assumes $HOME/.globus/user*.pem names. It therefore doesn't handle host certificates, but could easily be extended.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-4753186457597735267?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/4753186457597735267/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=4753186457597735267' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/4753186457597735267'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/4753186457597735267'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2007/10/user-certificates-p12-to-pem.html' title='user certificates: p12 to pem'/><author><name>Alessandra Forti</name><uri>http://www.blogger.com/profile/11973932320387024088</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://4.bp.blogspot.com/_nB3QAxoLs5o/SUfJ_bcRvXI/AAAAAAAAARk/nlOg9eaXhaY/S220/patyten_seaOttersSwim.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-6552121364990975673</id><published>2007-10-29T16:52:00.000Z</published><updated>2007-10-29T17:05:03.958Z</updated><category 
scheme='http://www.blogger.com/atom/ns#' term='NorthGrid'/><title type='text'>Links to monitoring pages update</title><content type='html'>I added three links to the FCR, one per experiment, with all the UK sites selected. Hopefully it will make it easier to find out who has been blacklisted.&lt;br /&gt;&lt;a href="http://www.gridpp.ac.uk/wiki/Links_Monitoring_pages"&gt;&lt;br /&gt;http://www.gridpp.ac.uk/wiki/Links_Monitoring_pages&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;I also added a GridMap link and linked Steve's monitoring, both generic dteam and atlas, plus the quarterly summary plots.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-6552121364990975673?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/6552121364990975673/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=6552121364990975673' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/6552121364990975673'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/6552121364990975673'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2007/10/links-to-monitoring-pages-update.html' title='Links to monitoring pages update'/><author><name>Alessandra Forti</name><uri>http://www.blogger.com/profile/11973932320387024088</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' 
src='http://4.bp.blogspot.com/_nB3QAxoLs5o/SUfJ_bcRvXI/AAAAAAAAARk/nlOg9eaXhaY/S220/patyten_seaOttersSwim.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-2968640197907471593</id><published>2007-10-19T15:10:00.000Z</published><updated>2007-10-29T17:06:33.301Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='Sheffield'/><title type='text'>Sheffield latest</title><content type='html'>Trying to stabilize the Sheffield cluster.&lt;br /&gt;After the scheduled power outage the nodes didn't restart properly and some of the old jobs needed to be cleaned up. After that the cluster was ok apart from the BDII dropping out. We have applied the famous Kostas patch&lt;br /&gt;&lt;a href="https://savannah.cern.ch/bugs/?16625"&gt;&lt;br /&gt;https://savannah.cern.ch/bugs/?16625&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;which is getting into the release after 1.5 years. Hurray!!!&lt;br /&gt;&lt;br /&gt;The stability of the BDII has improved and DPM seems stable. The SAM tests have been stable over the weekend and today Steve's Atlas tests showed a 96% availability, which is a big improvement. However the cluster filled up this morning and the instability reappeared, a sign that there is still something to fix on the worker nodes and in the scheduling. 
Added a reservation for ops and am looking at the WNs, some of which were re-imaged this morning.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-2968640197907471593?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/2968640197907471593/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=2968640197907471593' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/2968640197907471593'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/2968640197907471593'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2007/10/sheffield-latest.html' title='Sheffield latest'/><author><name>Alessandra Forti</name><uri>http://www.blogger.com/profile/11973932320387024088</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://4.bp.blogspot.com/_nB3QAxoLs5o/SUfJ_bcRvXI/AAAAAAAAARk/nlOg9eaXhaY/S220/patyten_seaOttersSwim.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-1892338902174403122</id><published>2007-10-18T14:45:00.000Z</published><updated>2007-10-18T14:46:21.645Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='Manchester'/><title type='text'>Manchester availability</title><content type='html'>SAM tests, both ops and atlas, were failing due to dcache problems. Part of it was due to the fact that Judit has changed her DN and somehow the cron job to build the dcache kpwd file wasn't working. 
In addition to that dcache02 had to be restarted (both core and pnfs). As usual it started to work again after that, with no apparent reason why it failed in the first place. gPlazma is not enabled yet.&lt;br /&gt;&lt;br /&gt;That's mostly the reason for the drop in October.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-1892338902174403122?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/1892338902174403122/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=1892338902174403122' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/1892338902174403122'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/1892338902174403122'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2007/10/manchester-availability.html' title='Manchester availability'/><author><name>Alessandra Forti</name><uri>http://www.blogger.com/profile/11973932320387024088</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://4.bp.blogspot.com/_nB3QAxoLs5o/SUfJ_bcRvXI/AAAAAAAAARk/nlOg9eaXhaY/S220/patyten_seaOttersSwim.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-3277756478177031225</id><published>2007-10-15T11:20:00.000Z</published><updated>2007-10-15T11:42:18.193Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='Lancaster'/><category scheme='http://www.blogger.com/atom/ns#' term='NorthGrid'/><category scheme='http://www.blogger.com/atom/ns#' term='Availability'/><title 
type='text'>Availability update</title><content type='html'>Lancs site availability looks OK for the last month at 94%, which is 13% above the GridPP average, and this includes a couple of weekends lost due to dCache problems.  The record from July-September has been updated on &lt;a href="http://www.gridpp.ac.uk/wiki/SAM_availability:_July-September_2007#UKI-NORTHGRID-LANCS-HEP"&gt;Jeremy's page&lt;/a&gt;. We still get the occasional failed SAM submission; no idea what causes these but they serve to keep the availability from reaching the high nineties.&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;The June-July instability was a dCache issue with the pnfs mount options; this only affected SAM tests where files were created and immediately removed. &lt;/li&gt;&lt;li&gt; The mid-August problems came from the SL4 upgrade, caused by a few blackhole WNs. This was tracked to the jpackage repository being down, which screwed with the auto-install of some WNs. &lt;/li&gt;&lt;li&gt; The mid-September problems were caused by adding a new dCache pool; we are not bringing it online until the issue is understood. 
&lt;/li&gt;&lt;/ul&gt;Job slot occupancy looks ok, non-HEP VOs like fusion and biomed helping to fill slots left by moderate production by Atlas.&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_mw1Yh1GNWVc/RxNRa8yWvMI/AAAAAAAAAAs/wbKsIJP7Hmo/s1600-h/last-month.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://3.bp.blogspot.com/_mw1Yh1GNWVc/RxNRa8yWvMI/AAAAAAAAAAs/wbKsIJP7Hmo/s400/last-month.png" alt="" id="BLOGGER_PHOTO_ID_5121526724686167234" border="0" /&gt;&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-3277756478177031225?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/3277756478177031225/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=3277756478177031225' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/3277756478177031225'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/3277756478177031225'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2007/10/availability-update.html' title='Availability update'/><author><name>Peter</name><uri>http://www.blogger.com/profile/05855046025692405834</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/_mw1Yh1GNWVc/RxNRa8yWvMI/AAAAAAAAAAs/wbKsIJP7Hmo/s72-c/last-month.png' height='72' 
width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-5456684677206841141</id><published>2007-10-12T09:40:00.001Z</published><updated>2007-10-12T13:17:22.150Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='NorthGrid'/><title type='text'>Sys Admin Requests wiki pages</title><content type='html'>YAIM has a new wiki page for sys admin requests. Maria has sent an announcement to LCG-ROLLOUT. For bookkeeping, I added a link and explanations to the sys admin wiki wishlist page, which also links to the ROC admins' management tools requests.&lt;br /&gt;&lt;br /&gt;&lt;a class="moz-txt-link-freetext" href="http://www.sysadmin.hep.ac.uk/wiki/Wishlist"&gt;http://www.sysadmin.hep.ac.uk/wiki/Wishlist&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-5456684677206841141?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/5456684677206841141/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=5456684677206841141' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/5456684677206841141'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/5456684677206841141'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2007/10/sys-admin-requests-wiki-pages_12.html' title='Sys Admin Requests wiki pages'/><author><name>Alessandra Forti</name><uri>http://www.blogger.com/profile/11973932320387024088</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' 
src='http://4.bp.blogspot.com/_nB3QAxoLs5o/SUfJ_bcRvXI/AAAAAAAAARk/nlOg9eaXhaY/S220/patyten_seaOttersSwim.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-8054059540448291925</id><published>2007-10-09T15:49:00.000Z</published><updated>2007-10-10T13:02:14.518Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='NorthGrid'/><title type='text'>BDII doc page</title><content type='html'>After the trouble Sheffield went through with the BDII I started a BDII page in the sysadmin wiki.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.sysadmin.hep.ac.uk/wiki/BDII"&gt;http://www.sysadmin.hep.ac.uk/wiki/BDII&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-8054059540448291925?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/8054059540448291925/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=8054059540448291925' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/8054059540448291925'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/8054059540448291925'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2007/10/bdii-page.html' title='BDII doc page'/><author><name>Alessandra Forti</name><uri>http://www.blogger.com/profile/11973932320387024088</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' 
src='http://4.bp.blogspot.com/_nB3QAxoLs5o/SUfJ_bcRvXI/AAAAAAAAARk/nlOg9eaXhaY/S220/patyten_seaOttersSwim.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-394339168866817901</id><published>2007-10-08T15:52:00.000Z</published><updated>2007-10-10T13:02:51.099Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='NorthGrid'/><title type='text'>Manchester RGMA fixed</title><content type='html'>Fixed RGMA in Manchester. It had, for still obscure reasons, wrong permissions on the host key files. Started a RGMA troubleshooting page on sysadmin wiki:&lt;br /&gt;&lt;a href="http://www.sysadmin.hep.ac.uk/wiki/RGMA#RGMA_Errors"&gt;&lt;br /&gt;&lt;/a&gt;&lt;a href="http://www.sysadmin.hep.ac.uk/wiki/RGMA#RGMA"&gt;http://www.sysadmin.hep.ac.uk/wiki/RGMA#RGMA&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-394339168866817901?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/394339168866817901/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=394339168866817901' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/394339168866817901'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/394339168866817901'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2007/10/manchester-rgma-fixed.html' title='Manchester RGMA fixed'/><author><name>Alessandra Forti</name><uri>http://www.blogger.com/profile/11973932320387024088</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' 
height='24' src='http://4.bp.blogspot.com/_nB3QAxoLs5o/SUfJ_bcRvXI/AAAAAAAAARk/nlOg9eaXhaY/S220/patyten_seaOttersSwim.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-4868254560457785787</id><published>2007-10-08T13:56:00.000Z</published><updated>2007-10-08T14:04:15.396Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='NorthGrid'/><title type='text'>EGEE '07</title><content type='html'>EGEE conference. I gave a talk in the SA1-JRA1 session, which seems to have had a positive result and will hopefully have some follow-up.&lt;br /&gt;&lt;br /&gt;The talk can be found at&lt;br /&gt;&lt;a href="http://indico.cern.ch/materialDisplay.py?contribId=30&amp;amp;sessionId=49&amp;amp;materialId=slides&amp;amp;confId=18714"&gt;&lt;br /&gt;http://indico.cern.ch/materialDisplay.py?contribId=30&amp;amp;sessionId=49&amp;amp;materialId=slides&amp;amp;confId=18714&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;and is the sibling of the one we gave in Stockholm at the Ops workshop on problems with SA3 and within SA1.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://indico.cern.ch/contributionDisplay.py?contribId=25&amp;amp;confId=12807"&gt;http://indico.cern.ch/contributionDisplay.py?contribId=25&amp;amp;confId=12807&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;which had some follow-up with SA3 that can be found here&lt;br /&gt;&lt;a href="https://savannah.cern.ch/task/?5267"&gt;&lt;br /&gt;https://savannah.cern.ch/task/?5267&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-4868254560457785787?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/4868254560457785787/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=4868254560457785787' 
title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/4868254560457785787'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/4868254560457785787'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2007/10/egee-conference.html' title='EGEE &apos;07'/><author><name>Alessandra Forti</name><uri>http://www.blogger.com/profile/11973932320387024088</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://4.bp.blogspot.com/_nB3QAxoLs5o/SUfJ_bcRvXI/AAAAAAAAARk/nlOg9eaXhaY/S220/patyten_seaOttersSwim.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-7640839514295653963</id><published>2007-10-08T10:10:00.000Z</published><updated>2007-10-08T12:46:31.133Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='Sheffield'/><title type='text'>Its alive !</title><content type='html'>Sheffield problems reviewed&lt;br /&gt;&lt;br /&gt;During the last update to the DPM I ran into several problems.&lt;br /&gt;&lt;br /&gt;1.  The DPM update failed due to changes in the way passwords are stored in MySQL&lt;br /&gt;2.  A misunderstanding with the new version of YAIM that rolled out at the same time&lt;br /&gt;3.  Config errors with the sBDII&lt;br /&gt;4.  mds-vo-name&lt;br /&gt;5.  Too many rollouts in one go for me to have a clue which one broke and where to start looking.&lt;br /&gt;&lt;br /&gt;DPM update fails&lt;br /&gt;I would like to thank Graeme for the great update instructions; they helped a lot.  The problems came when the update script used a different hashing method from the one used by MySQL; the problem is described here&lt;br /&gt;http://&lt;&gt;.  
This took some finding. It also means that every time we run the YAIM config on the SE we have to go back and fix the passwords again, because YAIM still uses the old hash, not the new one.&lt;br /&gt;&lt;br /&gt;YAIM update halfway and config errors&lt;br /&gt;This confused the hell out of me: one minute I was using YAIM scripts to run updates, the next I had an updated version of YAIM that I had to pass flags to. I guess this is where I started to make the mistakes that led to me setting the SE up as an sBDII.  After getting lost with the new YAIM, I told the wrong machine that it was an sBDII and never realised.&lt;br /&gt;&lt;br /&gt;mds-vo-name&lt;br /&gt;With the help of Henry, we found out that our information was wrong, i.e. we had&lt;br /&gt;&lt;br /&gt;mds-vo-name=local; it is now mds-vo-name=resource&lt;br /&gt;&lt;br /&gt;Once this was changed in site-info.def and YAIM was rerun on our MON box, which is also our sBDII, it all seemed to work.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-7640839514295653963?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/7640839514295653963/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=7640839514295653963' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/7640839514295653963'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/7640839514295653963'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2007/10/its-alive.html' title='Its alive 
!'/><author><name>Gerry</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-8868402898257370345</id><published>2007-09-25T08:27:00.000Z</published><updated>2007-10-08T12:46:48.531Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='Sheffield'/><title type='text'>Sheffield</title><content type='html'>Hi all&lt;br /&gt;&lt;br /&gt;Sorry it's been so quiet on the Sheffield front.  I've been out of the country, and it's currently registration here.&lt;br /&gt;&lt;br /&gt;What's the state of the LCG here?  I feel like I'm chasing my tail, hence there will shortly be a batch of emails asking for help from TB Support.  I have noticed there have been several updates while I've been away, so I will apply them before getting back to the main problem of why our SE seems not to have an entry in the BDII.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-8868402898257370345?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/8868402898257370345/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=8868402898257370345' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/8868402898257370345'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/8868402898257370345'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2007/09/sheffield.html' 
title='Sheffield'/><author><name>Gerry</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-2265922935051042930</id><published>2007-09-14T15:42:00.000Z</published><updated>2007-09-14T16:19:04.130Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='Lancaster'/><category scheme='http://www.blogger.com/atom/ns#' term='64bit'/><category scheme='http://www.blogger.com/atom/ns#' term='SL4'/><title type='text'>Latest on WN SL4/64 upgrade</title><content type='html'>I've created a &lt;a href="http://www.gridpp.ac.uk/wiki/SL4_additional_packages"&gt;gridpp wiki page&lt;/a&gt; which lists the cfengine config we're using to satisfy various VO requirements. Things have changed recently: Atlas no longer requires a 32-bit version of python to be installed, as it's now included in the KITS release. We still have build problems with release 12.0.6, as used by Steve's tests, so we would be interested to see how others get on with that. The Atlas experts advise a move to the 13.0.X branch. 
Atlas production looks healthy again with plenty of queued jobs for the weekend so hopefully smooth sailing from now on.&lt;br /&gt;&lt;br /&gt;Advice to Atlas sites upgrading to SL4/64:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Install packages listed here: &lt;a href="https://twiki.cern.ch/twiki/bin/view/Atlas/DAonSLC4"&gt;https://twiki.cern.ch/twiki/bin/view/Atlas/DAonSLC4&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;Expect failures when building code with release 12.0.6 &lt;/li&gt;&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-2265922935051042930?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/2265922935051042930/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=2265922935051042930' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/2265922935051042930'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/2265922935051042930'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2007/09/latest-on-wn-sl464-upgrade.html' title='Latest on WN SL4/64 upgrade'/><author><name>Peter</name><uri>http://www.blogger.com/profile/05855046025692405834</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-7238991579203870697</id><published>2007-08-26T13:29:00.000Z</published><updated>2007-08-26T13:38:30.163Z</updated><category scheme='http://www.blogger.com/atom/ns#' 
term='NorthGrid'/><category scheme='http://www.blogger.com/atom/ns#' term='Security'/><title type='text'>Some new links about security</title><content type='html'>This article is an interesting example of how even someone with very little experience can still do some basic forensics.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://blog.gnist.org/article.php?story=HollidayCracking"&gt;http://blog.gnist.org/article.php?story=HollidayCracking&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;I added the link under the forensic section on the sysadmin wiki&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.sysadmin.hep.ac.uk/wiki/Basic_Security#Forensic"&gt;http://www.sysadmin.hep.ac.uk/wiki/Basic_Security#Forensic&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;While I was at it, I added a firewall section to&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.sysadmin.hep.ac.uk/wiki/Grid_Security#Firewall_configuration_and_Services_ports"&gt;http://www.sysadmin.hep.ac.uk/wiki/Grid_Security#Firewall_configuration_and_Services_ports&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-7238991579203870697?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/7238991579203870697/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=7238991579203870697' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/7238991579203870697'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/7238991579203870697'/><link rel='alternate' type='text/html' 
href='http://northgrid-tech.blogspot.com/2007/08/basic-forensic-on-unixlinux.html' title='Some new links about security'/><author><name>Alessandra Forti</name><uri>http://www.blogger.com/profile/11973932320387024088</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://4.bp.blogspot.com/_nB3QAxoLs5o/SUfJ_bcRvXI/AAAAAAAAARk/nlOg9eaXhaY/S220/patyten_seaOttersSwim.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-254258828023421465</id><published>2007-08-26T12:35:00.000Z</published><updated>2007-08-26T13:08:43.608Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='NorthGrid'/><title type='text'>Dcache Troubleshooting page</title><content type='html'>&lt;a name="txn-10696" href="https://rt.tier2.hep.manchester.ac.uk/Ticket/Display.html?id=571#txn-10696"&gt;&lt;/a&gt;&lt;div class="messagebody"&gt; &lt;span style="color: rgb(0, 0, 0);"&gt;My tests on dcache started to fail for obscure reasons due to gsidcap doors misbehaving. 
I started a trouble shooting page for dcache&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.sysadmin.hep.ac.uk/wiki/DCache_Troubleshooting"&gt;http://www.sysadmin.hep.ac.uk/wiki/DCache_Troubleshooting&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-254258828023421465?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/254258828023421465/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=254258828023421465' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/254258828023421465'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/254258828023421465'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2007/08/my-tests-on-dcache-started-to-fail-for.html' title='Dcache Troubleshooting page'/><author><name>Alessandra Forti</name><uri>http://www.blogger.com/profile/11973932320387024088</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://4.bp.blogspot.com/_nB3QAxoLs5o/SUfJ_bcRvXI/AAAAAAAAARk/nlOg9eaXhaY/S220/patyten_seaOttersSwim.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-8109868559179785385</id><published>2007-08-23T09:46:00.000Z</published><updated>2007-08-26T09:13:37.476Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='Manchester'/><title type='text'>Another biomed user banned</title><content type='html'>Manchester has banned another biomed user for filling /tmp and causing trouble 
for other users. I opened a ticket.&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(0, 0, 0);"&gt;&lt;a href="https://gus.fzk.de/ws/ticket_info.php?ticket=26147"&gt; https://gus.fzk.de/ws/ticket_info.php?ticket=26147&lt;br /&gt;&lt;/a&gt;&lt;br /&gt;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-8109868559179785385?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/8109868559179785385/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=8109868559179785385' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/8109868559179785385'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/8109868559179785385'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2007/08/another-biomed-user-banned.html' title='Another biomed user banned'/><author><name>Alessandra Forti</name><uri>http://www.blogger.com/profile/11973932320387024088</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://4.bp.blogspot.com/_nB3QAxoLs5o/SUfJ_bcRvXI/AAAAAAAAARk/nlOg9eaXhaY/S220/patyten_seaOttersSwim.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-4350456784834507830</id><published>2007-08-22T14:59:00.000Z</published><updated>2007-10-15T11:19:14.737Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='Lancaster'/><category scheme='http://www.blogger.com/atom/ns#' term='Lancaster SL4'/><category scheme='http://www.blogger.com/atom/ns#' term='64bit'/><category 
scheme='http://www.blogger.com/atom/ns#' term='SL4'/><title type='text'>WNs installed with SL4</title><content type='html'>Last Wednesday was a scheduled downtime for Lancaster in order to do the SL4 upgrade on the WNs, as well as some assorted spring cleaning of other services. We can safely say this was our worst upgrade experience so far: a single day turned into a three-day downtime. Fortunately for most, this was self-inflicted pain rather than middleware issues. The fabric stuff (PXE, kickstart) went fine; our main problem was getting consistent users.conf and groups.conf files for the YAIM configuration, especially with the pool sgm/prd accounts and the DNS-style VO names such as supernemo.vo.eu-egee.org. The &lt;a href="https://twiki.cern.ch/twiki/bin/view/LCG/YaimGuide311"&gt;latest YAIM 3.1 documentation&lt;/a&gt; provides a consistent description, but our CE still used the 3.0 version, so a few tweaks were needed (YAIM 3.1 has since been released for glite 3.0). Another issue was due to our wise old CE (lcg-CE) having a lot of crust from previous installations, in particular some environment variables which affected the YAIM configuration such that the newer vo.d/ files were not considered. 
Finally, we needed to ensure the new sgm/prd pool groups were added to the torque ACLs, but YAIM does a fine job with this should you choose to use it along with the _GROUP_ENABLE variables.&lt;br /&gt;&lt;br /&gt;Anyway, things look good again with many biomed jobs, some atlas, dzero, hone and even a moderate number of lhcb jobs, which were supposed to have issues with SL4.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_mw1Yh1GNWVc/RsxQMxfRg9I/AAAAAAAAAAk/68tcBkLkx00/s1600-h/torque-graph.php.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://3.bp.blogspot.com/_mw1Yh1GNWVc/RsxQMxfRg9I/AAAAAAAAAAk/68tcBkLkx00/s320/torque-graph.php.png" alt="" id="BLOGGER_PHOTO_ID_5101540658277090258" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;On the whole, the YAIM configuration went well, although the &lt;a href="https://cic.gridops.org/index.php?section=rc&amp;amp;page=supportedvos"&gt;VO information at CIC&lt;/a&gt; could still be improved with mapping requirements from VOMS groups to GID. 
LHCb provide a good example to other VOs, with explanations.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-4350456784834507830?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/4350456784834507830/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=4350456784834507830' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/4350456784834507830'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/4350456784834507830'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2007/08/wns-installed-with-sl4.html' title='WNs installed with SL4'/><author><name>Peter</name><uri>http://www.blogger.com/profile/05855046025692405834</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/_mw1Yh1GNWVc/RsxQMxfRg9I/AAAAAAAAAAk/68tcBkLkx00/s72-c/torque-graph.php.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-425010506027155771</id><published>2007-08-20T11:31:00.000Z</published><updated>2007-08-20T11:42:24.670Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='Sheffield'/><title type='text'>Week 33 start of 34</title><content type='html'>&lt;div style="text-align: left;"&gt;The main goings-on have been the DPM update; this was hampered by password version problems in MySQL, and was resolved with help from here 
http://www.digitalpeer.com/id/mysql .  More problems came with the change to the BDII: new firewall ports were opened and yet there was still no data coming out.&lt;br /&gt;&lt;br /&gt;The BDII was going to be fixed today; however, Sheffield has suffered several power cuts over the last 24 hours.  This has affected the whole of the LCG here; recovery work is ongoing.&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-425010506027155771?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/425010506027155771/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=425010506027155771' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/425010506027155771'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/425010506027155771'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2007/08/week-33-star-of-34.html' title='Week 33 start of 34'/><author><name>Gerry</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-8902356405134818601</id><published>2007-08-14T14:30:00.000Z</published><updated>2007-08-14T14:33:12.868Z</updated><category scheme='http://www.blogger.com/atom/ns#' 
term='NorthGrid'/><title type='text'>GOCDB3 permission denied</title><content type='html'>I can't edit NorthGrid sites anymore. I opened a ticket.&lt;br /&gt;&lt;br /&gt;https://gus.fzk.de/pages/ticket_details.php?ticket=25846&lt;br /&gt;&lt;br /&gt;I would be mildly curious to know if other people are experiencing the same or if I'm the only one.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-8902356405134818601?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/8902356405134818601/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=8902356405134818601' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/8902356405134818601'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/8902356405134818601'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2007/08/gocdb3-permission-denied.html' title='GOCDB3 permission denied'/><author><name>Alessandra Forti</name><uri>http://www.blogger.com/profile/11973932320387024088</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://4.bp.blogspot.com/_nB3QAxoLs5o/SUfJ_bcRvXI/AAAAAAAAARk/nlOg9eaXhaY/S220/patyten_seaOttersSwim.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-2602411908218470273</id><published>2007-08-13T16:52:00.000Z</published><updated>2007-08-13T16:55:09.925Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='Manchester'/><title type='text'></title><content type='html'>Manchester MON box 
overloaded again with a massive amount of CLOSE_WAIT connections.&lt;br /&gt;&lt;br /&gt;&lt;a href="https://gus.fzk.de/pages/ticket_details.php?ticket=25647"&gt;https://gus.fzk.de/pages/ticket_details.php?ticket=25647&lt;br /&gt;&lt;/a&gt;&lt;br /&gt;the problem seems to have been fixed but it affected the accounting for 2 or 3 days.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-2602411908218470273?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/2602411908218470273/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=2602411908218470273' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/2602411908218470273'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/2602411908218470273'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2007/08/manchester-mon-box-overloaded-again.html' title=''/><author><name>Alessandra Forti</name><uri>http://www.blogger.com/profile/11973932320387024088</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://4.bp.blogspot.com/_nB3QAxoLs5o/SUfJ_bcRvXI/AAAAAAAAARk/nlOg9eaXhaY/S220/patyten_seaOttersSwim.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-4988498573813017685</id><published>2007-08-13T16:34:00.001Z</published><updated>2007-08-17T09:58:36.486Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='NorthGrid'/><title type='text'>lcg_utils bug closed</title><content type='html'>Ticket about lcg_util bugs has been 
answered and closed&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(0, 0, 0);"&gt; https://gus.fzk.de/pages/ticket_details.php?ticket=25406&amp;from=allt&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;Correct version of rpms to install is&lt;br /&gt;&lt;br /&gt;[aforti@niels003 aforti]$ rpm -qa GFAL-client lcg_util&lt;br /&gt;lcg_util-1.5.1-1&lt;br /&gt;&lt;div style="text-align: left;"&gt;GFAL-client-1.9.0-2&lt;br /&gt;&lt;br /&gt;Update (2007/08/17): The problem was incorrect dependencies expressed in the meta rpms. Maarten opened a savannah bug.&lt;br /&gt;&lt;br /&gt;&lt;a href="https://savannah.cern.ch/bugs/?28738"&gt;https://savannah.cern.ch/bugs/?28738&lt;br /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-4988498573813017685?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/4988498573813017685/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=4988498573813017685' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/4988498573813017685'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/4988498573813017685'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2007/08/lcgutils-bug-closed.html' title='lcg_utils bug closed'/><author><name>Alessandra Forti</name><uri>http://www.blogger.com/profile/11973932320387024088</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' 
src='http://4.bp.blogspot.com/_nB3QAxoLs5o/SUfJ_bcRvXI/AAAAAAAAARk/nlOg9eaXhaY/S220/patyten_seaOttersSwim.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-4330365009330790284</id><published>2007-08-10T10:03:00.001Z</published><updated>2007-08-10T10:41:08.829Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='NorthGrid'/><title type='text'>Updating Glue schema</title><content type='html'>This is somewhat old news, as the request to update the BDII is a month old.&lt;br /&gt;&lt;br /&gt;To update the Glue schema you need to update the BDII on the BDII machine and on the CE and SE (dcache and classic). The DPM SE now uses BDII instead of globus-mds, so you should check the recipe for that.&lt;br /&gt;&lt;br /&gt;The first problem I found was that&lt;br /&gt;&lt;br /&gt;yum update glite-BDII&lt;br /&gt;&lt;br /&gt;updates only the meta-rpm, not its dependencies. Apparently it works for apt-get but not for yum. So if you use yum you have three alternatives:&lt;br /&gt;&lt;br /&gt;1) yum -y update and risk breaking your machine&lt;br /&gt;2) yum update and check each rpm&lt;br /&gt;3) Look up the list of rpms here&lt;br /&gt;&lt;br /&gt;http://glite.web.cern.ch/glite/packages/R3.0/deployment/glite-BDII/3.0.2-12/glite-BDII-3.0.2-12.html&lt;br /&gt;&lt;br /&gt;yum update &lt;rpm-list&gt;&lt;br /&gt;&lt;br /&gt;Reconfiguring the BDII doesn't pose a threat so you can&lt;br /&gt;&lt;br /&gt;cd &lt;yaim_home&gt;&lt;br /&gt;./scripts/configure_node &lt;your-site-info.def&gt; BDII_site&lt;br /&gt;&lt;br /&gt;On the CE and SE... you can upgrade the CE and SE and reconfigure the nodes. But I didn't want to do that because you never know what might happen, and with the farm full of jobs and the SE being dcache I don't see the point of risking it for a schema upgrade. 
So what follows is a simple recipe to upgrade the glue schema on CE and SE other than DPM without reconfiguring the nodes.&lt;span style="font-family:monospace;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;service globus-mds stop&lt;span style="font-family:monospace;"&gt;&lt;br /&gt;&lt;/span&gt;yum update glue-schema&lt;span style="font-family:monospace;"&gt;&lt;br /&gt;&lt;/span&gt;cd /opt/glue/schema&lt;span style="font-family:monospace;"&gt;&lt;br /&gt;&lt;/span&gt;ln -s openldap-2.0 ldap&lt;span style="font-family:monospace;"&gt;&lt;br /&gt;&lt;/span&gt;service globus-mds start&lt;br /&gt;&lt;br /&gt;To check that it worked:&lt;br /&gt;&lt;br /&gt;ps -afx -o etime,args | grep slapd&lt;br /&gt;&lt;br /&gt;If your BDII is not on the CE and you find slapd instances on ports 2171-2173, you are also running a site BDII on your CE; you should turn it off and remove it from the startup services.&lt;br /&gt;&lt;br /&gt;The ldap link is needed because the schema path has changed, and unless you want to edit the configuration file (/opt/globus/etc/grid-info-slapd.conf) the easiest thing is to add a link.&lt;br /&gt;&lt;br /&gt;Most of this is in this ticket&lt;br /&gt;&lt;br /&gt;https://gus.fzk.de/pages/ticket_details.php?ticket=24586&amp;amp;from=allt&lt;br /&gt;&lt;br /&gt;including where to find the new schema documentation.&lt;br /&gt;&lt;span style="font-family:Georgia,serif;"&gt;&lt;/span&gt;&lt;/your-site-info.def&gt;&lt;/yaim_home&gt;&lt;/rpm-list&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-4330365009330790284?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/4330365009330790284/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=4330365009330790284' 
title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/4330365009330790284'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/4330365009330790284'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2007/08/updating-glue-schema_10.html' title='Updating Glue schema'/><author><name>Alessandra Forti</name><uri>http://www.blogger.com/profile/11973932320387024088</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://4.bp.blogspot.com/_nB3QAxoLs5o/SUfJ_bcRvXI/AAAAAAAAARk/nlOg9eaXhaY/S220/patyten_seaOttersSwim.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-4052792555507334919</id><published>2007-08-09T10:02:00.000Z</published><updated>2007-08-10T10:39:39.263Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='Manchester'/><title type='text'>Documentation for Manchester local users</title><content type='html'>Yesterday at a meeting with Manchester users who tried to use the grid it turned out that what they missed most was a page collecting links to information scattered around the world (a common disease).  As a consequence we have started pages collecting information that helps local users use the grid.&lt;br /&gt;&lt;a href="https://www.gridpp.ac.uk/wiki/Manchester"&gt;&lt;br /&gt;https://www.gridpp.ac.uk/wiki/Manchester&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;The current links are of general use. 
Users will add their own personal tips and tricks later.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-4052792555507334919?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/4052792555507334919/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=4052792555507334919' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/4052792555507334919'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/4052792555507334919'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2007/08/documentation-for-manchester-local.html' title='Documentation for Manchester local users'/><author><name>Alessandra Forti</name><uri>http://www.blogger.com/profile/11973932320387024088</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://4.bp.blogspot.com/_nB3QAxoLs5o/SUfJ_bcRvXI/AAAAAAAAARk/nlOg9eaXhaY/S220/patyten_seaOttersSwim.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-3335402426505755126</id><published>2007-08-08T09:07:00.000Z</published><updated>2007-08-10T10:41:08.829Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='NorthGrid'/><title type='text'>How to check accounting is working properly</title><content type='html'>Obviously when you look at the accounting pages at the bottom there is a graph showing running VOs, but that is not straightforward. 
Two other ways are:&lt;br /&gt;&lt;br /&gt;The accounting enforcement page showing sites that are not publishing and for how many days they haven't published.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www3.egee.cesga.es/acctenfor"&gt;http://www3.egee.cesga.es/acctenfor&lt;br /&gt;&lt;/a&gt;&lt;br /&gt;which I linked from&lt;br /&gt;&lt;br /&gt;&lt;a href="https://www.gridpp.ac.uk/wiki/Links_Monitoring_pages#Accounting"&gt;https://www.gridpp.ac.uk/wiki/Links_Monitoring_pages#Accounting&lt;br /&gt;&lt;/a&gt;&lt;br /&gt;or you could set up &lt;a href="http://goc.grid.sinica.edu.tw/gocwiki/ApelFaq#head-eaa231e4aa4e33b7e6fc2dbe439042a75a704693"&gt;RSS feeds&lt;/a&gt;  as suggested in the Apel FAQ.&lt;br /&gt;&lt;br /&gt;I also created an Apel page with this information on the sysadmin wiki&lt;br /&gt;&lt;a href="http://www.sysadmin.hep.ac.uk/wiki/Apel"&gt;&lt;br /&gt;http://www.sysadmin.hep.ac.uk/wiki/Apel&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-3335402426505755126?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/3335402426505755126/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=3335402426505755126' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/3335402426505755126'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/3335402426505755126'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2007/08/how-to-check-accounting-is-working.html' title='How to check accounting is working properly'/><author><name>Alessandra 
Forti</name><uri>http://www.blogger.com/profile/11973932320387024088</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://4.bp.blogspot.com/_nB3QAxoLs5o/SUfJ_bcRvXI/AAAAAAAAARk/nlOg9eaXhaY/S220/patyten_seaOttersSwim.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-4943606530546024047</id><published>2007-08-06T12:07:00.001Z</published><updated>2007-08-10T10:39:39.264Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='Manchester'/><title type='text'>Progress on SL4</title><content type='html'>As part of our planned upgrade to SL4 at Manchester, we've been looking at getting dcache running. &lt;br /&gt;The biggest stumbling block is the lack of a glite-SE_dcache* profile. Luckily, it seems that all of the needed components apart from dcache-server are in the glite-WN profile. Even the GSIFtp Door appears to work.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-4943606530546024047?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/4943606530546024047/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=4943606530546024047' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/4943606530546024047'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/4943606530546024047'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2007/08/progress-on-sl4.html' title='Progress on SL4'/><author><name>Colin Morey</name><email>noreply@blogger.com</email><gd:image 
rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-152447326279180168</id><published>2007-08-03T11:04:00.000Z</published><updated>2007-08-10T17:07:55.991Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='Lancaster'/><title type='text'>Green fields of Lancaster</title><content type='html'>After sending the dcache problem the way of the dodo last week we've been enjoying 100% SAM test passes over the past 7 days. It's nice to have to do next to nothing to fill in your weekly report. Not a very exciting week otherwise, odd jobs and maintenance here and there. Our CE has been very busy the last week, which has caused occasional problems with the Steve Lloyd tests; we've had a few failures due to there being no job slots available, despite measures to prevent that. We'll see if we can improve things.&lt;br /&gt;&lt;br /&gt;We're gearing up for the SL4 move. After Monday's very useful Northgrid meeting at Sheffield we have a time frame for it: sometime during the week starting the 13th of August. We'll pin it down to an exact day at the start of the coming week. We've taken a worker offline as a guinea pig and will do hideous SL4 experiments on it. The whole site will be in downtime from 9 to 5 on the day we do the move; with luck we won't need that long, but we intend to use the time to upgrade the whole site (no SL3 kernels will be left within our domain). 
Lucky for us Manchester have offered to go first in Northgrid, so we'll have veterans of the SL4 upgrade nearby to call on for assistance.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-152447326279180168?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/152447326279180168/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=152447326279180168' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/152447326279180168'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/152447326279180168'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2007/08/green-fields-of-lancaster.html' title='Green fields of Lancaster'/><author><name>Matt</name><uri>http://www.blogger.com/profile/16514316941712112895</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-6591733793221447878</id><published>2007-08-02T16:44:00.001Z</published><updated>2007-08-10T10:41:08.830Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='NorthGrid'/><title type='text'>lcg-utils bugs</title><content type='html'>&lt;a href="https://gus.fzk.de/pages/ticket_details.php?ticket=25406"&gt;https://gus.fzk.de/pages/ticket_details.php?ticket=25406&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' 
src='https://blogger.googleusercontent.com/tracker/4670756400590062347-6591733793221447878?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/6591733793221447878/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=6591733793221447878' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/6591733793221447878'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/6591733793221447878'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2007/08/lcg-utils-bugs.html' title='lcg-utils bugs'/><author><name>Alessandra Forti</name><uri>http://www.blogger.com/profile/11973932320387024088</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://4.bp.blogspot.com/_nB3QAxoLs5o/SUfJ_bcRvXI/AAAAAAAAARk/nlOg9eaXhaY/S220/patyten_seaOttersSwim.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-1866841153333201496</id><published>2007-08-02T15:17:00.000Z</published><updated>2007-08-10T10:41:08.830Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='NorthGrid'/><title type='text'>Laptop reinstalled</title><content type='html'>EVO didn't work on my laptop. I reinstalled it with the latest version of Ubuntu and Java 1.6.0. It works now. 
To my great disappointment the Facebook aquarium still doesn't ;-)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4670756400590062347-1866841153333201496?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/1866841153333201496/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=1866841153333201496' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/1866841153333201496'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/1866841153333201496'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2007/08/laptop-reinstalled.html' title='Laptop reinstalled'/><author><name>Alessandra Forti</name><uri>http://www.blogger.com/profile/11973932320387024088</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://4.bp.blogspot.com/_nB3QAxoLs5o/SUfJ_bcRvXI/AAAAAAAAARk/nlOg9eaXhaY/S220/patyten_seaOttersSwim.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4670756400590062347.post-2958887613837854704</id><published>2007-08-02T15:06:00.000Z</published><updated>2007-08-10T10:39:39.264Z</updated><category scheme='http://www.blogger.com/atom/ns#' term='Manchester'/><title type='text'>Fixed Manchester accounting</title><content type='html'>&lt;a href="https://www.ggus.org/pages/ticket_details.php?ticket=25215"&gt;https://www.ggus.org/pages/ticket_details.php?ticket=25215&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' 
src='https://blogger.googleusercontent.com/tracker/4670756400590062347-2958887613837854704?l=northgrid-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://northgrid-tech.blogspot.com/feeds/2958887613837854704/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=4670756400590062347&amp;postID=2958887613837854704' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/2958887613837854704'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4670756400590062347/posts/default/2958887613837854704'/><link rel='alternate' type='text/html' href='http://northgrid-tech.blogspot.com/2007/08/fixed-manchester-accouting.html' title='Fixed Manchester accounting'/><author><name>Alessandra Forti</name><uri>http://www.blogger.com/profile/11973932320387024088</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://4.bp.blogspot.com/_nB3QAxoLs5o/SUfJ_bcRvXI/AAAAAAAAARk/nlOg9eaXhaY/S220/patyten_seaOttersSwim.jpg'/></author><thr:total>0</thr:total></entry></feed>
