A busy week for Lancaster on the SE front. We had the "PNFS move", where the PNFS services were moved from the admin node onto their own host. There were complications, mainly because the recipe I found was missing one or two key details, which I overlooked when preparing for the move.
I am going to wikify my fun and games, but essentially my problems can be summed up as:
Make sure in the node_config both the admin and pnfs nodes are marked down as "custom" in their node type. Keeping the admin node as "admin" causes it to want to run PNFS as well.
In the pnfs exports directory, make sure the srm node is listed, and that the pnfs directory from the pnfs node is mounted on the srm node. This is similar to how door nodes are mounted, although not quite the same. To be honest I'm not sure I have it right, but it seems to work.
Start things in the right order: the pnfs server on the PNFS node, the dcache-core services on the admin node, then the PNFSmanager on the PNFS node. I found that on a restart of the admin node services I had to restart the PNFSmanager. I'm not sure how I can fix this to enable automatic startups of our dcache services in the correct order.
Make sure that postgres is running on the admin node. Nothing produces an error on startup if postgres isn't up (as it would have done if running pnfs on the node), but transfers will simply not work when you attempt them.
Don't do things with a potential to go wrong on a Friday afternoon if you can avoid it!
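The start-up order and the postgres check above could be wrapped in a small script along these lines. This is only a sketch and not something we run: the host names and the commands in STEPS are placeholders (they are not real dCache commands), and the check assumes postgres accepts TCP connections on its default port 5432, which may not be true if it only listens on a local socket.

```python
import socket
import subprocess


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def run_in_order(steps, run=None):
    """Run (label, command) steps in order, stopping at the first failure.

    Returns the labels of the steps that completed successfully.
    """
    if run is None:
        run = lambda cmd: subprocess.run(cmd).returncode
    done = []
    for label, cmd in steps:
        if run(cmd) != 0:
            break
        done.append(label)
    return done


# Placeholder hosts and commands: substitute whatever starts each service at your site.
STEPS = [
    ("pnfs server on PNFS node",
     ["ssh", "pnfs-node.example", "start-pnfs-placeholder"]),
    ("dcache-core on admin node",
     ["ssh", "admin-node.example", "start-dcache-core-placeholder"]),
    ("PNFSmanager on PNFS node",
     ["ssh", "pnfs-node.example", "start-pnfsmanager-placeholder"]),
]

# Intended use (not executed here):
#   if port_open("admin-node.example", 5432):
#       print(run_in_order(STEPS))
```

Stopping at the first failed step matters here because the PNFSmanager has to come up after the dcache-core services, so carrying on after a failure would just reproduce the broken state described above.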
Since the move we have yet to see a significant performance increase, but then it's yet to be seriously challenged. We performed some more postgres housekeeping on the admin node after the move which made it a lot happier. Since the move we have noticed occasional srm SFT failures with a "permission denied" type failure, although checking things in the pnfs namespace we don't see any glaring ownership errors. I'm investigating it.
We have had some other site problems this week caused by the clocks on several nodes being off by a good few minutes. It seems Lancaster's ntp server is unwell.
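One way to spot drifting nodes early would be to compare each node's clock against a reference time server. The sketch below sends a basic SNTP request and reads the transmit timestamp from the reply. It ignores network delay, so it is only good for catching offsets of seconds or minutes, and the server name in the usage comment is a placeholder rather than our actual ntp server.

```python
import socket
import struct
import time

NTP_EPOCH_OFFSET = 2208988800  # seconds between 1900-01-01 and 1970-01-01


def transmit_time(packet):
    """Extract the server transmit timestamp (Unix seconds) from an NTP reply."""
    seconds, fraction = struct.unpack("!II", packet[40:48])
    return seconds - NTP_EPOCH_OFFSET + fraction / 2**32


def clock_offset(server, timeout=3.0):
    """Rough offset (server clock minus local clock) in seconds."""
    request = b"\x1b" + 47 * b"\0"  # LI=0, version 3, mode 3 (client)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(request, (server, 123))
        reply, _ = sock.recvfrom(512)
    return transmit_time(reply) - time.time()


# Intended use on each node (placeholder server name, not executed here):
#   offset = clock_offset("ntp.example.org")
#   if abs(offset) > 60:
#       print("clock is off by %.0f seconds" % offset)
```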
The room where we keep our Pool Nodes is suffering from heat issues. This always leaves us on edge, as our SE has had to be shut down before because of this, and the heat can make things flakey. Hopefully that machine room will get more cooling power and soon.
Other site news from Peter:
Misconfigured VO_GEANT4_SW_DIR caused some WNs to have a full /
partition, becoming blackholes. On top of this, a typo (extra quote) in
the site-info.conf caused lcg-env.sh to be messed up, failing jobs
immediately. Fixed now but flags up how sensitive the system is to
tweaks. Our most stable production month was when we implemented a