Lancaster's been having a bad run with ATLAS jobs the last few weeks. We've been failing jobs with error messages like:
"/dpm/lancs.ac.uk/home/atlas/atlasproddisk/panda/panda.60928294-7fd4-4007-af5e-4cdc3c8934d3_dis18552080/HITS.024853._00326.pool.root.2 : No such file or directory (error 2 on fal-pygrid-35.lancs.ac.uk)"
However, when we break out the great dpm tools and track down this file on disk, it's right where it should be, with correct permissions, size and age, not even attempting to hide. The log files show nothing directly exciting, although there are a lot of deletes going on. Ganglia comes up with something a little more interesting on the dpm head node: heavy use of swap memory and erratic, high CPU use. A restart of the dpm and dpns services seems to have calmed things somewhat this morning, but memory usage is climbing pretty fast.
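While we keep an eye on it, the swap use that ganglia is graphing can also be watched from the command line on the head node. This is just a rough sketch reading /proc/meminfo; the 512 MB warning threshold is an arbitrary assumption, not a tuned value:

```shell
#!/bin/sh
# Rough swap watchdog for the head node: compute swap in use from /proc/meminfo.
swap_used_kb=$(awk '/^SwapTotal:/ {t=$2} /^SwapFree:/ {f=$2} END {print t-f}' /proc/meminfo)
echo "swap in use: ${swap_used_kb} kB"
# 524288 kB = 512 MB; an assumed threshold, pick whatever suits the box.
if [ "${swap_used_kb}" -gt 524288 ]; then
    echo "WARNING: heavy swap use on dpm head node"
fi
```

Dropped into cron, something like this would at least give a timestamped record of when the box starts swapping, which can then be lined up against the spikes in the dpm logs.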
The next step is for me to go shopping for RAM; the head node is a sturdy box but only has 2GB, and an upgrade to 4 should give us more breathing space. But the real badness is that all this swapping should merely degrade performance, not lead to the situation we have, where the dpm databases under high load seem to return false negatives to queries about files, telling us they don't exist when they're right there on disk where they should be.