Monday 10 November 2008

DPM "File Not Found"- but it's right there!

Lancaster's been having a bad run with atlas jobs the last few weeks. We've been failing jobs with error messages like:
"/dpm/lancs.ac.uk/home/atlas/atlasproddisk/panda/panda.60928294-7fd4-4007-af5e-4cdc3c8934d3_dis18552080/HITS.024853._00326.pool.root.2 : No such file or directory (error 2 on fal-pygrid-35.lancs.ac.uk)"

However when we break out the great dpm tools and track down this file to disk, it's right where it should be, with correct permissions, size and age- not even attempting to hide. The log files show nothing directly exciting, although there are a lot of deletes going on. Ganglia comes up with something a little more interesting on the dpm head node- heavy use of swap memory and erratic high CPU use. A restart of the dpm and dpns services seems to have calmed things somewhat this morning, but mem usage is climbing pretty fast:

http://fal-pygrid-17.lancs.ac.uk:8123/ganglia/?r=day&sg=&c=LCG-ServiceNodes&h=fal-pygrid-30.lancs.ac.uk

The next step is for me to go shopping for RAM, the headnode is a sturdy box but only has 2GB, an upgrade to 4 should give us more breathing space. But the real badness comes from the fact that all this swapping should decrease performance but not lead to the situation we have where the dpm databases, under high load, seem to return false negatives to queries about files- telling us they don't exist when they're right there on disk where they should be.

1 comment:

Unknown said...

I had an error recently when trying to open a file ( Aistė ir Roberta - Nebraidžiosim po rasą.mp3 ) with Billy Mp3 player. It threw up an error saying file not found, but the file was actually there on my hard drive, showing plainly in windows explorer. I later came to find out some programs don't understand path names with languages specific symbols because it opened fine in Windows Media Player. So if this type of error happens to you, try opening the file with a different program that handles that same file type.