Saturday, 14 January 2012

DPM database file systems synchronization

The synchronisation of the DPM database with the data servers' file systems has been a long-standing issue. Last week we had a crash that made it more imperative to check all the files, and I eventually wrote a bash script that makes use of the GridPP DPM admin tools. I don't think this should be the final version, but I'm quicker with bash than with Python, so I started with that. Hopefully later in the year I'll have more time to write a cleaner Python version, based on this one, that can be included in the admin tools. The script does the following:

1) Create a list of files that are in the DB but not on disk
2) Create a list of files that are on disk but not in the DB
3) Create a list of SURLs, from the list of files in the DB but not on disk, to declare lost (this is mostly for ATLAS but could be used by LFC administrators for other VOs)
4) If not in dry run mode, proceed to delete the orphan files and the orphan entries in the DB.
5) Print stats of how many files were in either list.
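
To illustrate steps 1, 2 and 5, here is a minimal sketch of how two file lists can be compared in bash. It is not taken from the script itself: the function name compare_lists, the output file names and the assumption that both lists are plain text files with one physical file path per line are all hypothetical, and it deliberately leaves out how the lists are obtained from the admin tools and the file system.

```shell
#!/usr/bin/env bash
# Hypothetical helper for illustration only, not part of the original script.
# Usage: compare_lists DB_LIST DISK_LIST OUT_DIR
# DB_LIST and DISK_LIST are assumed to hold one physical file path per line.
compare_lists() {
    local db_list=$1 disk_list=$2 out_dir=$3
    local db_sorted disk_sorted
    db_sorted=$(mktemp) || return 1
    disk_sorted=$(mktemp) || return 1

    # comm needs both inputs sorted with the same collation order
    LC_ALL=C sort -u "$db_list" > "$db_sorted"
    LC_ALL=C sort -u "$disk_list" > "$disk_sorted"

    # Lines only in the DB list: DB entries with no file on disk
    LC_ALL=C comm -23 "$db_sorted" "$disk_sorted" > "$out_dir/in_db_not_on_disk.txt"
    # Lines only in the disk list: orphan files with no DB entry
    LC_ALL=C comm -13 "$db_sorted" "$disk_sorted" > "$out_dir/on_disk_not_in_db.txt"

    rm -f "$db_sorted" "$disk_sorted"

    # Print how many files ended up in each list
    echo "in DB, not on disk: $(wc -l < "$out_dir/in_db_not_on_disk.txt" | tr -d ' ')"
    echo "on disk, not in DB: $(wc -l < "$out_dir/on_disk_not_in_db.txt" | tr -d ' ')"
}
```

Calling it as compare_lists db_files.txt disk_files.txt /tmp would write the two difference lists under /tmp and print the two counts. Using sort and comm keeps the comparison a single pass over the sorted lists, which matters when a file system holds a very large number of files.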

Although I put in a few protections, this script should be run with care and, unless in dry run mode, shouldn't be run automatically AT ALL. In dry run mode, however, it will tell you how many files are lost, which is a good metric to monitor regularly as well as after a big crash.

If you want to run it, it has to run on the data servers, where there is access to the file system. As it is now, it requires a modified version of /opt/lcg/etc/DPMINFO that points to the head node rather than localhost, because one of the admin tools used does a direct mysql query. For the same reason it also requires the dpminfo user to have mysql select privileges from the data servers. This is the part that could really benefit from a rewrite in Python, and perhaps from proper use of the API as the other tool does. I also had to heavily parse the output of the tools, which weren't created exactly for this purpose; this too could be avoided in a Python script. There are no command line options, but all the variables that could be options for customizing the script with your local settings (head node, fs mount point, dry_run) are easily found at the top.
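
As a rough sketch of what such a settings block and a dry run guard can look like in bash: the variable names, the placeholder values and the helper functions run and delete_orphans below are all hypothetical and are not copied from the script. HEAD_NODE is shown only as an example of a setting and is not used by the helpers.

```shell
#!/usr/bin/env bash
# Hypothetical settings block; names and values are placeholders for illustration.
HEAD_NODE="dpm-head.example.org"   # DPM head node hosting the database (unused below)
FS_MOUNT="/storage"                # mount point of the data file system on this server
DRY_RUN=${DRY_RUN:-1}              # 1 = only report (default), 0 = really delete

# Run a command only when DRY_RUN is 0; otherwise just print what would be run.
run() {
    if [ "$DRY_RUN" -eq 0 ]; then
        "$@"
    else
        echo "DRY RUN: $*"
    fi
}

# Delete every file named in a list (one path per line),
# skipping any path that is not under FS_MOUNT.
delete_orphans() {
    local list=$1 path
    while IFS= read -r path; do
        case $path in
            "$FS_MOUNT"/*) run rm -f -- "$path" ;;
            *) echo "skipping path outside $FS_MOUNT: $path" >&2 ;;
        esac
    done < "$list"
}
```

Defaulting DRY_RUN to 1 means that a destructive run has to be requested explicitly, which fits the advice never to run the cleanup automatically.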

Creating the lists takes very little time, no more than 3 minutes on my system, but it depends mostly on how busy your head node is.

A cleanup, on the other hand, takes time proportional to how many files have been lost and can take several hours, since it does one DB operation per file. The time to delete the orphan files also depends on how many there are and how big they are, but it should take less time than the DB cleanup.

The script is here: