Friday 20 July 2012

Containing jobs with memory leaks

This week some sites suffered from extremely memory-hungry jobs that used up to 16GB of memory and killed the nodes. These were most likely due to memory leaks. The user cancelled all of them before he was even contacted, but not before causing some annoyance.

We have had some discussion about how to fix this, and so far ATLAS has asked us not to limit on memory, because their jobs use, for brief periods of time, more than what is officially requested. And this is true: most of their jobs do, in fact. According to the logs the production jobs use up to ~3.5GB mem and slightly less than 5GB vmem. See the plot below for one random day (other days are similar).

To avoid killing everything, but still put a barrier against the memory leaks, what I'm going to do in Manchester is set a limit of 4GB for mem and 5GB for vmem.

If you are worried about memory leaks you might want to go through a similar check. If you are not monitoring your memory consumption on a per-job basis you can parse your logs. For PBS I used this command to produce the plot above:

grep atlprd /var/spool/pbs/server_priv/accounting/20120716 | awk '{ print $17, $19, $20 }' | grep status=0 | cut -f3,4 -d'=' | sed 's/resources_used.vmem=//' | sort -n | sed 's/kb//g'
 

The numbers are already sorted in numerical order, so the last one is the highest (mem, vmem) a job has used that day. atlprd is the ATLAS production group, which you can replace with other groups. ATLAS user jobs have, up to a point, similar usage, and then every day you might find a handful of crazy numbers like 85GB vmem and 40GB mem. These are the jobs we aim at killing.
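
If you want the same kind of summary for several groups, and for both mem and vmem in one go, something like the sketch below works. Treat it as a rough sketch: the accounting file path and the group names are my assumptions for illustration, and it relies on the standard group=..., Exit_status=0 and resources_used.mem/vmem=...kb strings in the torque accounting records rather than on fixed awk columns, so check it against your own logs first.

#!/bin/bash
# Rough sketch: peak resources_used.mem and resources_used.vmem per group
# for one day of torque/PBS accounting. Adjust the file path and group list.
ACCT=/var/spool/pbs/server_priv/accounting/20120716

for grp in atlprd atlas lhcb cms; do
    echo "== $grp =="
    for res in mem vmem; do
        grep "group=$grp" "$ACCT" | grep "Exit_status=0" \
          | grep -o "resources_used.$res=[0-9]*kb" \
          | sed 's/[^0-9]//g' | sort -n | tail -1 \
          | awk -v r=$res '{ printf "max %s = %.1f GB\n", r, $1/1024/1024 }'
    done
done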

I thought the batch system was the simplest way because it is only two commands in PBS, but after a lot of reading and a week of testing it turns out it is not possible to over-allocate memory without affecting the scheduling and ending up with fewer jobs on the nodes. This is what I found out:

There are various memory parameters that can be set in PBS:

(p)vmem: virtual memory. PBS doesn't interpret vmem as the almost unlimited address space. If you set this value it will interpret it for scheduling purposes as memory+swap available. It might be different with later versions but that's what happens in torque 2.3.6.
(p)mem: physical memory: that's your RAM.

When there is a p in front it means the limit is per process rather than per job.

If you set them what happens is as follows:

ALL: if a job arrives without memory settings, the batch system will assign these limits as the allocated memory for the job, not just as a limit the job must not exceed.
ALL: if a job arrives with memory resource requests that exceed the limits, it will be rejected.
(p)vmem, pmem: if a job exceeds these settings at run time it will be killed, as these parameters set limits at the OS level.
mem: if a job exceeds this limit at run time it will not get killed. Apparently this is due to a change in the libraries.
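
For example, once resources_max.vmem is set to 5gb on the queue (as shown further down), a submission that explicitly asks for more than that should already be bounced at qsub time; the sleep job is just a placeholder to illustrate the request.

echo 'sleep 600' | qsub -q long -l vmem=6gb

The same job submitted without any -l memory request is accepted instead, and simply inherits the queue limits as its allocated memory, as per the first point above.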

To check how the different parameters affect the jobs, you can submit this csh command directly to PBS and play with the parameters:

echo 'csh -c limit' | qsub -l vmem=5000000kb,pmem=1GB,mem=2GB,nodes=1:ppn=2
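
The stdout file of that test job shows the per-process limits the shell actually inherited (that is what limit prints). While the job is still known to the server you can also compare what was requested with what was assigned and used; the job id below is just a placeholder.

qstat -f <jobid> | grep -i -E 'Resource_List|resources_used'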

If you want to set these parameters you have to do the following:

qmgr
qmgr: set queue long resources_max.vmem = 5gb
qmgr: set queue long resources_max.mem = 4gb
qmgr: set queue long resources_max.pmem = 4gb
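
To check what ended up on the queue, or to back the change out, qmgr can also print and unset the same attributes (queue name as in the example above):

qmgr -c 'print queue long'
qmgr -c 'unset queue long resources_max.vmem'
qmgr -c 'unset queue long resources_max.mem'
qmgr -c 'unset queue long resources_max.pmem'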

These settings will affect the whole queue, so if you are worried about other VOs you might want to check what sort of memory usage they have, although I think only CMS might have similar usage. I know for sure LHCb uses less. And, as said above, this will affect the scheduling.

Update 02/08/2012

RAL and Nikhef use a Maui parameter to correct the over-allocation problem:

NODEMEMOVERCOMMITFACTOR         1.5

This will cause Maui to allocate up to 1.5 times the memory actually present on the nodes. So if a machine has 2GB of memory, a 1.5 factor allows it to allocate 3GB. The same applies to the other memory parameters described above. The factor can of course be tailored to your site.
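
For reference, NODEMEMOVERCOMMITFACTOR is a Maui scheduler parameter, so it goes into maui.cfg and the scheduler has to be restarted to pick it up. The path and the restart command below are the usual ones, but they may differ on your installation.

# in maui.cfg (often /var/spool/maui/maui.cfg or /usr/local/maui/maui.cfg)
NODEMEMOVERCOMMITFACTOR         1.5

# then restart the scheduler so the new factor is picked up
service maui restart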

On the ATLAS side there is a memory parameter that can be set in Panda. It sets a ulimit on vmem on a per-process basis in the Panda wrapper. It didn't seem to have an effect on the memory seen by the batch system, but that might be because forked processes are double-counted by PBS, which opens a whole different can of worms.
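
As a side note on why a per-process ulimit behaves so differently from a per-job limit, here is a small illustration you can run in a plain shell; nothing Panda-specific, the python one-liners are just a convenient way to allocate memory. Each forked child gets its own copy of the limit, so several children together can use far more address space than the single-process cap suggests, and that combined figure is roughly what the batch system accounts against the job.

#!/bin/bash
# Cap the address space of this shell and its children at roughly 1GB.
ulimit -v 1000000   # value in kilobytes

# A single process trying to go past the cap fails...
python -c 'x = " " * (2 * 1024**3)' 2>/dev/null || echo "single process: allocation refused"

# ...but two children each staying under the cap run happily, with a
# combined address space well above it.
python -c 'import time; x = " " * (600 * 1024**2); time.sleep(5)' &
python -c 'import time; x = " " * (600 * 1024**2); time.sleep(5)' &
wait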