Tuesday, 3 June 2008

Slaughtering ATLAS jobs?

These heavy ion jobs have generated lots of discussion in various forums. They are ATLAS heavy ion simulations (Pb-Pb) which are being killed in two ways. (1) by the site's batch system if the queue walltime limit is reach. (2) by the atlas pilot because the log file modification time hasn't change in 24 hrs.

Either way, sites shouldn't worry if they see these, ATLAS production is aware. They're only single event and you might see massive mem usage too > 2G/core. :-)

According to Steve, the new WN should allow jobs to gracefully handle batch kills with a suitable delay between SIGTERM and SIGKILL.

No comments: