After I applied 3 of the mysql parameter changes I talk about in this post, I didn't see the improvement I was hoping for in the atlas job timeouts.
This is another set of optimizations I put together after further searching.
First of all I started to systematically count the TIME_WAIT connections every five minutes. In the same log file I also recorded the number of concurrent threads the server keeps, mostly in sleep mode, so the two could be correlated. You can get the latter by running mysqladmin -p proc stat or from within a mysql command line. The number of threads was close to the default maximum allowed by mysql, so I doubled that in my.cnf.
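The five-minute sampling described above could be sketched roughly as follows. This is not the script I actually used: the function names, the log path and the use of netstat -ant are assumptions for illustration, and it assumes mysqladmin can pick up its credentials from an option file so that it does not prompt for a password.

```shell
#!/bin/sh
# Hypothetical helpers for logging TIME_WAIT sockets next to mysql threads.

# Count sockets in TIME_WAIT from `netstat -ant` output supplied on stdin.
# In that output the connection state is the sixth column.
count_time_wait() {
    awk '$6 == "TIME_WAIT" { n++ } END { print n + 0 }'
}

# Extract the thread count from `mysqladmin status` output on stdin,
# which contains a field of the form "Threads: N".
parse_threads() {
    sed -n 's/.*Threads: *\([0-9][0-9]*\).*/\1/p'
}

# Append one timestamped sample to the log file.
# LOG is a placeholder path; credentials are assumed to come from ~/.my.cnf.
log_sample() {
    tw=$(netstat -ant | count_time_wait)
    th=$(mysqladmin status | parse_threads)
    echo "$(date '+%F %T') time_wait=$tw threads=$th" >> "${LOG:-/tmp/mysql_conn.log}"
}
```

Calling log_sample from a cron entry scheduled with */5 in the minutes field would give one line every five minutes, which makes it easy to see whether the two numbers rise together.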
Then I halved the kernel timeout for TIME_WAIT connections:
sysctl -w net.ipv4.tcp_fin_timeout=30
The default value is 60 seconds. If you add the setting to /etc/sysctl.conf it becomes permanent.
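For making a setting permanent without ending up with duplicate lines, something like the helper below could be used. It is only a sketch: persist_sysctl is a made-up name, and the function simply rewrites the file with a single "key = value" line for the given key.

```shell
#!/bin/sh
# Hypothetical helper: add or update "key = value" in a sysctl-style file.
# Usage: persist_sysctl KEY VALUE [FILE]   (FILE defaults to /etc/sysctl.conf)
persist_sysctl() {
    key=$1
    value=$2
    file=${3:-/etc/sysctl.conf}
    touch "$file" || return 1
    tmp=$(mktemp) || return 1
    # Keep every line except an existing assignment of exactly this key;
    # comments and blank lines pass through unchanged.
    awk -v k="$key" -F= '{ name = $1; gsub(/[ \t]/, "", name); if (name != k) print }' "$file" > "$tmp"
    echo "$key = $value" >> "$tmp"
    # Write back in place so the file keeps its owner and permissions.
    cat "$tmp" > "$file" && rm -f "$tmp"
}
```

Run as root, persist_sysctl net.ipv4.tcp_fin_timeout 30 followed by sysctl -p would store the value and reload the file, and sysctl net.ipv4.tcp_fin_timeout shows the value currently in effect.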
Finally, I found this article, which explicitly talks about mysql tuning to reduce connection timeouts: Mysql Connection Timeouts. Following it, I set the following:
sysctl -w net.ipv4.tcp_max_syn_backlog=8192
sysctl -w net.core.somaxconn=512
Again, add these to /etc/sysctl.conf to make them permanent. I also made a corresponding addition in my.cnf.
I based my numbers on 500 connections/s because that is what I observed while I was doing all this (I observed even larger numbers at times). Admittedly they are now stable at 330 connections per second, but we haven't had any heavy ramp-up since Saturday, only a mild one that didn't cause any timeouts. I'm waiting for a serious ramp as the definitive test. That said, since Saturday we haven't seen any timeout errors, not even the low background that was always present, so there is already an improvement.
Today there was an atlas ramp from almost 0 to >1400 jobs, with no timeouts so far.
A few timeouts were seen yesterday, but they were due to authentication between the head node and a couple of data servers, which I will have to investigate. They are only a handful, nowhere near the scale observed before, and not due to mysql. I will still keep things under observation for a while longer, just in case.