Northgrid-tech: DPM optimization next round

After applying three of the MySQL parameter changes I discussed in this post, I didn't see the improvement I was hoping for in the ATLAS job timeouts.

This is another set of optimizations I put together after further searching.

First of all, I started to systematically count the TIME_WAIT connections every five minutes. In the same log file I also correlated them with the number of concurrent threads the server keeps, mostly in sleep mode. You can get the thread information by running mysqladmin -p proc stat or from within the mysql command line. The number of threads was close to MySQL's default maximum, so I doubled it in my.cnf:

max_connections=200
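
The five-minute sampling described above could be scripted roughly as follows. This is only a sketch, not the script actually used: the function names are made up for illustration, it assumes netstat is available, and it assumes MySQL credentials are stored in ~/.my.cnf so that mysqladmin does not prompt for a password.

```shell
#!/bin/sh
# Count sockets in TIME_WAIT from `netstat -ant` output read on stdin.
# The connection state is the sixth column of each tcp line.
count_time_wait() {
    awk '$6 == "TIME_WAIT" { n++ } END { print n + 0 }'
}

# Extract the thread count from `mysqladmin status` output read on stdin,
# which looks like "Uptime: 8  Threads: 187  Questions: 4 ...".
parse_threads() {
    awk '{ for (i = 1; i < NF; i++) if ($i == "Threads:") print $(i + 1) }'
}

# Append one timestamped sample to the log file given as $1.
# Assumption: credentials come from ~/.my.cnf, so no -p prompt is needed.
log_sample() {
    tw=$(netstat -ant | count_time_wait)
    th=$(mysqladmin status | parse_threads)
    printf '%s time_wait=%s threads=%s\n' \
        "$(date '+%Y-%m-%d %H:%M:%S')" "$tw" "$th" >> "$1"
}
```

Calling log_sample with a log file path from a cron job on a */5 schedule would give one line per sample, which makes it easy to see whether TIME_WAIT counts and thread counts rise together.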

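Then I halved the kernel timeout that I took to govern TIME_WAIT connections (strictly speaking, tcp_fin_timeout sets how long orphaned connections stay in FIN_WAIT_2):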

sysctl -w net.ipv4.tcp_fin_timeout=30

The default value is 60 seconds. Adding the setting to /etc/sysctl.conf makes it permanent.
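
One way to make such a setting permanent without ending up with duplicate lines is sketched below. The helper name persist_sysctl is hypothetical, and the optional third argument exists only so the function can be tried on a scratch file before touching /etc/sysctl.conf.

```shell
#!/bin/sh
# Set "key = value" in a sysctl.conf-style file, replacing any existing
# line for the same key. Usage: persist_sysctl KEY VALUE [FILE]
persist_sysctl() {
    key=$1
    value=$2
    file=${3:-/etc/sysctl.conf}
    [ -f "$file" ] || : > "$file"
    tmp=$(mktemp) || return 1
    # Escape the dots in the key so they match literally in the regex.
    pattern=$(printf '%s' "$key" | sed 's/\./\\./g')
    # Keep every line except an existing assignment of this key.
    grep -v -E "^[[:space:]]*${pattern}[[:space:]]*=" "$file" > "$tmp" || true
    printf '%s = %s\n' "$key" "$value" >> "$tmp"
    # Overwrite in place so the file keeps its owner and permissions.
    cat "$tmp" > "$file"
    rm -f "$tmp"
}
```

Run as root, persist_sysctl net.ipv4.tcp_fin_timeout 30 followed by sysctl -p would store the value and reload the file.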

Finally, I found this article, which explicitly discusses MySQL tunings to reduce connection timeouts: Mysql Connection Timeouts. Following it, I set:

sysctl -w net.ipv4.tcp_max_syn_backlog=8192
sysctl -w net.core.somaxconn=512

Again, add these to /etc/sysctl.conf to make them permanent. I also added the following to my.cnf:

back_log=500

I based my numbers on 500 connections/s because that is what I observed when I did all this (I have observed even larger numbers). Admittedly the rate is now stable at 330 connections per second, but we haven't had any heavy ramp-up since Saturday, only a mild one that didn't cause any timeouts. I'm waiting for a serious ramp as the definitive test. That said, since Saturday we haven't seen any timeout errors, not even the low background that was always present, so there is already an improvement.
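
It is worth noting why somaxconn was raised together with back_log: the kernel silently caps the backlog a program asks for in listen() at net.core.somaxconn, so a back_log of 500 only takes effect if somaxconn is at least 500. The small check below illustrates this; the function name is made up for the example.

```shell
#!/bin/sh
# Print the listen backlog the kernel will actually honour:
# the smaller of the requested value and net.core.somaxconn.
# Usage: effective_backlog REQUESTED SOMAXCONN
effective_backlog() {
    requested=$1
    somaxconn=$2
    if [ "$requested" -gt "$somaxconn" ]; then
        echo "$somaxconn"
    else
        echo "$requested"
    fi
}
```

On a live system, effective_backlog 500 "$(sysctl -n net.core.somaxconn)" shows whether the my.cnf value is being truncated. With somaxconn left at a low value such as 128, the 500 requested by MySQL would be cut down to 128.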

Update 16/06/2011

Today there was an ATLAS ramp from almost 0 to >1400 jobs, with no timeouts so far.

A few timeouts were seen yesterday, but they were due to authentication between the head node and a couple of data servers, which I will have to investigate. They are only a handful, nowhere near the scale observed before, and not due to MySQL. I will still keep things under observation for a while longer, just in case.

Tuesday, 14 June 2011
