Monday 1 December 2008

RFIO tuning for Atlas analysis jobs

A little info about the RFIO settings we've tested at Liverpool.

Atlas analysis jobs running at a site using DPM access their data via POSIX-style reads through the RFIO interface. ROOT (since v5.16, IIRC) has support for RFIO access and uses the buffered access mode READBUF, which allocates a static buffer on the client for files read via RFIO. By default this buffer is 128kB.
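
For context, this is roughly what the access path looks like from ROOT's side. The sketch below is a minimal PyROOT example with a made-up file path, and assumes the site's RFIO client libraries and ROOT's RFIO plugin are available on the worker node:

# Minimal PyROOT sketch of opening a file over RFIO (hypothetical path).
# The READBUF buffer described above sits underneath these reads.
import ROOT

f = ROOT.TFile.Open("rfio:///dpm/example.ac.uk/home/atlas/fdr08_sample.root")
if f and not f.IsZombie():
    print("opened OK, bytes read so far:", f.GetBytesRead())
    f.Close()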

Initial tests with this default buffer size showed low cpu efficiency and very high bandwidth usage, with far more data transferred than the size of the files being accessed. The buffer size can be altered by including a file on the client called /etc/shift.conf containing

RFIO IOBUFSIZE XXX

where XXX is the size in bytes (a concrete example value is given after the table below). Altering this setting gave the following results:

Buffer (MB), CPU efficiency (%), Data transferred (GB)
0.125, 60.0, 16.5
1.000, 23.0, 65.5
10.00, 13.5, 174.0
64.00, 62.1, 11.5
128.0, 74.7, 7.5

This was on a test data set with file sizes of ~1.5GB and using athena 14.2.10.
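
For reference, the 128MB case in the table corresponds to a shift.conf line along the lines of the following (taking 128MB as 128x1024x1024 bytes):

RFIO IOBUFSIZE 134217728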

Using buffer sizes of 64MB and above gives gains in efficiency and reduces the required bandwidth. A 128MB buffer is a significant chunk of a worker node's RAM, but as the files are not being cached in the linux file cache the ram usage is likely similar to accessing the file from local disk, and the gains are large.

For comparison the same test was run from a copy of the files on local disk. This gave a cpu efficiency of ~50% but the event rate was ~8 times slower than when using RFIO.

My conclusions are that RFIO buffering is significantly more efficient than standard linux file caching. The default buffer size is insufficient, and increasing it by small amounts actually reduces efficiency further. Increasing the buffer to 64-128MB gives big gains without impacting available RAM too much.

My guess is that only a big buffer gives gains because of the random access pattern of the analysis job on the file. Reading in a small chunk, eg 1MB, may buffer a whole event, but the next event is unlikely to be in that buffered 1MB, so another 1MB has to be read in for the next event. Similarly for 10MB: the amount read in each time is 10x as much, but with a less than 10x increase in the probability of the next event being in the buffer. When the buffer reaches 64MB the probability of an event being in the buffered area is high enough to offset the extra data being read in.

Another possibility is that the buffering only buffers the first xMB of the file, hence a bigger buffer means more of the file is in RAM and there's a higher probability of the event being in the buffer. Neither of these hypotheses has been investigated further yet.
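
To put rough numbers on the "probability of the event being in the buffer" idea, here is a trivial calculation of how much of a ~1.5GB file each tested buffer size covers. This is pure arithmetic under the assumption that events are scattered uniformly through the file; real ROOT files will have more locality than that, so treat it only as a crude proxy:

# Fraction of a ~1.5GB file covered by each tested RFIO buffer size.
# Assumes uniformly scattered events, so this fraction is a rough proxy
# for the chance that the next event is already in the buffer.
FILE_MB = 1500.0
for buf_mb in (0.125, 1, 10, 64, 128):
    print(f"{buf_mb:>7} MB buffer covers {100 * buf_mb / FILE_MB:5.2f}% of the file")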

Large block reads are also more efficient when reading in the data than lots of small random reads. The efficiency effectively becomes 100% if the buffer size is >= the dataset file size: the first reads pull in all of the file and all reads from then on are from local RAM.

This makes no difference to the impact on the head node for, eg, SURL/TURL requests, only to the efficiency of the analysis job accessing the data from the pool nodes and the required bandwidth (our local tests simply used the rfio:///dpm/... path directly). If there are enough jobs there will still be bottlenecks on the network, either at the switch or at the pool node. We have given all our pool nodes at least 3Gb/s connectivity to the LAN backbone.

The buffer size setting will give different efficiency gains for different file sizes (ie the smaller the file size, the better the efficiency); eg the first atlas analysis test had smaller file sizes than our tests and showed much higher efficiencies. The impact of the IOBUFSIZE setting on other VOs' analysis jobs that use RFIO hasn't been tested.

6 comments:

Unknown said...

Care to check with smaller buffers, going down to 4K if possible?

The most likely reason why local IO is slower is that you never get any benefit from the pagecache: you only read the data once, and since your IO is more or less random you can't expect good performance.

A dd (or something similar) before the job starts will load the file into the pagecache, but since most files aren't small you might not have enough memory to keep them there and you are back to square one.

The real problem is that the file format isn't efficient and the IO pattern is random. It's not entirely random for the files you tested, since it seems to be done in blocks of around 128MB, but nothing really tells you that other files will be similar, so you might need a much bigger buffer there :(

John Bland said...

I've run a quick retest at the extremes, with buffer sizes of 4kB, 128kB and 128MB.

Buffer, CPU efficiency, Data transferred
4kB, 60%, 0.66GB
128kB, 46%, 13.95GB
128MB, 71%, 4.50GB

The much smaller buffer has clear gains in efficiency and bandwidth. I assume this is because, even though the buffer is clearly not big enough for a single event, each read only pulls in a very small amount. It does still harm efficiency, however, as it is hitting the latency of the network/rfio interface.

The large 128MB buffer uses more bandwidth but gets even better efficiency. Note that these tests are on a file size of 1.5GB (ie 12x the buffer size).

The 4kB route should give a predictable, low bandwidth solution but leaves you at the mercy of your network latency. The 128MB route gives a less predictable (but still significant) increase to cpu efficiency at the expense of some bandwidth.
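
To make the latency point concrete, here's a back-of-the-envelope sketch; the 0.2ms per-read latency and 1Gb/s bandwidth figures are assumptions for illustration, not measurements from our setup:

# Rough time to pull 1GB of useful data over RFIO with different read sizes.
# Assumed figures for illustration only: 0.2ms round-trip latency per read,
# 1Gb/s (~119MB/s) usable network bandwidth.
LATENCY_S = 0.0002
BANDWIDTH_BPS = 1e9 / 8
DATA_BYTES = 1 * 1024**3

for read_kb in (4, 128, 131072):              # 4kB, 128kB, 128MB reads
    read_bytes = read_kb * 1024
    n_reads = DATA_BYTES / read_bytes
    total = n_reads * LATENCY_S + DATA_BYTES / BANDWIDTH_BPS
    print(f"{read_kb:>7} kB reads: ~{total:6.1f}s ({n_reads:.0f} reads)")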

File sizes less than 128MB give a very similar bandwidth usage to the 4kB setup (as seen during atlas analysis challenge 1) with close to 100% efficiency.

My opinion is that if the bandwidth is there it might as well be sacrificed to give a gain in cpu efficiency (and hence useful analysis/day) but I agree that the base problem is the file layout. If the data was sequential none of this would matter and we could tune BUFSIZE to best match our network and disks.

As for local disk, even though data is only read once the device should be reading ahead by some amount (and so over time building up the whole file), but again as the data is random this gains nothing, particularly if the files/datasets are larger than the cache. Setting a large readahead as per the RFIO buffer might give similar gains but that could seriously affect the rest of the system performance.

All of these findings are on relatively old hardware. Lots of factors such as bus speed, network interface efficiency etc could greatly affect the results (eg the 128kB cpu efficiency is so low partly because the system is spending more system cpu time on moving data across the LAN), so this really needs testing on other sites.

John Bland said...

PS: I tried a 256kB buffer out of interest, as this should be enough for one whole event, but the result was even worse than 128kB.

Simon George said...

hi John,
can you say more about the commands you used to make these measurements?
Thanks,
Simon

John Bland said...

Simon,

The commands used are a simple athena job (14.2.10) run over RFIO, on some atlas FDR08 datasets (~1.5G/file). A voms proxy is made available for GSI access and we have a local hack that forces libshift.so to point to libdpm.so.

The wall time and cpu time are recorded with 'time' on that athena job.

Throughput of data is taken from ganglia (bandwidth is steady enough to take a good estimate).

All tests are performed on otherwise idle nodes directly, not through grid/batch.
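
In case it helps anyone reproduce the numbers, a rough sketch of the bookkeeping is below. This is not the actual script we used, and "MyJobOptions.py" is just a placeholder for the job options that were run:

# Rough sketch of the cpu/wall bookkeeping, equivalent to wrapping the
# job in 'time'. "MyJobOptions.py" is a placeholder, not the real job.
import os, subprocess, time

start = time.time()
subprocess.run(["athena.py", "MyJobOptions.py"])
wall = time.time() - start

t = os.times()                      # children_* include the finished athena job
cpu = t.children_user + t.children_system
print(f"wall {wall:.0f}s, cpu {cpu:.0f}s, efficiency {100 * cpu / wall:.1f}%")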

Jerome Pansanel said...

Hi,

Where do you place the /etc/shift.conf file? On the WNs or the SEs?

Did you see any change in the transfer rate since last October?