Comments on Northgrid-tech: RFIO tuning for Atlas analysis jobs (blog by Alessandra Forti)
Jerome Pansanel (2011-05-25):

Hi,

Where do you place the /etc/shift.conf file? On the WNs or the SEs?

Did you see any change in the transfer rate since last October?


John Bland (2009-02-10):

Simon,

The commands used are a simple athena job (14.2.10) run over RFIO, on some ATLAS FDR08 datasets (~1.5 GB/file). A VOMS proxy is made available for GSI access, and we have a local hack that forces libshift.so to point to libdpm.so.

The wall time and CPU time are recorded with 'time' on that athena job.

Throughput of data is taken from Ganglia (bandwidth is steady enough to take a good estimate).

All tests are performed on otherwise idle nodes directly, not through grid/batch.


Simon George (2009-02-09):

Hi John,
Can you say more about the commands you used to make these measurements?
Thanks,
Simon


John Bland (2008-12-02 12:45):

PS: I tried a 256 kB buffer out of interest, as this should be enough for one whole event, but the result was even worse than 128 kB.


John Bland (2008-12-02 12:33):
I've run a quick retest at the extremes, with buffer sizes of 4 kB, 128 kB and 128 MB.

  Buffer   Efficiency   Throughput
  4 kB     60%          0.66 GB
  128 kB   46%          13.95 GB
  128 MB   71%          4.50 GB

The much smaller buffer gives clear gains in efficiency and bandwidth use. I assume this is because, even though the buffer is clearly not big enough for a single event, each read only fetches a very small amount. It does hurt efficiency, however, as each read pays the latency of the network/RFIO interface.

The large 128 MB buffer uses more bandwidth but gets even better efficiency. Note that these tests are on a file size of 1.5 GB (i.e. 12x the buffer size).

The 4 kB route should give a predictable, low-bandwidth solution, but leaves you at the mercy of your network latency. The 128 MB route gives a less predictable (but still significant) increase in CPU efficiency at the expense of some bandwidth.

Files smaller than 128 MB give very similar bandwidth usage to the 4 kB setup (as seen during ATLAS analysis challenge 1), with close to 100% efficiency.

My opinion is that if the bandwidth is there, it might as well be sacrificed for a gain in CPU efficiency (and hence useful analysis/day), but I agree that the base problem is the file layout. If the data were sequential, none of this would matter and we could tune BUFSIZE to best match our network and disks.

As for local disk: even though data is only read once, the device should be reading ahead by some amount (and so over time building up the whole file), but again, as the access pattern is random, this gains nothing, particularly if the files/datasets are larger than the cache. Setting a large readahead to match the RFIO buffer might give similar gains, but that could seriously affect the rest of the system's performance.

All of these findings are on relatively old hardware.
Lots of factors such as bus speed, network interface efficiency, etc. could greatly affect the results (e.g. the 128 kB CPU efficiency is so low partly because the system is spending more system CPU time moving data across the LAN), so this really needs testing on other sites.


Anonymous (2008-12-01):

Care to check with smaller buffers, going down to 4 kB if possible?

The most likely reason local IO is slower is that you never use the page cache: you only read the data once and, since your IO is more or less random, you can't expect good performance.

A dd (or something similar) before the job starts will load the file into the page cache, but since most files aren't small, you might not have enough memory to keep them there, and you are back to square one.

The real problem is that the file format isn't efficient and the IO pattern is random. It's not entirely random for the files you tested, since it seems to be done in blocks of around 128 MB, but nothing really tells you that other files will be similar, so you might need a much bigger buffer there :(
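John's measurement recipe above (run the job under 'time', compare wall and CPU time) can be sketched as a small harness. This is a minimal illustration, not the setup actually used: the workload below is a stand-in for the athena 14.2.10 job, and `time_job` is a hypothetical helper name.

```python
#!/usr/bin/env python3
"""Sketch of a wall-vs-CPU timing harness, as John describes using `time`.
The workload here is a placeholder, not the athena command from the tests."""
import resource
import subprocess
import sys
import time

def time_job(cmd):
    """Run cmd to completion; return (wall_seconds, cpu_seconds).

    CPU time comes from getrusage(RUSAGE_CHILDREN), so it accumulates
    over all children waited for so far -- fine for a one-shot harness.
    """
    start = time.monotonic()
    subprocess.run(cmd, check=True)
    wall = time.monotonic() - start
    usage = resource.getrusage(resource.RUSAGE_CHILDREN)
    cpu = usage.ru_utime + usage.ru_stime  # user + system time of children
    return wall, cpu

if __name__ == "__main__":
    # Placeholder workload standing in for the real athena job.
    wall, cpu = time_job([sys.executable, "-c", "sum(range(10**6))"])
    print(f"wall={wall:.2f}s cpu={cpu:.2f}s efficiency={100 * cpu / wall:.0f}%")
```

Efficiency here is simply cpu/wall; for an IO-bound athena job, the gap between the two is the time spent waiting on RFIO.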
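On John's concern that raising the block-device readahead would affect the rest of the system: a per-file hint is a gentler alternative, since it only touches the advised file. A sketch using Linux's posix_fadvise (advisory only, Linux-specific; `read_with_prefetch` is an illustrative name, not anything from the tests above):

```python
#!/usr/bin/env python3
"""Sketch: per-file read-ahead hints instead of a global readahead bump.
posix_fadvise(POSIX_FADV_WILLNEED) asks the kernel to start pulling the
given range into the page cache; it affects only this file. Linux-only."""
import os

def read_with_prefetch(path, chunk=128 * 1024):
    """Hint the kernel to prefetch the whole file, then read it; returns bytes read."""
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        # Advisory: kernel may begin reading the file into the page cache.
        os.posix_fadvise(fd, 0, size, os.POSIX_FADV_WILLNEED)
        total = 0
        while True:
            buf = os.read(fd, chunk)
            if not buf:
                break
            total += len(buf)
        return total
    finally:
        os.close(fd)
```

Because the hint is per-file, it avoids the system-wide side effects of something like raising the device readahead for every reader on the node.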
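The dd pre-load suggested above, with its own "not enough memory" caveat built in, might look like the following Linux-only sketch. It assumes the MemAvailable field in /proc/meminfo (kernel 3.14+); the function names are illustrative, not from the thread.

```python
#!/usr/bin/env python3
"""Sketch of the pre-warming trick: read the file once, as
`dd if=FILE of=/dev/null` would, so later random reads hit the page cache.
Skips files too big to stay cached (the 'back to square one' case).
Linux-only: relies on MemAvailable in /proc/meminfo."""
import os

def mem_available():
    """Return MemAvailable in bytes (Linux, kernel >= 3.14)."""
    with open("/proc/meminfo") as meminfo:
        for line in meminfo:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1]) * 1024  # value is in kB
    raise RuntimeError("MemAvailable not found in /proc/meminfo")

def warm_pagecache(path, block=1 << 20):
    """Read path sequentially, discarding data; returns bytes read (0 = skipped)."""
    size = os.path.getsize(path)
    if size > mem_available():
        return 0  # bigger than free RAM: warming would just evict itself
    total = 0
    with open(path, "rb") as f:
        while chunk := f.read(block):
            total += len(chunk)
    return total
```

The size check is the point of the sketch: warming only pays off when the whole file (or dataset) fits in otherwise-free memory, exactly the caveat raised in the comment.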