A little info about the RFIO settings we've tested at Liverpool.
ATLAS analysis jobs running at a site that uses DPM access the data POSIX-style through the RFIO interface. ROOT (since v5.16, IIRC) has support for RFIO access and uses the buffered access mode READBUF. This allocates a static buffer on the client for files read via RFIO; by default this buffer is 128kB.
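For illustration, below is a minimal PyROOT sketch of the kind of access involved (the path and the tree name are hypothetical, and it assumes a machine with the DPM/RFIO client libraries and ROOT's RFIO plugin installed):

    # Open a file over RFIO; ROOT routes the rfio:// URL to its RFIO
    # plugin, which performs the buffered READBUF reads described above
    # through the client-side buffer.
    import ROOT

    f = ROOT.TFile.Open("rfio:///dpm/example.ac.uk/home/atlas/testfile.root")
    if f and not f.IsZombie():
        tree = f.Get("CollectionTree")  # hypothetical tree name
        print("Entries:", tree.GetEntries())
        f.Close()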
Initial tests with this default buffer size showed low CPU efficiency and very high bandwidth usage: the amount of data transferred was far more than the size of the files being accessed. The buffer size can be altered by creating a file on the client called /etc/shift.conf containing
RFIO IOBUFSIZE XXX
where XXX is the buffer size in bytes. Altering this setting gave the following results:
Buffer (MB)   CPU (%)   Data transferred (GB)
0.125         60.0       16.5
1.000         23.0       65.5
10.00         13.5      174.0
64.00         62.1       11.5
128.0         74.7        7.5
This was on a test data set with file sizes of ~1.5GB, using Athena 14.2.10.
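For reference, the best-performing setting above (a 128MB buffer, i.e. 128 x 1024 x 1024 = 134217728 bytes) corresponds to an /etc/shift.conf entry of

RFIO IOBUFSIZE 134217728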
Using buffer sizes of 64MB and above gives clear gains in both efficiency and required bandwidth. A 128MB buffer is a significant chunk of a worker node's RAM, but as the files are not also being cached in the Linux file cache, the RAM usage is likely similar to accessing the file from local disk, and the gains are large.
For comparison, the same test was run on a copy of the files on local disk. This gave a CPU efficiency of ~50%, but the event rate was ~8 times lower than when using RFIO.
My conclusions are that RFIO buffering is significantly more efficient than standard Linux file caching; that the default buffer size is too small, while increasing it by small amounts (to 1-10MB) actually makes efficiency much worse; and that increasing the buffer to 64-128MB gives big gains without impacting available RAM too much.
My guess as to why only a big buffer gives gains is the random access pattern of the analysis job on the file. Reading in a small chunk, e.g. 1MB, may buffer a whole event, but the next event is unlikely to lie within that buffered 1MB, so another 1MB has to be read in for the next event. Similarly for 10MB: each read pulls in 10x as much data, but with a less than 10x increase in the probability of the next event being in the buffer. Once the buffer reaches 64MB, the probability of an event being in the buffered region is high enough to offset the extra data being read in.
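As a rough cross-check of this picture, dividing the measured data transferred by the buffer size gives the implied number of buffer refills in each test (a back-of-envelope estimate that assumes every refill reads a full buffer's worth of data):

    # Implied refills = data transferred / buffer size, taken from the
    # measurements in the table above (approximation: each refill is
    # assumed to read a full buffer).
    rows = [  # (buffer in MB, CPU %, data transferred in GB)
        (0.125, 60.0, 16.5),
        (1.0, 23.0, 65.5),
        (10.0, 13.5, 174.0),
        (64.0, 62.1, 11.5),
        (128.0, 74.7, 7.5),
    ]
    for buf_mb, cpu, gb in rows:
        refills = gb * 1024 / buf_mb
        print(f"{buf_mb:7.3f} MB buffer: ~{refills:8.0f} refills, CPU {cpu:5.1f}%")

The implied refill count collapses from tens of thousands at 1-10MB to a few hundred or fewer at 64-128MB, consistent with the buffer hit probability crossing a useful threshold somewhere in between.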
Another possibility is that the buffering only buffers the first X MB of the file, so a bigger buffer means more of the file is in RAM and there's a higher probability of the event being in the buffer. Neither of these hypotheses has been investigated further yet.
Large block reads are also more efficient for reading in the data than lots of small random reads. The efficiency effectively becomes 100% if the buffer size is >= the file size: the first reads pull in the whole file, and all reads from then on are served from local RAM.
This makes no difference to the load on the head node from e.g. SURL/TURL requests; it only affects the efficiency of the analysis job reading the data from the pool nodes and the bandwidth required (our local tests simply used the rfio:///dpm/... path directly). If there are enough jobs there will still be bottlenecks on the network, either at the switch or the pool node. We have given all our pool nodes at least 3Gb/s connectivity to the LAN backbone.
The buffer size setting will give different efficiency gains for different file sizes (the smaller the file, the better the efficiency); e.g. the first ATLAS analysis test used smaller files than our tests and showed much higher efficiencies. The impact of the IOBUFSIZE setting on other VOs' analysis jobs that use RFIO hasn't been tested.