Thursday 9 August 2012

10GBE Network cards installation in Manchester

This is a collection of recipes I used to install the 10GBE cards. As I said in a previous post we chose to go 10GBASE-T, so we bought X520-T2 cards. They use the same chipset as the X520-DA2, so much of the setup is common to both.

The new DELL R610 and C6100 were delivered with the cards already installed. However, because the DA2 and T2 share the same chipset, the C6100 were delivered with the wrong connectors, so we are now waiting for replacements. For the old Viglen WNs and storage we bought additional cards that have to be installed one by one.

I started the installation process with the R610s because a) they already had the cards and b) the perfsonar machines are R610s. The aim is to use these cards as primary interfaces and kickstart from them. PXE booting is not enabled by default, so one has to get bootutil from the Intel site. The download is, for some reason, a Windows executable, but once it is unpacked there are directories for other operating systems. The easiest thing to do is what Andrew has done: zip the unpacked directory and run bootutil from the machine itself, without fussing around with USB sticks or boot disks. That said, it needs the kernel source to compile, so you need to make sure you install the same version as the running kernel.

yum install kernel-devel(-running-kernel-version)
unzip APPS.zip
cd APPS/BootUtils/Linux_x86/
chmod 755 ./install
./install
./bootutil64e -BOOTENABLE=pxe -ALL
./bootutil64e  -UP=Combo -FILE=../BootIMG.FLB -ALL

The first bootutil command enables PXE; the second updates the firmware.
After this you can reboot, enter the BIOS, and rearrange the boot order of the network devices. When this is done you can put the 10GBE interface MAC address in the DHCP configuration and reinstall from there.
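For the DHCP side this is just a standard host stanza keyed on the 10GBE MAC. A sketch (hostname, MAC, IP addresses and server names below are made-up placeholders):

```text
# dhcpd.conf fragment -- all values here are placeholders
host wn001 {
  hardware ethernet 00:1b:21:xx:xx:xx;   # MAC of the 10GBE interface
  fixed-address 10.0.0.101;
  next-server 10.0.0.1;                  # TFTP/install server
  filename "pxelinux.0";
}
```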

At kickstart time there are some problems with the machine changing the order of the cards; you can solve that using ipappend 2 and ksdevice=bootif in the pxelinux.cfg files, as suggested in the RH docs. Thanks to Ewan for pointing that out.
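For reference, a minimal pxelinux.cfg entry with those two settings might look like this (the kernel, initrd and kickstart URL are placeholders; IPAPPEND 2 passes the booting interface's MAC to the installer as BOOTIF, which ksdevice=bootif then picks up):

```text
default ks
prompt 0

label ks
  kernel vmlinuz
  append initrd=initrd.img ks=http://install.example.org/ks.cfg ksdevice=bootif
  ipappend 2
```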

Still, the machine might not come back up with the interface working. There can be two problems here:

1) The X520-T2 interfaces take longer to wake up than their little 1GBE sisters, so a delay has to be inserted after the /sbin/ip command in the network scripts. I didn't have to hack anything to do this; I could just set

LINKDELAY=10

in the ifcfg-eth* configuration files and it worked.
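Put in context, the ifcfg file might look like the sketch below (device name and settings are placeholders for whatever your machine uses):

```text
# /etc/sysconfig/network-scripts/ifcfg-eth0 -- sketch, values are placeholders
DEVICE=eth0
BOOTPROTO=dhcp
ONBOOT=yes
# wait up to 10 seconds for the 10GBE link to come up before configuring
LINKDELAY=10
```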

2) The 10GBE interface is not guaranteed to come up as eth0. There are many ways to stop this from happening.

One is to make sure HWADDR in ifcfg-eth0 is assigned the MAC address of the card the administrator wants, and not whatever the system decides. This can be done at kickstart time, but it might mean having a kickstart file for each machine, which we are trying to get away from.

Dan and Chris suggested this might be corrected with udev. The recipe they gave me was this:

cat /etc/udev/rules.d/70-persistent-net.rules
KERNEL=="eth*", ID=="0000:01:00.0", NAME="eth0"
KERNEL=="eth*", ID=="0000:01:00.1", NAME="eth1"
KERNEL=="eth*", ID=="0000:04:00.0", NAME="eth2"
KERNEL=="eth*", ID=="0000:04:00.1", NAME="eth3"


It uses the PCI device ID value, which is the same across machines of the same type (R610, C6100...). You can get the ID values using lspci | grep Eth. Not essential, but if lspci returns something like "Unknown device 151c (rev 01)" in the description, it is just the PCI database that is out of date; use update-pciids to refresh it. There are other recipes around if you don't like this one, but this one simplifies the maintenance of the interface naming scheme a lot.
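The rule lines can also be generated straight from the lspci output rather than typed by hand. A small sketch, assuming you pipe in `lspci -D | grep Eth` (the -D flag prints the full domain-qualified PCI address; the sample output below is made up and real hosts will differ):

```shell
# Hypothetical `lspci -D | grep Eth` output; on a real host pipe the command in instead.
sample='0000:01:00.0 Ethernet controller: Intel Corporation 82599EB 10-Gigabit Network Connection
0000:04:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet'

# Turn each PCI address into a udev rule, numbering interfaces in lspci order.
echo "$sample" | awk '{ printf "KERNEL==\"eth*\", ID==\"%s\", NAME=\"eth%d\"\n", $1, NR-1 }'
# first line printed: KERNEL=="eth*", ID=="0000:01:00.0", NAME="eth0"
```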

The udev recipe doesn't work if HWADDR is set in the ifcfg-eth* files. If it is, you need to remove it to make udev work. A quick way to do this in every file is

sed -i -r '/^HWADDR.*$/d' ifcfg-eth*

in the kickstart %post section, and then install the udev file.
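Put together, the %post fragment might look like this sketch (the rules file content is the R610 recipe from above; the PCI IDs would need adjusting per machine type):

```text
%post
# strip HWADDR lines so udev controls the interface naming
sed -i -r '/^HWADDR.*$/d' /etc/sysconfig/network-scripts/ifcfg-eth*

# pin interface names by PCI device ID (R610 values)
cat > /etc/udev/rules.d/70-persistent-net.rules <<'EOF'
KERNEL=="eth*", ID=="0000:01:00.0", NAME="eth0"
KERNEL=="eth*", ID=="0000:01:00.1", NAME="eth1"
KERNEL=="eth*", ID=="0000:04:00.0", NAME="eth2"
KERNEL=="eth*", ID=="0000:04:00.1", NAME="eth3"
EOF
```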

10GBE cards might need different TCP tuning in /etc/sysctl.conf. For now I took the settings from the perfsonar machines, which are similar to something already discussed a long time ago (they can be applied without a reboot using sysctl -p).

net.core.rmem_max = 33554432
net.core.wmem_max = 33554432
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 87380 16777216
net.core.netdev_max_backlog = 30000
net.ipv4.tcp_no_metrics_save = 1
net.ipv4.tcp_congestion_control = htcp


The effects of moving to 10GBE can be seen very well in the perfsonar tests.