i/oat


I/OAT

I/OAT (I/O Acceleration Technology) is Intel's name for a collection of techniques for improving network throughput. The most significant of these is the DMA engine, which offloads from the CPU the copying of SKB (socket buffer) data into user buffers. This is not zero-copy receive, but it does let the CPU do other work while the DMA engine performs the copy operations.


Implementation on Linux

The I/OAT patch series consists of three general areas. First, it adds a DMA subsystem to the kernel, which abstracts the DMA engine hardware from users of it. Second, it adds the I/OAT hardware driver, which plugs into the DMA subsystem and handles controlling the actual hardware. Finally, it implements a series of modifications to the network stack to make use of asynchronous copy offload.
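The split between a generic DMA subsystem and a hardware driver that plugs into it can be pictured with a small user-space sketch. Everything below is hypothetical and is not the kernel's actual interface: struct copy_engine, the submit/is_complete callbacks and the cookie type are invented for illustration, and the "driver" is a software stand-in that simply calls memcpy().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical interface for illustration only; NOT the kernel DMA API. */
typedef int copy_cookie_t;

struct copy_engine {
    /* Queue a copy; returns a cookie identifying it, or -1 on failure. */
    copy_cookie_t (*submit)(struct copy_engine *eng, void *dst,
                            const void *src, size_t len);
    /* Returns 1 once the copy identified by the cookie is done, else 0. */
    int (*is_complete)(struct copy_engine *eng, copy_cookie_t cookie);
    copy_cookie_t last_issued;  /* cookie of the most recent submission */
    copy_cookie_t last_done;    /* cookie of the most recent completion */
};

/* Stand-in "driver": performs the copy immediately with memcpy(). */
static copy_cookie_t sw_submit(struct copy_engine *eng, void *dst,
                               const void *src, size_t len)
{
    memcpy(dst, src, len);
    eng->last_issued++;
    eng->last_done = eng->last_issued;
    return eng->last_issued;
}

static int sw_is_complete(struct copy_engine *eng, copy_cookie_t cookie)
{
    return cookie <= eng->last_done;
}

/* A driver registers itself by filling in the callback table. */
void sw_engine_init(struct copy_engine *eng)
{
    assert(eng != NULL);
    eng->submit = sw_submit;
    eng->is_complete = sw_is_complete;
    eng->last_issued = 0;
    eng->last_done = 0;
}

/* Client code sees only the callbacks, never the hardware behind them. */
int client_copy(struct copy_engine *eng, void *dst, const void *src,
                size_t len)
{
    copy_cookie_t cookie = eng->submit(eng, dst, src, len);

    if (cookie < 0)
        return -1;
    while (!eng->is_complete(eng, cookie))
        ;  /* a real client would do other work here instead of spinning */
    return 0;
}
```

A caller would initialise an engine with sw_engine_init() and then call client_copy(); swapping in a different driver only changes which functions the table points to.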


Net stack modifications

The net stack modifications, given that they touch very important code, have received the most scrutiny. The significant changes are:

  1. Data members have been added (to struct sk_buff and struct sock_common most notably)
  2. sk_eat_skb() has an added parameter
  3. tcp_recvmsg(): Code added to pin user buffer memory on entry. Code added to wait for async copies to complete, and unpin memory, before exiting.
  4. tcp_rcv_established(): Code added to initiate async copies if possible. dma_try_early_copy() added to tcp.c.
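The ordering described for tcp_recvmsg() and tcp_rcv_established() (pin on entry, start copies early where possible, wait and unpin before exit) can be modelled in user space roughly as follows. This is a simplified sketch, not the patch code: struct recv_ctx, recv_begin(), try_early_copy() and recv_end() are hypothetical names, plain pointers stand in for SKBs and pinned pages, and the "engine" is simulated by draining the queue with memcpy() at wait time.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_PENDING 8

/* One queued copy; a stand-in for a descriptor handed to the DMA engine. */
struct pending_copy {
    void *dst;
    const void *src;
    size_t len;
};

/* Hypothetical per-receive state, for illustration only. */
struct recv_ctx {
    int pinned;                 /* 1 while the user buffer counts as pinned */
    size_t n_pending;           /* copies queued but not yet completed */
    struct pending_copy q[MAX_PENDING];
};

/* Entry to the receive call: "pin" the user buffer. */
void recv_begin(struct recv_ctx *c)
{
    assert(c != NULL);
    c->pinned = 1;
    c->n_pending = 0;
}

/*
 * Packet arrival: queue an async copy if a pinned buffer is waiting.
 * Returns 1 if queued, 0 if the caller must fall back to a normal copy.
 */
int try_early_copy(struct recv_ctx *c, void *dst, const void *src,
                   size_t len)
{
    if (!c->pinned || c->n_pending == MAX_PENDING)
        return 0;
    c->q[c->n_pending++] = (struct pending_copy){ dst, src, len };
    return 1;
}

/* Exit from the receive call: wait for all copies, then unpin. */
void recv_end(struct recv_ctx *c)
{
    size_t i;

    /* Simulated completion: the "engine" finishes every queued copy. */
    for (i = 0; i < c->n_pending; i++)
        memcpy(c->q[i].dst, c->q[i].src, c->q[i].len);
    c->n_pending = 0;
    c->pinned = 0;
}
```

The point of the ordering is that the user buffer must stay pinned for as long as any copy may still be in flight, which is why the wait comes before the unpin.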


Patches

Updated to ioat-1.7 and netdev latest git (20060508)

  1. DMA subsystem
  2. HW driver
  3. set up net as DMA client
  4. utility functions
  5. structure changes
  6. rename cleanup_rbuf and make non-static
  7. modify sk_eat_skb
  8. add sysctl for copy size tuning
  9. modify the stack to do recv copy offload


Kernel acceptance status

Intel presented technical information at OLS 2005, but no code. All code except the HW driver was posted for review in November 2005. An updated patch set including the HW driver was posted on March 3, 2006, and posted again on March 29, 2006, incorporating feedback from the development community.

I/OAT has been queued for 2.6.18.


Performance data

This is the initial data we posted to netdev on March 16, 2006.

initial Chariot portscaling without data access

This is more Chariot data, but it also includes results with Chariot's data verification turned on, so the received data is touched. The CPU gap is narrower (especially with 8 ports) but still noteworthy.

later Chariot portscaling with and without data access

This data shows that I/OAT really benefits from larger application buffer sizes. There is a CPU utilization spike at 2K, although throughput also increases. This could be eliminated by increasing the tcp_dma_copybreak sysctl ("echo 4096 > /proc/sys/net/ipv4/tcp_dma_copybreak"), which disables I/OAT at or below that application buffer size.
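The effect of the copybreak setting can be written as a one-line predicate. This is a minimal sketch of the behaviour described above, not the kernel's actual check: should_offload_copy() is a hypothetical name, and the real receive path may apply further conditions that this page does not describe.

```c
#include <stddef.h>

/*
 * Hypothetical helper: returns 1 if a receive into an application buffer
 * of app_buf_len bytes would use the DMA engine, given the copybreak
 * threshold. Buffers at or below the threshold are copied by the CPU.
 */
int should_offload_copy(size_t app_buf_len, size_t copybreak)
{
    return app_buf_len > copybreak;
}
```

With the threshold set to 4096, a 2K or 4K application buffer would be copied by the CPU and an 8K buffer would be offloaded.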

Chariot using different application buffer sizes

This shows netperf performance. Note that we are using fewer clients than in the Chariot tests. There is a slight CPU savings at larger application buffer sizes, but it is less noteworthy than with Chariot.

netperf using different application buffer sizes

This data shows 6 individual runs of Tbench, with a 7-10% drop in CPU utilization.

Tbench showing CPU utilization across six runs

These are results from SPECWeb. Since this is a TX test, I/OAT should not affect performance, and the results indicate that it does not.

SPECWeb with no I/OAT

SPECWeb with I/OAT enabled

This data shows results with different numbers of ports. It includes both standard netperf data and results using a new option, present only in netperf's SVN repo, that touches the data after it is received.

netperf showing port scaling with touched data
