RDMA over IB is twice as slow as TCP with IP over IB

DRBD 9.3.0

The RDMA transport driver is very slow on modern hardware like the NVIDIA ConnectX-7; it struggles to utilize the adapter's 400 Gb/s bandwidth. The sndbuf-size and rcvbuf-size maximums appear to be too small: it is impossible to specify buffer sizes larger than 25M (sndbuf-size + rcvbuf-size + rdma-ctrl-sndbuf-size + rdma-ctrl-rcvbuf-size = 52424K max), and even with 25M buffers I see thousands of "Not sending flow_control msg, no receive window" messages under heavy load.

Additionally, dmesg shows many warnings suggesting a switch to WQ_UNBOUND (with both the RDMA and TCP transports):

workqueue: iomap_dio_complete_work hogged CPU for >10000us 8 times, consider switching to WQ_UNBOUND

When I switch the transport from RDMA to TCP (using IP over InfiniBand), performance roughly doubles.

I have configured 24 DRBD resources and pinned each one to a dedicated CPU core (reserved exclusively for DRBD):

resource "disk0" {
    device minor 0;
    disk "/dev/disk/by-partlabel/drbd0";
    meta-disk internal;
    options {
        cpu-mask "00000000,00000000,00000001";
    }
    on "hosta" {
        node-id 0;
    }
    on "hostb" {
        node-id 1;
    }
    connection {
        host "hosta" address 192.168.1.1:7789;
        host "hostb" address 192.168.1.2:7789;
    }
}
...
resource "disk23" {
    device minor 23;
    disk "/dev/disk/by-partlabel/drbd23";
    meta-disk internal;
    options {
        cpu-mask "00000000,00000008,00000000";
    }
    on "hosta" {
        node-id 0;
    }
    on "hostb" {
        node-id 1;
    }
    connection {
        host "hosta" address 192.168.1.1:7812;
        host "hostb" address 192.168.1.2:7812;
    }
}
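Incidentally, the cpu-mask values use the same comma-separated 32-bit hex-group layout as Linux cpumask files such as /proc/irq/*/smp_affinity. As an illustration (the helper below is hypothetical, not part of DRBD), a single-CPU mask can be generated like this:

```python
def drbd_cpu_mask(cpu: int, groups: int = 3) -> str:
    """Build a cpu-mask string of comma-separated 32-bit hex groups
    (least-significant group on the right) with a single CPU bit set."""
    mask = 1 << cpu
    parts = []
    for _ in range(groups):
        parts.append(f"{mask & 0xFFFFFFFF:08x}")  # lowest 32 bits of the mask
        mask >>= 32
    return ",".join(reversed(parts))              # most-significant group first

print(drbd_cpu_mask(0))   # -> 00000000,00000000,00000001  (as in disk0)
print(drbd_cpu_mask(35))  # -> 00000000,00000008,00000000  (as in disk23)
```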

For my write performance test, I used this FIO script:

[seq-write-24]
filename=/disks/0/fio:/disks/1/fio:/disks/2/fio:/disks/3/fio:/disks/4/fio:/disks/5/fio:/disks/6/fio:/disks/7/fio:/disks/8/fio:/disks/9/fio:/disks/10/fio:/disks/11/fio:/disks/12/fio:/disks/13/fio:/disks/14/fio:/disks/15/fio:/disks/16/fio:/disks/17/fio:/disks/18/fio:/disks/19/fio:/disks/20/fio:/disks/21/fio:/disks/22/fio:/disks/23/fio
file_service_type=roundrobin:1
size=2400G
rw=write
bs=1M
direct=1
buffered=0
numjobs=24
time_based=1
runtime=60
ioengine=libaio
iodepth=24
group_reporting

With RDMA and this configuration, I got 12.2 GB/s overall:

net {
    protocol C;
    transport "rdma";
    sndbuf-size 25M;
    rcvbuf-size 25M;
    rdma-ctrl-sndbuf-size 600K;
    rdma-ctrl-rcvbuf-size 600K;
    max-buffers 128K;
    max-epoch-size 20000;
}

With TCP and this configuration, I got 24.3 GB/s overall:

net {
    protocol C;
    transport "tcp";
    sndbuf-size 128M;
    rcvbuf-size 128M;
    max-buffers 128K;
    max-epoch-size 20000;
}

Without replication (DRBD down on hostb), I achieved 129 GB/s.
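For scale: assuming protocol C sends each write once over the 400 Gb/s link (about 50 GB/s of raw line rate, ignoring protocol overhead), the two transports reach roughly the following fractions of it:

```python
line_rate_gbs = 400 / 8  # 400 Gb/s link = 50 GB/s payload, overhead ignored

# Measured aggregate write throughput from the fio runs above
for name, gbs in [("RDMA", 12.2), ("TCP (IPoIB)", 24.3)]:
    print(f"{name}: {gbs} GB/s = {gbs / line_rate_gbs:.0%} of line rate")
```

This prints roughly 24% of line rate for RDMA and 49% for TCP, so even the faster transport is only halfway to saturating the adapter.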


Thank you for sharing these tests and results.

I have ConnectX-5 cards available for my own testing and typically see around ~15% better performance with RDMA than with TCP (IPoIB), but I usually test random writes with small block sizes. When testing sequential writes with large block sizes, I also see RDMA come in ~3% slower than TCP. Not as drastic as your results, but still counterintuitive.

I will ask the LINBIT kernel module developers for input and report back with any insight.

Thank you for your feedback and for investigating further.

I believe the core issue comes down to buffer limitations. At 400 Gb/s, a 25 MB buffer drains in approximately half a millisecond, so a single RDMA connection stalls waiting for flow-control credits and cannot keep the link busy.

The real bottleneck appears to be the maximum number of Work Requests (WRs) per RDMA connection, which constrains the total buffer size. With eight scatter-gather elements per WR, the limit is 13,106 WRs; assuming 4 KB buffers, the effective buffer capacity (send + receive) reaches only about 51 MB.
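Both numbers check out arithmetically (a quick sketch using the figures above):

```python
# Time to drain a 25 MiB buffer over a 400 Gb/s link
buf_bytes = 25 * 1024 * 1024
link_bytes_per_s = 400e9 / 8
print(f"drain time: {buf_bytes / link_bytes_per_s * 1e3:.2f} ms")  # ~0.52 ms

# Total buffer capacity implied by the per-connection WR limit
max_wrs = 13106        # WR limit quoted above
buf_per_wr = 4 * 1024  # 4 KiB per WR
print(f"capacity: {max_wrs * buf_per_wr / 2**20:.1f} MiB")  # ~51.2 MiB
```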

I believe the problem could potentially be addressed in two ways:

  1. Use multiple RDMA connections: NVMe over Fabrics (RDMA) uses one connection per CPU, which might distribute the load more effectively.
  2. Increase the buffer size per Work Request: buffers larger than 4 KB would raise the total capacity reachable within the WR limit.
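As a back-of-the-envelope check on option 2, the in-flight capacity scales linearly with the per-WR buffer size under the same WR limit (numbers assumed from the post above):

```python
max_wrs = 13106  # per-connection WR limit quoted above

for buf_kib in (4, 16, 64):
    cap_mib = max_wrs * buf_kib / 1024
    print(f"{buf_kib:>3} KiB per WR -> ~{cap_mib:.0f} MiB total (send + receive)")
```

With 16 KiB buffers this already reaches roughly 205 MiB, comfortably above the 128M sndbuf/rcvbuf that made the TCP transport fast.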