RDMA over IB is twice as slow as TCP with IP over IB

DRBD 9.3.0

The RDMA transport driver is very slow on modern hardware like the NVIDIA ConnectX-7; it struggles to utilize the adapter's 400 Gb/s bandwidth. The sndbuf-size and rcvbuf-size maximums appear to be too small: it is impossible to specify buffer sizes larger than 25M (sndbuf-size + rcvbuf-size + rdma-ctrl-sndbuf-size + rdma-ctrl-rcvbuf-size = 52424K max), and even with 25M buffers I see thousands of "Not sending flow_control msg, no receive window" messages under heavy load.

Additionally, dmesg shows many warnings suggesting a switch to WQ_UNBOUND (with both the RDMA and TCP transports):

workqueue: iomap_dio_complete_work hogged CPU for >10000us 8 times, consider switching to WQ_UNBOUND

When I switch the transport from RDMA to TCP (using IP over InfiniBand), performance roughly doubles.

I have configured 24 DRBD resources and pinned each one to a dedicated CPU core (reserved exclusively for DRBD):

resource "disk0" {
    device minor 0;
    disk "/dev/disk/by-partlabel/drbd0";
    meta-disk internal;
    options {
        cpu-mask "00000000,00000000,00000001";
    }
    on "hosta" {
        node-id 0;
    }
    on "hostb" {
        node-id 1;
    }
    connection {
        host "hosta" address 192.168.1.1:7789;
        host "hostb" address 192.168.1.2:7789;
    }
}
...
resource "disk23" {
    device minor 23;
    disk "/dev/disk/by-partlabel/drbd23";
    meta-disk internal;
    options {
        cpu-mask "00000000,00000008,00000000";
    }
    on "hosta" {
        node-id 0;
    }
    on "hostb" {
        node-id 1;
    }
    connection {
        host "hosta" address 192.168.1.1:7812;
        host "hostb" address 192.168.1.2:7812;
    }
}
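Incidentally, the cpu-mask values use the same comma-separated 32-bit hex-group layout as Linux cpumask files such as /proc/irq/*/smp_affinity. As an illustration (the helper below is hypothetical, not part of DRBD), a single-CPU mask can be generated like this:

```python
def drbd_cpu_mask(cpu: int, groups: int = 3) -> str:
    """Build a cpu-mask string of comma-separated 32-bit hex groups
    (least-significant group on the right) with a single CPU bit set."""
    mask = 1 << cpu
    parts = []
    for _ in range(groups):
        parts.append(f"{mask & 0xFFFFFFFF:08x}")  # lowest 32 bits of the mask
        mask >>= 32
    return ",".join(reversed(parts))              # most-significant group first

print(drbd_cpu_mask(0))   # -> 00000000,00000000,00000001  (as in disk0)
print(drbd_cpu_mask(35))  # -> 00000000,00000008,00000000  (as in disk23)
```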

For my write performance test, I used this FIO script:

[seq-write-24]
filename=/disks/0/fio:/disks/1/fio:/disks/2/fio:/disks/3/fio:/disks/4/fio:/disks/5/fio:/disks/6/fio:/disks/7/fio:/disks/8/fio:/disks/9/fio:/disks/10/fio:/disks/11/fio:/disks/12/fio:/disks/13/fio:/disks/14/fio:/disks/15/fio:/disks/16/fio:/disks/17/fio:/disks/18/fio:/disks/19/fio:/disks/20/fio:/disks/21/fio:/disks/22/fio:/disks/23/fio
file_service_type=roundrobin:1
size=2400G
rw=write
bs=1M
direct=1
buffered=0
numjobs=24
time_based=1
runtime=60
ioengine=libaio
iodepth=24
group_reporting

With RDMA and this configuration, I got 12.2 GB/s overall:

net {
    protocol C;
    transport "rdma";
    sndbuf-size 25M;
    rcvbuf-size 25M;
    rdma-ctrl-sndbuf-size 600K;
    rdma-ctrl-rcvbuf-size 600K;
    max-buffers 128K;
    max-epoch-size 20000;
}

With TCP and this configuration, I got 24.3 GB/s overall:

net {
    protocol C;
    transport "tcp";
    sndbuf-size 128M;
    rcvbuf-size 128M;
    max-buffers 128K;
    max-epoch-size 20000;
}

Without replication (DRBD down on hostb), I achieved 129 GB/s.
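For scale: assuming protocol C sends each write once over the 400 Gb/s link (about 50 GB/s of raw line rate, ignoring protocol overhead), the two transports reach roughly the following fractions of it:

```python
line_rate_gbs = 400 / 8  # 400 Gb/s link = 50 GB/s payload, overhead ignored

# Measured aggregate write throughput from the fio runs above
for name, gbs in [("RDMA", 12.2), ("TCP (IPoIB)", 24.3)]:
    print(f"{name}: {gbs} GB/s = {gbs / line_rate_gbs:.0%} of line rate")
```

This prints roughly 24% of line rate for RDMA and 49% for TCP, so even the faster transport is only halfway to saturating the adapter.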


Thank you for sharing these tests and results.

I have ConnectX-5 cards available for my own testing and typically see around ~15% better performance with RDMA than with TCP (IPoIB), but I usually test random writes with small block sizes. When testing sequential writes with large block sizes, I also see RDMA come in ~3% slower than TCP. Not as drastic as your results, but still counterintuitive.

I will ask the LINBIT kernel module developers for input and report back with any insight.

Thank you for your feedback and for investigating further.

I believe the core issue comes down to buffer limitations. At 400 Gb/s, a 25 MB buffer drains in approximately half a millisecond, so a single RDMA connection stalls waiting for flow-control credits and cannot keep the link busy.

The real bottleneck appears to be the maximum number of Work Requests (WRs) per RDMA connection, which constrains the total buffer size. With eight scatter-gather elements per WR, the limit is 13,106 WRs; assuming 4 KB buffers, the effective buffer capacity (send + receive) reaches only about 51 MB.
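Both numbers check out arithmetically (a quick sketch using the figures above):

```python
# Time to drain a 25 MiB buffer over a 400 Gb/s link
buf_bytes = 25 * 1024 * 1024
link_bytes_per_s = 400e9 / 8
print(f"drain time: {buf_bytes / link_bytes_per_s * 1e3:.2f} ms")  # ~0.52 ms

# Total buffer capacity implied by the per-connection WR limit
max_wrs = 13106        # WR limit quoted above
buf_per_wr = 4 * 1024  # 4 KiB per WR
print(f"capacity: {max_wrs * buf_per_wr / 2**20:.1f} MiB")  # ~51.2 MiB
```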

I believe the problem could potentially be addressed in two ways:

  1. Use multiple RDMA connections: NVMe over Fabrics (RDMA) uses one connection per CPU, which might distribute the load more effectively.
  2. Increase the buffer size per Work Request: buffers larger than 4 KB would raise the total capacity reachable within the WR limit.
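As a back-of-the-envelope check on option 2, the in-flight capacity scales linearly with the per-WR buffer size under the same WR limit (numbers assumed from the post above):

```python
max_wrs = 13106  # per-connection WR limit quoted above

for buf_kib in (4, 16, 64):
    cap_mib = max_wrs * buf_kib / 1024
    print(f"{buf_kib:>3} KiB per WR -> ~{cap_mib:.0f} MiB total (send + receive)")
```

With 16 KiB buffers this already reaches roughly 205 MiB, comfortably above the 128M sndbuf/rcvbuf that made the TCP transport fast.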