DRBD 9.3.0
The RDMA transport driver performs poorly on modern hardware such as the NVIDIA ConnectX-7; it cannot come close to saturating the adapter's 400 Gb/s bandwidth. The maximum values for sndbuf-size and rcvbuf-size appear to be too small: it is impossible to specify buffer sizes larger than 25M (sndbuf-size + rcvbuf-size + rdma-ctrl-sndbuf-size + rdma-ctrl-rcvbuf-size = 52424K max), and even with 25M buffers I see thousands of "Not sending flow_control msg, no receive window" messages under heavy load.
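For reference, the configured maxima sum to just under that cap. A quick sanity check (sizes in KiB, nothing DRBD-specific):

```python
# Largest buffer sizes DRBD accepts for the RDMA transport, in KiB.
# 52424 KiB is the observed hard cap on the sum.
sndbuf = rcvbuf = 25 * 1024        # 25M data send/receive buffers
ctrl_snd = ctrl_rcv = 600          # rdma-ctrl-* buffers
total = sndbuf + rcvbuf + ctrl_snd + ctrl_rcv
print(total)  # 52400 KiB, just under the 52424 KiB limit
```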
Additionally, dmesg contains many messages suggesting a switch to WQ_UNBOUND (with both the RDMA and TCP transports):
workqueue: iomap_dio_complete_work hogged CPU for >10000us 8 times, consider switching to WQ_UNBOUND
Notably, simply switching the transport from RDMA to TCP (using IP over InfiniBand) roughly doubles throughput.
I have configured 24 DRBD resources, each pinned to a dedicated CPU core reserved exclusively for DRBD:
resource "disk0" {
    device minor 0;
    disk "/dev/disk/by-partlabel/drbd0";
    meta-disk internal;
    options {
        cpu-mask "00000000,00000000,00000001";
    }
    on "hosta" {
        node-id 0;
    }
    on "hostb" {
        node-id 1;
    }
    connection {
        host "hosta" address 192.168.1.1:7789;
        host "hostb" address 192.168.1.2:7789;
    }
}
...
resource "disk23" {
    device minor 23;
    disk "/dev/disk/by-partlabel/drbd23";
    meta-disk internal;
    options {
        cpu-mask "00000000,00000008,00000000";
    }
    on "hosta" {
        node-id 0;
    }
    on "hostb" {
        node-id 1;
    }
    connection {
        host "hosta" address 192.168.1.1:7812;
        host "hostb" address 192.168.1.2:7812;
    }
}
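As a sanity check that each resource really lands on a distinct core, the comma-separated hex masks can be decoded with a few lines of Python, assuming the usual taskset convention (most significant 32-bit group first); cpu_mask_to_cpus is just a hypothetical helper name:

```python
def cpu_mask_to_cpus(mask: str) -> list[int]:
    """Decode a DRBD/taskset-style CPU mask into a list of CPU indices."""
    value = int(mask.replace(",", ""), 16)  # groups are most-significant first
    return [i for i in range(value.bit_length()) if (value >> i) & 1]

print(cpu_mask_to_cpus("00000000,00000000,00000001"))  # [0]  -> disk0
print(cpu_mask_to_cpus("00000000,00000008,00000000"))  # [35] -> disk23
```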
For the write performance test, I used this fio job file:
[seq-write-24]
filename=/disks/0/fio:/disks/1/fio:/disks/2/fio:/disks/3/fio:/disks/4/fio:/disks/5/fio:/disks/6/fio:/disks/7/fio:/disks/8/fio:/disks/9/fio:/disks/10/fio:/disks/11/fio:/disks/12/fio:/disks/13/fio:/disks/14/fio:/disks/15/fio:/disks/16/fio:/disks/17/fio:/disks/18/fio:/disks/19/fio:/disks/20/fio:/disks/21/fio:/disks/22/fio:/disks/23/fio
file_service_type=roundrobin:1
size=2400G
rw=write
bs=1M
direct=1
buffered=0
numjobs=24
time_based=1
runtime=60
ioengine=libaio
iodepth=24
group_reporting
With RDMA and this configuration, I got 12.2 GB/s overall:
net {
    protocol C;
    transport "rdma";
    sndbuf-size 25M;
    rcvbuf-size 25M;
    rdma-ctrl-sndbuf-size 600K;
    rdma-ctrl-rcvbuf-size 600K;
    max-buffers 128K;
    max-epoch-size 20000;
}
With TCP and this configuration, I got 24.3 GB/s overall:
net {
    protocol C;
    transport "tcp";
    sndbuf-size 128M;
    rcvbuf-size 128M;
    max-buffers 128K;
    max-epoch-size 20000;
}
Without replication (DRBD down on hostb), the same fio job achieves 129 GB/s.
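For comparison, the ratios between the three runs above work out as follows (plain arithmetic on the measured numbers):

```python
# Measured throughput from the three runs above, in GB/s
rdma, tcp, local = 12.2, 24.3, 129.0

print(f"TCP vs RDMA:   {tcp / rdma:.2f}x")    # ~2x, as noted above
print(f"local vs TCP:  {local / tcp:.2f}x")
print(f"local vs RDMA: {local / rdma:.2f}x")
```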