Hello,
I’m experiencing issues with multi-path configuration on DRBD 9.3.0 (9.33.0 tools) running on Debian 13.2. I’ve configured two separate paths for replication, but only one of them ever becomes active (established:yes); the other always stays at established:no. When the active path fails, the connection drops completely, even though the second path is available.
Context: These are test VMs to validate the configuration before deploying to production servers. The production servers will use RDMA (not TCP) for replication and have the following hardware:
- SuperMicro X10 dual-socket platform
- 64GB RAM
- 12x 2TB HDD drives
- 2x 1.92TB NVMe (will be a separate DRBD resource)
- 2x 10GbE RDMA-capable NICs
I’m also deciding between fewer CPU cores with higher frequency vs more cores with lower frequency. What’s more important for DRBD performance?
This is why it’s critical that multi-path works properly in the test environment before migration.
Test Environment:
- 2 nodes (stor01, stor02) - VMs
- 2 separate 10GbE networks (192.168.1.x and 192.168.2.x), each on a separate virtual switch
- MTU 9000 on all interfaces
- 2 resources (zfs-resource and disk-resource)
- Transport: TCP (production will use RDMA)
Configuration:
resource zfs-resource {
    device      /dev/drbd0;
    disk        /dev/zvol/drbd/drbd-zvol;
    meta-disk   internal;

    net {
        load-balance-paths yes;
        transport "tcp";
    }

    on stor01 {
        node-id 0;
    }
    on stor02 {
        node-id 1;
    }

    connection {
        path {
            host stor01 address ipv4 192.168.1.1:7789;
            host stor02 address ipv4 192.168.1.2:7789;
        }
        path {
            host stor01 address ipv4 192.168.2.1:7789;
            host stor02 address ipv4 192.168.2.2:7789;
        }
    }
}
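If useful, I can attach the parsed configuration exactly as drbdadm sees it:

# drbdadm dump zfs-resource
# drbdadm dump disk-resource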
Issues:
1. Only one path is established:
# drbdsetup events2 --now
exists path name:zfs-resource peer-node-id:1 conn-name:stor02 local:ipv4:192.168.2.1:7789 peer:ipv4:192.168.2.2:7789 established:no
exists path name:zfs-resource peer-node-id:1 conn-name:stor02 local:ipv4:192.168.1.1:7789 peer:ipv4:192.168.1.2:7789 established:yes
Only the 192.168.1.x path is active; the 192.168.2.x path stays at established:no. Both networks are fully functional - I can ping across each of them without issues (see the checks below).
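To be concrete, these are the checks I mean, run from stor01 against stor02's replication IPs:

# ping -c 3 192.168.1.2
# ping -c 3 192.168.2.2

Both succeed. If MTU could be a factor, I can also re-run them with full-size, non-fragmenting packets (e.g. ping -M do -s 8972 192.168.2.2) and report back.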
2. No failover between paths: When I bring down net-sw01 (192.168.1.x network):
# drbdadm status
disk-resource role:Primary
stor02 connection:Connecting
zfs-resource role:Primary
stor02 connection:Connecting
The connection drops completely even though the 192.168.2.x network is active and reachable. I expected DRBD to automatically switch to the second path.
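If it helps with the diagnosis, I can also capture the path states while net-sw01 is down, e.g.:

# drbdsetup events2 --now zfs-resource

and post the established:yes/no values from during the failure.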
3. Low performance with load-balance-paths:
- Without load-balance-paths: ~1300 MB/s write throughput
- With load-balance-paths yes: ~200-300 MB/s write throughput
Testing with:
mkfs.ext4 /dev/drbd0
mount /dev/drbd0 /mnt/disk
dd if=/dev/zero of=/mnt/disk/test bs=1M status=progress
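The dd run above goes through the page cache, so if a more controlled measurement would be useful I can repeat the comparison with direct I/O, for example:

fio --name=seqwrite --filename=/mnt/disk/test --rw=write --bs=1M --size=4G --direct=1 --ioengine=libaio --numjobs=1

(The fio parameters are just a first guess on my side - I'm happy to run whatever workload is more meaningful for comparing the two load-balance-paths settings.)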
Global configuration (global_common.conf):
common {
    net {
        protocol C;
        verify-alg crc32c;
        max-buffers 32000;
        sndbuf-size 16M;
        rcvbuf-size 16M;
        max-epoch-size 10000;
        tcp-cork no;
        timeout 10;
        ping-int 2;
        ping-timeout 1;
        connect-int 5;
        ko-count 3;
        after-sb-0pri discard-zero-changes;
        after-sb-1pri discard-secondary;
        after-sb-2pri disconnect;
    }
    disk {
        c-plan-ahead 20;
        c-max-rate 1G;
        c-fill-target 100M;
        c-min-rate 100M;
        c-delay-target 10;
        on-io-error detach;
        al-extents 65536;
    }
}
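In case any of these settings interact badly with load-balance-paths, I can also post the effective per-resource values as reported by:

# drbdsetup show zfs-resource --show-defaults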
Questions:
- Why is only one path established instead of both?
- How can I make DRBD fail over between paths when the active one fails?
- Why does load-balance-paths drastically reduce performance?
- Is additional configuration needed for multi-path to work properly?
- With the RDMA transport (for production), does multi-path behavior differ from TCP?
Additional questions for production setup:
Storage Architecture: I will have 2 separate DRBD resources:
- Resource 1 (HDD array): 12x 2TB HDD drives
- Resource 2 (NVMe array): 2x 1.92TB NVMe drives
For the HDD array, which option is better:
- Option A: Hardware RAID10 controller → DRBD directly on the RAID volume
- Option B: ZFS RAID10 (6x mirrors) → DRBD zvol on the ZFS pool (64GB RAM available for ZFS cache); a rough zpool/zvol sketch is further below
For the NVMe array:
- Option A: DRBD directly on both NVMe drives (no RAID/ZFS between them)
- Option B: ZFS mirror of both NVMe → DRBD zvol
Are there any known issues or performance impact when using DRBD on top of ZFS zvol?
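For reference, this is roughly how I would build Option B for the HDD pool - 6 mirrored vdevs (the ZFS equivalent of RAID10) plus a zvol for DRBD to sit on. Device names, sizes and volblocksize are placeholders, nothing is decided yet:

zpool create -o ashift=12 tank \
    mirror sda sdb mirror sdc sdd mirror sde sdf \
    mirror sdg sdh mirror sdi sdj mirror sdk sdl
zfs create -V 10T -o volblocksize=16k tank/drbd-zvol

The NVMe Option B would be the same idea with a single mirror vdev.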
CPU Selection: For the dual socket X10 platform, what’s more important for DRBD performance: fewer CPU cores with high frequency or more cores with lower frequency?
Thank you in advance for your help!
P.S.
I forgot to add: happy holidays to everyone, and fewer problems in the new year!