Multiple DRBD Paths Behavior

I have the following resource. It replicates and syncs fine over the second path. However, when that path goes down, it does not fail over to the first path. Why not?

[root@ha57a linstor.d]# cat site622.res
# This file was generated by LINSTOR (1.31.1), do not edit manually.
# Name
#   LINSTOR nodename: ha57a
#   Local hostname  : ha57a
# File generated at:
#   Local time      : 2025-06-27 21:32:16
#   UTC             : 2025-06-28 01:32:16

resource "site622"
{

    options
    {
        auto-promote no;
        on-no-quorum io-error;
        quorum off;
    }

    net
    {
        cram-hmac-alg     sha1;
        shared-secret     "<redacted>";
        verify-alg "crct10dif";
    }

    on "ha57a"
    {
        volume 0
        {
            disk        /dev/mapper/Linstor-Crypt-site622_00000;
            disk
            {
                discard-zeroes-if-aligned no;
            }
            meta-disk   internal;
            device      minor 1000;
        }
        node-id    0;
    }

    on "ha57b"
    {
        volume 0
        {
            disk        /dev/drbd/this/is/not/used;
            disk
            {
                discard-zeroes-if-aligned no;
            }
            meta-disk   internal;
            device      minor 1000;
        }
        node-id    1;
    }

    connection
    {
        path
        {
            host "ha57a" address ipv4 192.168.9.127:7000;
            host "ha57b" address ipv4 192.168.9.128:7000;
        }

        path
        {
            host "ha57a" address ipv4 198.51.100.127:7000;
            host "ha57b" address ipv4 198.51.100.128:7000;
        }
    }
}

[root@ha57a linstor.d]# lst n i l ha57a
╭───────────────────────────────────────────────────────────────────╮
┊ ha57a     ┊ NetInterface ┊ IP             ┊ Port ┊ EncryptionType ┊
╞═══════════════════════════════════════════════════════════════════╡
┊ + StltCon ┊ bond0-front  ┊ 192.168.9.127  ┊ 3366 ┊ PLAIN          ┊
┊ +         ┊ bond1-repl   ┊ 198.51.100.127 ┊      ┊                ┊
╰───────────────────────────────────────────────────────────────────╯
[root@ha57a linstor.d]# lst n i l ha57b
╭───────────────────────────────────────────────────────────────────╮
┊ ha57b     ┊ NetInterface ┊ IP             ┊ Port ┊ EncryptionType ┊
╞═══════════════════════════════════════════════════════════════════╡
┊ + StltCon ┊ bond0-front  ┊ 192.168.9.128  ┊ 3366 ┊ PLAIN          ┊
┊ +         ┊ bond1-repl   ┊ 198.51.100.128 ┊      ┊                ┊
╰───────────────────────────────────────────────────────────────────╯
[root@ha57a linstor.d]#
[root@ha57a linstor.d]# lst rc p l ha57a ha57b site622
╭─────────────────────────────────╮
┊ Key               ┊ Value       ┊
╞═════════════════════════════════╡
┊ Paths/path0/ha57a ┊ bond0-front ┊
┊ Paths/path0/ha57b ┊ bond0-front ┊
┊ Paths/path1/ha57a ┊ bond1-repl  ┊
┊ Paths/path1/ha57b ┊ bond1-repl  ┊
╰─────────────────────────────────╯
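
Aside: when reproducing this, it can help to watch what the DRBD kernel module itself reports for the connection and its paths, independent of LINSTOR. A minimal sketch using the resource name above (exact output fields vary by DRBD version):

# Dump the connection as the kernel currently sees it; both paths should show
# up here if they were actually applied
drbdsetup show site622

# Overall resource / connection / peer-device state
drbdsetup status site622 --verbose

# Print the current state objects and keep streaming state-change events
# while the interface is taken down
drbdsetup events2 --now site622

The events2 stream is the most useful of the three for this case, since it should show whether DRBD notices the path going away at all.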

How are you testing the path failures?

We have recently discovered some issues where DRBD's multiple-path support fails to detect a failure when it occurs at the link layer. Until we can better diagnose and improve this, we suggest using bonded links for the time being.
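
For reference, the kind of bonded link being suggested here is an active-backup bond with MII monitoring. A minimal nmcli sketch (the NIC names and the /24 prefix are placeholders; the address matches ha57a's bond1-repl interface above):

# Placeholder NIC names; substitute the real replication ports
nmcli con add type bond ifname bond1 con-name bond1 \
    bond.options "mode=active-backup,miimon=100"
nmcli con add type ethernet ifname ens1f0 con-name bond1-port0 master bond1
nmcli con add type ethernet ifname ens1f1 con-name bond1-port1 master bond1
nmcli con mod bond1 ipv4.method manual ipv4.addresses 198.51.100.127/24
nmcli con up bond1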

Just by downing the interface with the command “nmcli conn down bond1”.

Note that our servers have 4 interfaces: 2 bonded front-facing NICs (bond0), plus 2 bonded rear-facing NICs (bond1), where each bond is connected to a disjoint network.

I would assume it should only care about connectivity at the transport layer. If it’s lost there, then it should switch to the other path.
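
One way to test exactly that, transport-layer loss without any link-layer event, is to leave the interface up and instead drop the replication traffic with a firewall rule, so the only failure DRBD can possibly see is its own TCP transport timing out. A rough sketch against the addresses above, run on ha57a (iptables shown; an equivalent nftables rule works just as well):

# Block all TCP to/from ha57b's bond1-repl address; the link stays up, so any
# path switch DRBD performs has to come from its own transport timeouts
iptables -A INPUT  -p tcp -s 198.51.100.128 -j DROP
iptables -A OUTPUT -p tcp -d 198.51.100.128 -j DROP

# ... watch whether the connection re-establishes over the 192.168.9.x path ...

# Clean up afterwards
iptables -D INPUT  -p tcp -s 198.51.100.128 -j DROP
iptables -D OUTPUT -p tcp -d 198.51.100.128 -j DROP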

That is the idea, and why we consider this a known issue at the moment. It is on our TODO list to fix. In the meantime we suggest using bonded connections for network redundancy.
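
For what it's worth, how quickly DRBD declares the transport dead (and can therefore try the other path, once detection itself works) is governed by a handful of net-section options. The values below are the documented defaults; in a LINSTOR-managed setup they would be set through the LINSTOR controller rather than by editing the generated .res file, and tuning them does not work around the link-layer detection issue described above:

    net
    {
        ping-int       10;   # seconds between DRBD keepalive pings
        ping-timeout    5;   # tenths of a second to wait for a ping answer
        timeout        60;   # tenths of a second to wait for an expected reply
        connect-int    10;   # seconds between reconnect attempts
    }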

We already use bonded connections, as mentioned, and those have their own problems. With MII monitoring the bond only watches local link state, so it misses upstream failures beyond the directly connected switch. Even with arp_ip_target, if a backbone link in the cable plant goes bad (such as an interconnect between two adjacent datacenters), the bond still goes down. That’s why we have redundant fibers between our DCs, connected to disjoint networks: if DC link 1 goes down, then bond1 goes down, and we need DRBD to switch over to bond0. Do you have any idea how soon this problem will be addressed?
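
For reference, the arp_ip_target variant mentioned above looks roughly like this on an existing NetworkManager bond. The target address is only an illustration; it needs to be something beyond the local switch that is worth treating as proof the network is alive, and as noted it still takes the bond down (correctly) when the far side becomes unreachable:

# Switch bond1 from MII to ARP monitoring; 198.51.100.1 is a placeholder target
nmcli con mod bond1 bond.options \
    "mode=active-backup,arp_interval=1000,arp_ip_target=198.51.100.1"
nmcli con up bond1

# Show the active monitoring mode and which port is currently carrying traffic
cat /proc/net/bonding/bond1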
