Multiple DRBD Paths Behavior

I have the following resource. It replicates and syncs fine over the second path. However, when that path goes down, it does not fail over to the first path. Why not?

[root@ha57a linstor.d]# cat site622.res
# This file was generated by LINSTOR (1.31.1), do not edit manually.
# Name
#   LINSTOR nodename: ha57a
#   Local hostname  : ha57a
# File generated at:
#   Local time      : 2025-06-27 21:32:16
#   UTC             : 2025-06-28 01:32:16

resource "site622"
{

    options
    {
        auto-promote no;
        on-no-quorum io-error;
        quorum off;
    }

    net
    {
        cram-hmac-alg     sha1;
        shared-secret     "<redacted>";
        verify-alg "crct10dif";
    }

    on "ha57a"
    {
        volume 0
        {
            disk        /dev/mapper/Linstor-Crypt-site622_00000;
            disk
            {
                discard-zeroes-if-aligned no;
            }
            meta-disk   internal;
            device      minor 1000;
        }
        node-id    0;
    }

    on "ha57b"
    {
        volume 0
        {
            disk        /dev/drbd/this/is/not/used;
            disk
            {
                discard-zeroes-if-aligned no;
            }
            meta-disk   internal;
            device      minor 1000;
        }
        node-id    1;
    }

    connection
    {
        path
        {
            host "ha57a" address ipv4 192.168.9.127:7000;
            host "ha57b" address ipv4 192.168.9.128:7000;
        }

        path
        {
            host "ha57a" address ipv4 198.51.100.127:7000;
            host "ha57b" address ipv4 198.51.100.128:7000;
        }
    }
}

[root@ha57a linstor.d]# lst n i l ha57a
╭───────────────────────────────────────────────────────────────────╮
┊ ha57a     ┊ NetInterface ┊ IP             ┊ Port ┊ EncryptionType ┊
╞═══════════════════════════════════════════════════════════════════╡
┊ + StltCon ┊ bond0-front  ┊ 192.168.9.127  ┊ 3366 ┊ PLAIN          ┊
┊ +         ┊ bond1-repl   ┊ 198.51.100.127 ┊      ┊                ┊
╰───────────────────────────────────────────────────────────────────╯
[root@ha57a linstor.d]# lst n i l ha57b
╭───────────────────────────────────────────────────────────────────╮
┊ ha57b     ┊ NetInterface ┊ IP             ┊ Port ┊ EncryptionType ┊
╞═══════════════════════════════════════════════════════════════════╡
┊ + StltCon ┊ bond0-front  ┊ 192.168.9.128  ┊ 3366 ┊ PLAIN          ┊
┊ +         ┊ bond1-repl   ┊ 198.51.100.128 ┊      ┊                ┊
╰───────────────────────────────────────────────────────────────────╯
[root@ha57a linstor.d]#
[root@ha57a linstor.d]# lst rc p l ha57a ha57b site622
╭─────────────────────────────────╮
┊ Key               ┊ Value       ┊
╞═════════════════════════════════╡
┊ Paths/path0/ha57a ┊ bond0-front ┊
┊ Paths/path0/ha57b ┊ bond0-front ┊
┊ Paths/path1/ha57a ┊ bond1-repl  ┊
┊ Paths/path1/ha57b ┊ bond1-repl  ┊
╰─────────────────────────────────╯
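
Aside: when reproducing this, it can help to watch what the DRBD kernel module itself reports for the connection and its paths, independent of LINSTOR. A minimal sketch using the resource name above (exact output fields vary by DRBD version):

# Dump the connection as the kernel currently sees it; both paths should show
# up here if they were actually applied
drbdsetup show site622

# Overall resource / connection / peer-device state
drbdsetup status site622 --verbose

# Print the current state objects and keep streaming state-change events
# while the interface is taken down
drbdsetup events2 --now site622

The events2 stream is the most useful of the three for this case, since it should show whether DRBD notices the path going away at all.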

How are you testing the path failures?

We have recently discovered some issues where DRBD's multiple-path support fails to detect a failure when it occurs at the link layer. Until we can better diagnose and improve this, we suggest using bonded links for the time being.
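
For reference, the kind of bonded link being suggested here is an active-backup bond with MII monitoring. A minimal nmcli sketch (the NIC names and the /24 prefix are placeholders; the address matches ha57a's bond1-repl interface above):

# Placeholder NIC names; substitute the real replication ports
nmcli con add type bond ifname bond1 con-name bond1 \
    bond.options "mode=active-backup,miimon=100"
nmcli con add type ethernet ifname ens1f0 con-name bond1-port0 master bond1
nmcli con add type ethernet ifname ens1f1 con-name bond1-port1 master bond1
nmcli con mod bond1 ipv4.method manual ipv4.addresses 198.51.100.127/24
nmcli con up bond1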

Just by downing the interface with the command “nmcli conn down bond1”.

Note that our servers have 4 interfaces: 2 bonded front-facing NICs (bond0), plus 2 bonded rear-facing NICs (bond1), where each bond is connected to a disjoint network.

I would assume it should only care about connectivity at the transport layer. If it’s lost there, then it should switch to the other path.
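
One way to test exactly that, transport-layer loss without any link-layer event, is to leave the interface up and instead drop the replication traffic with a firewall rule, so the only failure DRBD can possibly see is its own TCP transport timing out. A rough sketch against the addresses above, run on ha57a (iptables shown; an equivalent nftables rule works just as well):

# Block all TCP to/from ha57b's bond1-repl address; the link stays up, so any
# path switch DRBD performs has to come from its own transport timeouts
iptables -A INPUT  -p tcp -s 198.51.100.128 -j DROP
iptables -A OUTPUT -p tcp -d 198.51.100.128 -j DROP

# ... watch whether the connection re-establishes over the 192.168.9.x path ...

# Clean up afterwards
iptables -D INPUT  -p tcp -s 198.51.100.128 -j DROP
iptables -D OUTPUT -p tcp -d 198.51.100.128 -j DROP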

That is the idea, and why we consider this a known issue at the moment. It is on our TODO list to fix. In the meantime we suggest using bonded connections for network redundancy.
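
For what it's worth, how quickly DRBD declares the transport dead (and can therefore try the other path, once detection itself works) is governed by a handful of net-section options. The values below are the documented defaults; in a LINSTOR-managed setup they would be set through the LINSTOR controller rather than by editing the generated .res file, and tuning them does not work around the link-layer detection issue described above:

    net
    {
        ping-int       10;   # seconds between DRBD keepalive pings
        ping-timeout    5;   # tenths of a second to wait for a ping answer
        timeout        60;   # tenths of a second to wait for an expected reply
        connect-int    10;   # seconds between reconnect attempts
    }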

We already use bonded connections, as mentioned, and those have their own problems. With MII monitoring the bond only watches local link state, so it misses upstream failures beyond the directly connected switch. Even with arp_ip_target, if a backbone link in the cable plant goes bad (such as an interconnect between two adjacent datacenters), the bond still goes down. That’s why we have redundant fibers between our DCs, connected to disjoint networks: if DC link 1 goes down, then bond1 goes down, and we need DRBD to switch over to bond0. Do you have any idea how soon this problem will be addressed?
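
For reference, the arp_ip_target variant mentioned above looks roughly like this on an existing NetworkManager bond. The target address is only an illustration; it needs to be something beyond the local switch that is worth treating as proof the network is alive, and as noted it still takes the bond down (correctly) when the far side becomes unreachable:

# Switch bond1 from MII to ARP monitoring; 198.51.100.1 is a placeholder target
nmcli con mod bond1 bond.options \
    "mode=active-backup,arp_interval=1000,arp_ip_target=198.51.100.1"
nmcli con up bond1

# Show the active monitoring mode and which port is currently carrying traffic
cat /proc/net/bonding/bond1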
