DRBD quorum doesn't work with this kind of disconnection

I’m new to DRBD, so maybe it’s just ignorance, but I ran into a problem with using quorum.
My environment: three AlmaLinux 9.4 servers, kmod-drbd9x-9.1.20-1.el9_4.elrepo, drbd9x-utils-9.28.0-1.el9.elrepo. Two servers have disks, the third is diskless. My configuration:

resource "md0r5" {
   device minor 1;
   disk "/dev/md0";
   meta-disk internal;

   net {
     protocol C;
     transport "tcp";
     load-balance-paths yes;
   }
   options {
     majority quorum;
   }

on "pcs1" {
     node-id 0;
   }
   on "pcs2" {
     node-id 1;
   }
   on "pcs3" {
     disk none;
     node-id 2;
   }

   connection {
     path {
       host pcs1 address 192.168.18.11:7789;
       host pcs2 address 192.168.18.21:7789;
     }
     path {
       host pcs1 address 192.168.19.11:7789;
       host pcs2 address 192.168.19.21:7789;
     }
   }
   connection {
     path {
       host pcs1 address 192.168.18.11:7789;
       host pcs3 address 192.168.18.31:7789;
     }
     path {
       host pcs1 address 192.168.19.11:7789;
       host pcs3 address 192.168.19.31:7789;
     }
   }
   connection {
     path {
       host pcs2 address 192.168.18.21:7789;
       host pcs3 address 192.168.18.31:7789;
     }
     path {
       host pcs2 address 192.168.19.21:7789;
       host pcs3 address 192.168.19.31:7789;
     }
   }
}

After starting DRBD, everything is online. If I disconnect the network of one server (both connections), the cluster maintains quorum and everything works. If I then disconnect a second server, the system reacts as I would expect: quorum is lost and further changes to the disk are not allowed. The problem occurs when I reconnect the servers. The cluster reports that everything is OK, the disks are UpToDate and all systems are connected. But when I disconnect the two servers again, the cluster no longer loses quorum and disk writes remain enabled.
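For reference, the state during these tests can be checked with the standard drbd-utils commands (using the resource name md0r5 from the configuration above); on DRBD 9 the events2 output should also show a quorum field:

# overall connection and disk states
drbdadm status md0r5

# one-shot, machine-readable state dump; the device line should
# include a quorum: field on DRBD 9
drbdsetup events2 --now md0r5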

Is this standard behavior? Can I somehow influence it by changing the configuration?

Thank you


Hi,

Three years ago I set up a CentOS 7.9 Apache/PostgreSQL cluster with Corosync/Pacemaker and DRBD.
DRBD was used to create a replicated filesystem to store the www directory and the PostgreSQL data directory, and Corosync/Pacemaker to start Apache and PostgreSQL on the backup node.

I had a hard time getting the free version of DRBD (the client had no budget) to play nice, because the free version did not handle asynchronous replication and I had network delays in the LAN.

If it helps:
I remember there were some scripts that I had to tweak in order to handle failover events and to get alerts if the cluster got into a split-brain state (if nodes failed and only one node was left). In such cases recovery should be a manual process.
But the client insisted they wanted automatic recovery.

I remember those helper scripts (handlers) are no longer included in more recent versions of DRBD.

in resource cfg file:

handlers {
  # send a mail alert to root when DRBD detects a split brain
  split-brain "/usr/lib/drbd/notify-split-brain.sh root";
}
# ...
net {
  protocol C;
  # automatic split-brain recovery policies:
  after-sb-0pri discard-zero-changes;   # no primaries: keep the data of the node that made changes
  after-sb-1pri discard-secondary;      # one primary: discard the secondary's changes
  after-sb-2pri call-pri-lost-after-sb; # two primaries: decide as with 0pri, then call pri-lost-after-sb on the losing node
}

This was the location on CentOS 7.9 with DRBD installed via the normal repos (with yum):

ls -ltr /usr/lib/drbd/notify-out-of-sync.sh
lrwxrwxrwx 1 root root 9 Nov  1 14:18 /usr/lib/drbd/notify-out-of-sync.sh -> notify.sh

Notice it was a symlink to the notify.sh script.
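If memory serves, the split-brain handler from the snippet above was the same kind of symlink, so checking for it should look similar (path as on that CentOS 7.9 install):

ls -l /usr/lib/drbd/notify-split-brain.sh
# also expected to point at the common notify.sh script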

Hope it points you in some direction.

Good luck, and let me know how it goes; I am curious about people's experiences with DRBD in production.

Hi there,

At first glance, the majority quorum; setting is incorrect. I’m guessing your cluster never actually had quorum configured? You can verify this with drbdsetup show; if you also want to see the default settings in use, run drbdsetup show --show-defaults instead.

options {
    majority quorum;
}

Should be (taken from this section of the UG):

options {
    quorum majority;
    on-no-quorum suspend-io;
    on-no-data-accessible suspend-io;
    on-suspended-primary-outdated force-secondary;
}

Keep in mind that the order of disconnecting/reconnecting nodes will affect quorum in different ways depending on diskless/diskful nodes, graceful versus non-graceful disconnection (think unplugging the Ethernet cable versus running drbdadm disconnect <res>), and so on. For example, connecting a diskful node that has lost quorum to a diskless/tiebreaker node will not restore quorum. These scenarios are well documented here.
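To illustrate the difference (a hedged sketch using the resource and node names from the original post; the interface names are placeholders):

# graceful: the peers are told the connection is being taken down
drbdadm disconnect md0r5:pcs3

# non-graceful: the peer simply disappears and DRBD only notices via timeouts,
# e.g. by taking both replication links down on that node
ip link set <first-replication-nic> down
ip link set <second-replication-nic> down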

After making changes to your DRBD resource config (run drbdadm adjust <res> on each node afterwards), quorum should function as expected. If not, we can dig into it further.
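As a rough sketch, assuming the resource keeps the name md0r5 from the original config, applying and verifying the change on each node could look like this:

# re-read the configuration file and apply whatever changed
drbdadm adjust md0r5

# confirm the quorum options are now in effect (defaults included)
drbdsetup show --show-defaults md0r5 | grep -E 'quorum|on-no|on-suspended'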

Just note that we never really recommend automatic split-brain recovery in production environments.

The DRBD User’s Guide says it best:

You rather want to look into fencing policies, quorum settings, cluster manager integration, and redundant cluster manager communication links to avoid data divergence in the first place.


Oh, you have a three-node setup, with one diskless “tiebreaker” node and a redundant network. Nice. Keep in mind that as long as one path works, the complete connection is considered healthy.
See the documentation on the quorum tiebreaker for when DRBD keeps quorum. Please also note that it makes a difference whether you gracefully disconnect a node (with drbdadm disconnect res:peer_name) or the connection is lost unexpectedly. See the documentation on last man standing.