For synchronous DRBD replication over TCP there are two dedicated links.
If I unplug the first link, failover to the second link works fine. However, if I also unplug the second link, both cluster nodes are immediately powered off.
Is this behaviour configurable? Are there any other options when all DRBD links have failed?
Well, if both links are offline and the nodes no longer see each other, what would you expect them to do? I think you can configure them to stay online, but then you are in for a split-brain sooner rather than later.
"Well, if both links are offline and the nodes no longer see each other, what would you expect them to do?"
The cluster nodes still see each other; only the dedicated DRBD replication links are offline. For instance, DRBD/Pacemaker could allow reads and block all client writes to the DRBD Primary volumes.
Or it could even keep accepting client writes on the DRBD Primary and wait for the replication links to come back online.
I fully agree that this may cause a split-brain later, but I would like to clarify: is shutting down the cluster nodes the only option in this case, or are there other possibilities?
This is most likely controlled by the fencing policy and the fence-peer handler in your DRBD configuration. If the nodes are powering off, then you likely have fencing resource-and-stonith; configured. The default here is dont-care.
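For reference, a DRBD 8.4-style sketch of those settings (the resource name r0 and the handler script paths are just typical examples shipped with drbd-utils, not taken from your setup; in DRBD 9 the fencing option lives in the net section instead):

    resource r0 {
      disk {
        # fencing policy: dont-care (default) | resource-only | resource-and-stonith
        # resource-and-stonith freezes I/O and calls the fence-peer handler,
        # which can end up power-fencing the peer through Pacemaker
        fencing resource-and-stonith;
      }
      handlers {
        # crm-fence-peer.sh places a Pacemaker constraint against the peer
        fence-peer          "/usr/lib/drbd/crm-fence-peer.sh";
        # remove that constraint again once the peer has resynced
        after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
      }
    }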
It might also be that Pacemaker is fencing, and powering down, the nodes. Perhaps the Corosync communication is configured to use the same links?
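You can check which addresses the rings are bound to with corosync-cfgtool -s. A rough corosync 2.x-style sketch of keeping the cluster rings on their own networks (the addresses and the udpu transport are placeholders, not taken from your config):

    totem {
        version: 2
        transport: udpu
        # run a second, independent ring so cluster membership does not
        # depend on the DRBD replication links
        rrp_mode: passive
    }
    nodelist {
        node {
            ring0_addr: 10.0.0.1    # management network (placeholder)
            ring1_addr: 10.0.1.1    # separate backup network (placeholder)
        }
        node {
            ring0_addr: 10.0.0.2
            ring1_addr: 10.0.1.2
        }
    }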
You may also want to consider changing the fencing action from off to reboot, and configuring a pcmk_delay_base on the fence device of one of the nodes. By delaying the fencing action of one node you should always be left with one survivor rather than having both nodes power down or reboot at the same time.
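A rough pcs sketch of both changes (the stonith device names node1-ipmi and node2-ipmi are placeholders for whatever fence devices you actually use):

    # reboot instead of powering nodes off when they are fenced
    pcs property set stonith-action=reboot

    # give one fence device a head start so the nodes cannot
    # fence each other at exactly the same time
    pcs stonith update node1-ipmi pcmk_delay_base=10s
    pcs stonith update node2-ipmi pcmk_delay_base=0s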