DRBD/Pacemaker failover issue

Hello

There is a two-node active/standby DRBD/Pacemaker cluster configured with a qdevice.
There are two resource groups (g-nfs, g-iscsi) configured to always run on the same node.

Could you please help me understand the following behavior: during a reboot of the active node memverge, ha-iscsi was successfully promoted on the standby node memverge2, while the ha-nfs promotion failed with a timeout. This blocked the start of the resource groups on the standby node memverge2.

Below is a more detailed description of the behaviour, along with logs and configs.

The cluster has quorum, but the qdevice is temporarily unavailable. Both nodes are up and running, and the resource groups are running on memverge.

Perform an OS update (dnf update) on memverge and reboot. The boot is blocked by the LUKS2 encryption password prompt.

The cluster is without quorum, and all resource groups are stopped.

The qdevice is back online and the cluster has quorum again, but the resource groups are still stopped, with the following status:

Cluster Summary:
  * Stack: corosync (Pacemaker is running)
  * Current DC: memverge2 (28) (version 3.0.0-5.1.el10_0-48413c8) - partition with quorum
  * Last updated: Sat Oct 18 09:12:43 2025 on memverge2
  * Last change:  Sat Oct 18 09:06:03 2025 by root via root on memverge2
  * 2 nodes configured
  * 22 resource instances configured

Node List:
  * Node memverge (27): OFFLINE
  * Node memverge2 (28): online, feature set 3.20.1

Full List of Resources:
  * ipmi-fence-memverge2        (stonith:fence_ipmilan):         Stopped
  * ipmi-fence-memverge (stonith:fence_ipmilan):         Started memverge2
  * Clone Set: ha-nfs-clone [ha-nfs] (promotable):
    * ha-nfs    (ocf:linbit:drbd):       Unpromoted memverge2
    * ha-nfs    (ocf:linbit:drbd):       Stopped
  * Resource Group: g-nfs:
    * pb_nfs    (ocf:heartbeat:portblock):       Stopped
    * ip0_nfs   (ocf:heartbeat:IPaddr2):         Stopped
    * fs_nfs_internal_info_HA   (ocf:heartbeat:Filesystem):      Stopped
    * fs_nfsshare_exports_HA    (ocf:heartbeat:Filesystem):      Stopped
    * nfsserver (ocf:heartbeat:nfsserver):       Stopped
    * expfs_nfsshare_exports_HA (ocf:heartbeat:exportfs):        Stopped
    * samba_service     (systemd:smb):   Stopped
    * fs_sambashare_exports_HA  (ocf:heartbeat:Filesystem):      Stopped
    * punb_nfs  (ocf:heartbeat:portblock):       Stopped
  * Clone Set: ha-iscsi-clone [ha-iscsi] (promotable):
    * ha-iscsi  (ocf:linbit:drbd):       Promoted memverge2
    * ha-iscsi  (ocf:linbit:drbd):       Stopped
  * Resource Group: g-iscsi:
    * pb_iscsi  (ocf:heartbeat:portblock):       Stopped
    * ip0_iscsi (ocf:heartbeat:IPaddr2):         Stopped
    * ip1_iscsi (ocf:heartbeat:IPaddr2):         Stopped
    * iscsi_target      (ocf:heartbeat:iSCSITarget):     Stopped
    * iscsi_lun_drbd3   (ocf:heartbeat:iSCSILogicalUnit):        Stopped
    * iscsi_lun_drbd4   (ocf:heartbeat:iSCSILogicalUnit):        Stopped
    * punb_iscsi        (ocf:heartbeat:portblock):       Stopped


Failed Resource Actions:
  * ha-nfs_promote_0 on memverge2 'Error occurred' (1): call=133, status='Timed out', exitreason='Resource agent did not complete within 1m30s', last-rc-change='Sat Oct 18 09:06:04 2025', queued=0ms, exec=90738ms
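(Side note: as far as I understand, once the underlying cause is fixed, the failed promote can be inspected and cleared with the standard pcs commands so that Pacemaker retries it, e.g.

  pcs resource failcount show ha-nfs
  pcs resource cleanup ha-nfs

but my main question is why the promote times out in the first place.)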



[root@memverge2 ~]# cat /var/log/messages|grep -i ha-nfs_promote
Oct 18 09:06:04 memverge2 pacemaker-controld[2939]: notice: Initiating promote operation ha-nfs_promote_0 locally on memverge2
Oct 18 09:07:35 memverge2 pacemaker-controld[2939]: notice: Transition 803 action 7 (ha-nfs_promote_0 on memverge2): expected 'OK' but got 'Error occurred'
[root@memverge2 ~]#
[root@memverge2 ~]# cat /var/log/messages|grep -i ha-iscsi_promote
Oct 18 09:06:03 memverge2 pacemaker-controld[2939]: notice: Initiating promote operation ha-iscsi_promote_0 locally on memverge2
Oct 18 09:11:48 memverge2 pacemaker-controld[2939]: notice: Initiating promote operation ha-iscsi_promote_0 locally on memverge2

[root@memverge2 ~]# cat /var/log/pacemaker/pacemaker.log|grep -i ha-nfs_promote
Oct 18 09:06:04.547 memverge2 pacemaker-controld  [2939] (execute_rsc_action)   notice: Initiating promote operation ha-nfs_promote_0 locally on memverge2 | action 7
Oct 18 09:06:04.547 memverge2 pacemaker-controld  [2939] (do_lrm_rsc_op)        notice: Requesting local execution of promote operation for ha-nfs on memverge2 | transition_key=7:803:0:b9d3697a-e33b-4b53-bb46-35e2e80d0309 op_key=ha-nfs_promote_0
Oct 18 09:06:04.547 memverge2 pacemaker-based     [2934] (cib_perform_op)       info: +  /cib/status/node_state[@id='28']/lrm[@id='28']/lrm_resources/lrm_resource[@id='ha-nfs']/lrm_rsc_op[@id='ha-nfs_last_0']:  @operation_key=ha-nfs_promote_0, @operation=promote, @crm-debug-origin=controld_update_resource_history, @transition-key=7:803:0:b9d3697a-e33b-4b53-bb46-35e2e80d0309, @transition-magic=-1:193;7:803:0:b9d3697a-e33b-4b53-bb46-35e2e80d0309, @call-id=-1, @rc-code=193, @op-status=-1, @last-rc-change=1760767564, @exec-time=0
Oct 18 09:07:35.285 memverge2 pacemaker-execd     [2936] (async_action_complete)        info: Resource agent ha-nfs_promote_0[3534142] timed out after 1m30s
Oct 18 09:07:35.285 memverge2 pacemaker-controld  [2939] (log_executor_event)   error: Result of promote operation for ha-nfs on memverge2: Timed out after 1m30s (Resource agent did not complete within 1m30s) | graph action confirmed; call=133 key=ha-nfs_promote_0
Oct 18 09:07:35.286 memverge2 pacemaker-based     [2934] (cib_perform_op)       info: ++ /cib/status/node_state[@id='28']/lrm[@id='28']/lrm_resources/lrm_resource[@id='ha-nfs']:  <lrm_rsc_op id="ha-nfs_last_failure_0" operation_key="ha-nfs_promote_0" operation="promote" crm-debug-origin="controld_update_resource_history" crm_feature_set="3.20.1" transition-key="7:803:0:b9d3697a-e33b-4b53-bb46-35e2e80d0309" transition-magic="2:1;7:803:0:b9d3697a-e33b-4b53-bb46-35e2e80d0309" exit-reason="Resource agent did not complete within 1m30s" on_node="memverge2" call-id="133" rc-code="1" op-status="2" interval="0" last-rc-change="1760767564" exec-time="90738" queue-time="0" op-digest="876800cf6419330beb37dc5fd73e30fb"/>
Oct 18 09:07:35.286 memverge2 pacemaker-controld  [2939] (abort_transition_graph)       info: Transition 803 aborted by operation ha-nfs_promote_0 'modify' on memverge2: Event failed | magic=2:1;7:803:0:b9d3697a-e33b-4b53-bb46-35e2e80d0309 cib=0.4055.28 source=process_graph_event:559 complete=false
Oct 18 09:07:35.286 memverge2 pacemaker-controld  [2939] (process_graph_event)  notice: Transition 803 action 7 (ha-nfs_promote_0 on memverge2): expected 'OK' but got 'Error occurred' | target-rc=0 rc=1 call-id=133
[root@memverge2 ~]#

[root@memverge2 ~]# cat /var/log/pacemaker/pacemaker.log|grep -i ha-iscsi_promote
Oct 18 09:06:03.193 memverge2 pacemaker-controld  [2939] (execute_rsc_action)   notice: Initiating promote operation ha-iscsi_promote_0 locally on memverge2 | action 71
Oct 18 09:06:03.193 memverge2 pacemaker-controld  [2939] (do_lrm_rsc_op)        notice: Requesting local execution of promote operation for ha-iscsi on memverge2 | transition_key=71:801:0:b9d3697a-e33b-4b53-bb46-35e2e80d0309 op_key=ha-iscsi_promote_0
Oct 18 09:06:03.194 memverge2 pacemaker-based     [2934] (cib_perform_op)       info: +  /cib/status/node_state[@id='28']/lrm[@id='28']/lrm_resources/lrm_resource[@id='ha-iscsi']/lrm_rsc_op[@id='ha-iscsi_last_0']:  @operation_key=ha-iscsi_promote_0, @operation=promote, @crm-debug-origin=controld_update_resource_history, @transition-key=71:801:0:b9d3697a-e33b-4b53-bb46-35e2e80d0309, @transition-magic=-1:193;71:801:0:b9d3697a-e33b-4b53-bb46-35e2e80d0309, @call-id=-1, @rc-code=193, @op-status=-1, @last-rc-change=1760767563, @exec-time=0
Oct 18 09:06:03.382 memverge2 pacemaker-controld  [2939] (log_executor_event)   notice: Result of promote operation for ha-iscsi on memverge2: OK | graph action confirmed; call=125 key=ha-iscsi_promote_0 rc=0
Oct 18 09:06:03.384 memverge2 pacemaker-controld  [2939] (process_graph_event)  info: Transition 801 action 71 (ha-iscsi_promote_0 on memverge2) confirmed: OK | rc=0 call-id=125
Oct 18 09:11:48.123 memverge2 pacemaker-controld  [2939] (execute_rsc_action)   notice: Initiating promote operation ha-iscsi_promote_0 locally on memverge2 | action 37
Oct 18 09:11:48.123 memverge2 pacemaker-controld  [2939] (do_lrm_rsc_op)        notice: Requesting local execution of promote operation for ha-iscsi on memverge2 | transition_key=37:807:0:b9d3697a-e33b-4b53-bb46-35e2e80d0309 op_key=ha-iscsi_promote_0
Oct 18 09:11:48.124 memverge2 pacemaker-based     [2934] (cib_perform_op)       info: +  /cib/status/node_state[@id='28']/lrm[@id='28']/lrm_resources/lrm_resource[@id='ha-iscsi']/lrm_rsc_op[@id='ha-iscsi_last_0']:  @operation_key=ha-iscsi_promote_0, @operation=promote, @transition-key=37:807:0:b9d3697a-e33b-4b53-bb46-35e2e80d0309, @transition-magic=-1:193;37:807:0:b9d3697a-e33b-4b53-bb46-35e2e80d0309, @call-id=-1, @rc-code=193, @op-status=-1, @last-rc-change=1760767908, @exec-time=0
Oct 18 09:11:48.154 memverge2 pacemaker-controld  [2939] (log_executor_event)   notice: Result of promote operation for ha-iscsi on memverge2: OK | graph action confirmed; call=156 key=ha-iscsi_promote_0 rc=0
Oct 18 09:11:48.155 memverge2 pacemaker-controld  [2939] (process_graph_event)  info: Transition 807 action 37 (ha-iscsi_promote_0 on memverge2) confirmed: OK | rc=0 call-id=156
[root@memverge2 ~]#
[root@memverge2 ~]# drbdadm status
ha-iscsi role:Primary
  volume:3 disk:UpToDate open:no
  volume:4 disk:UpToDate open:no
  memverge connection:Connecting

ha-nfs role:Secondary
  volume:1 disk:Consistent open:no
  volume:2 disk:Consistent open:no
  volume:5 disk:Consistent open:no
  memverge connection:Connecting

[root@memverge2 ~]#

Location Constraints:
  resource 'ipmi-fence-memverge' avoids node 'memverge' with score INFINITY (id: location-ipmi-fence-memverge-memverge--INFINITY)
  resource 'ipmi-fence-memverge2' avoids node 'memverge2' with score INFINITY (id: location-ipmi-fence-memverge2-memverge2--INFINITY)
  resource 'ha-iscsi-clone' (id: drbd-fence-by-handler-ha-iscsi-ha-iscsi-clone)
    Rules:
      Rule: role=Promoted score=-INFINITY (id: drbd-fence-by-handler-ha-iscsi-rule-ha-iscsi-clone)
        Expression: #uname ne memverge2 (id: drbd-fence-by-handler-ha-iscsi-expr-28-ha-iscsi-clone)
Colocation Constraints:
  Started resource 'g-nfs' with Promoted resource 'ha-nfs-clone' (id: colocation-g-nfs-ha-nfs-clone-INFINITY)
    score=INFINITY
  Started resource 'g-iscsi' with Promoted resource 'ha-iscsi-clone' (id: colocation-g-iscsi-ha-iscsi-clone-INFINITY)
    score=INFINITY
  Started resource 'g-nfs' with Started resource 'g-iscsi' (id: colocation-g-nfs-g-iscsi-INFINITY)
    score=INFINITY
Order Constraints:
  promote resource 'ha-nfs-clone' then start resource 'g-nfs' (id: order-ha-nfs-clone-g-nfs-mandatory)
  promote resource 'ha-iscsi-clone' then start resource 'g-iscsi' (id: order-ha-iscsi-clone-g-iscsi-mandatory)
  start resource 'g-nfs' then start resource 'g-iscsi' (id: order-g-nfs-g-iscsi-mandatory)

Anton

[root@memverge2 ~]# drbdadm status
ha-iscsi role:Primary
  volume:3 disk:UpToDate open:no
  volume:4 disk:UpToDate open:no
  memverge connection:Connecting

ha-nfs role:Secondary
  volume:1 disk:Consistent open:no
  volume:2 disk:Consistent open:no
  volume:5 disk:Consistent open:no
  memverge connection:Connecting

Your DRBD resources are not connected. I don't see the status of the resources from memverge, but I suspect they are likely in StandAlone. You've probably split-brained the resources at some point in all your testing (see the DRBD 9.0 User's Guide from LINBIT).
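If it does turn out to be a split-brain, the usual manual recovery is roughly the following (a sketch from memory, so please double-check it against the DRBD 9 User's Guide before running anything, and substitute the resource name as needed):

# on the node whose data you are willing to discard (the split-brain "victim"):
drbdadm disconnect ha-nfs
drbdadm secondary ha-nfs
drbdadm connect --discard-my-data ha-nfs

# on the surviving node (only needed if it reports StandAlone):
drbdadm connect ha-nfs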


Hmm… I just cross-checked it again. There were no split-brained resources at the beginning of the test. It is the same behaviour as the week before.

The cluster has quorum, but the qdevice is temporarily unavailable. Both nodes are up and running, and the resource groups are running on memverge.

[root@memverge ~]# drbdadm status
ha-iscsi role:Primary
  volume:3 disk:UpToDate open:yes
  volume:4 disk:UpToDate open:yes
  memverge2 role:Secondary
    volume:3 peer-disk:UpToDate
    volume:4 peer-disk:UpToDate

ha-nfs role:Primary
  volume:1 disk:UpToDate open:yes
  volume:2 disk:UpToDate open:yes
  volume:5 disk:UpToDate open:yes
  memverge2 role:Secondary
    volume:1 peer-disk:UpToDate
    volume:2 peer-disk:UpToDate
    volume:5 peer-disk:UpToDate

[root@memverge ~]#

Perform an OS update (dnf update, kernel 6.12.0-55.39.1 → 6.12.0-55.40.1) on memverge and reboot. The boot is blocked by the LUKS2 encryption password prompt.

The cluster is without quorum, and all resource groups are stopped.

Failed Resource Actions:
  * ha-nfs_promote_0 on memverge2 'Error occurred' (1): call=113, status='Timed out', exitreason='Resource agent did not complete within 1m30s', last-rc-change='Sun Oct 26 09:45:01 2025', queued=0ms, exec=90152ms

[root@memverge2 ~]# drbdadm status
# No currently configured DRBD found.
[root@memverge2 ~]#

The qdevice is back online and the cluster has quorum again, but the resource groups are still stopped, with the following status:

[root@memverge2 ~]# drbdadm status
ha-iscsi role:Primary
  volume:3 disk:UpToDate open:no
  volume:4 disk:UpToDate open:no
  memverge connection:Connecting

ha-nfs role:Secondary
  volume:1 disk:Consistent open:no
  volume:2 disk:Consistent open:no
  volume:5 disk:Consistent open:no
  memverge connection:Connecting

[root@memverge2 ~]#

Anton

I assume the qdevices were configured within Pacemaker, correct?
Are we using any kind of fencing or quorum within DRBD? If so, try disabling it and re-running the test. If that changes the test behavior, go ahead and share the DRBD configuration.

I assume the qdevices were configured within Pacemaker, correct?

Yes, correct.
I followed the instructions described at this link - https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/10/pdf/configuring_and_managing_high_availability_clusters/Red_Hat_Enterprise_Linux-10-Configuring_and_managing_high_availability_clusters-en-US.pdf, starting at page 151.

Are we using any kind of fencing or quorum within DRBD?

No. For fencing and quorum I use fence_ipmilan and a qdevice host.

Below are the DRBD configs for ha-iscsi and ha-nfs:

[root@memverge drbd.d]# cat ha-iscsi.res
resource ha-iscsi {

options {
      auto-promote no;
#      quorum majority;
#      on-no-quorum io-error;
#      on-no-data-accessible suspend-io;
#      on-suspended-primary-outdated force-secondary;
        }

handlers {
      fence-peer "/usr/lib/drbd/crm-fence-peer.9.sh";
      unfence-peer "/usr/lib/drbd/crm-unfence-peer.9.sh";
    }

disk {
        c-plan-ahead 0;
        resync-rate 32M;
        al-extents 6007;
       no-disk-flushes;
       no-md-flushes;
       no-disk-barrier;
       no-disk-drain;
     }

  volume 3 {
    device      /dev/drbd3;
    disk        /dev/mapper/object_block_nfs_vg-ha_block_exports_lv_with_vdo_1x8;
    meta-disk   internal;
  }
  volume 4 {
    device      /dev/drbd4;
    disk        /dev/mapper/object_block_nfs_vg-ha_block_exports_lv_without_vdo;
    meta-disk   internal;
  }

  on memverge {
#    address   10.72.14.152:7901;
    node-id   27;
  }
  on memverge2 {
#    address   10.72.14.154:7901;
    node-id   28;
  }

#  on qs {
#    volume 29 {
#      disk    none;
#    }
#    volume 30 {
#      disk    none;
#    }
#    address   10.72.14.156:7900;
#    node-id   29;
#  }

#  connection-mesh {
#    hosts memverge memverge2;
#                  }

net
    {
#        load-balance-paths      yes;
        transport tcp;
        protocol  C;
        sndbuf-size 10M;
        rcvbuf-size 10M;
        max-buffers 80K;
        max-epoch-size 20000;
        timeout 90;
        ping-timeout 10;
        ping-int 15;
        connect-int 15;
        fencing resource-and-stonith;
    }
connection
    {
        path
        {
            host memverge address 192.168.0.6:7901;
            host memverge2 address 192.168.0.8:7901;
        }
        path
        {
            host memverge address 1.1.1.6:7901;
            host memverge2 address 1.1.1.8:7901;
        }
net
    {
#        load-balance-paths      yes;
        transport tcp;
        protocol  C;
        sndbuf-size 10M;
        rcvbuf-size 10M;
        max-buffers 80K;
        max-epoch-size 20000;
        timeout 90;
        ping-timeout 10;
        ping-int 15;
        connect-int 15;
        fencing resource-and-stonith;
    }
    }

}
[root@memverge drbd.d]#
[root@memverge drbd.d]#
[root@memverge drbd.d]#
[root@memverge drbd.d]# cat ha-nfs.res
resource ha-nfs {

options {
      auto-promote no;
#      quorum majority;
#      on-no-quorum io-error;
#      on-no-data-accessible suspend-io;
#      on-suspended-primary-outdated force-secondary;
        }

handlers {
      fence-peer "/usr/lib/drbd/crm-fence-peer.9.sh";
      unfence-peer "/usr/lib/drbd/crm-unfence-peer.9.sh";
    }

disk {
        c-plan-ahead 0;
        resync-rate 32M;
        al-extents 6007;
     }

  volume 1 {
    device      /dev/drbd1;
    disk        /dev/mapper/object_block_nfs_vg-ha_nfs_exports_lv_with_vdo_1x8;
    meta-disk   internal;
  }
  volume 2 {
    device      /dev/drbd2;
    disk        /dev/mapper/object_block_nfs_vg-ha_nfs_internal_lv_without_vdo;
    meta-disk   internal;
  }
  volume 5 {
    device      /dev/drbd5;
    disk        /dev/mapper/object_block_nfs_vg-ha_samba_exports_lv_with_vdo_1x8;
    meta-disk   internal;
  }

  on memverge {
#    address   10.72.14.152:7900;
    node-id   27;
  }
  on memverge2 {
#    address   10.72.14.154:7900;
    node-id   28;
  }

#  on qs {
#    volume 29 {
#      disk    none;
#    }
#    volume 30 {
#      disk    none;
#    }
#    address   10.72.14.156:7900;
#    node-id   29;
#  }

#  connection-mesh {
#    hosts memverge memverge2;
#                  }

net
    {
#        load-balance-paths      yes;
        transport tcp;
        protocol  C;
        sndbuf-size 10M;
        rcvbuf-size 10M;
        max-buffers 80K;
        max-epoch-size 20000;
        timeout 90;
        ping-timeout 10;
        ping-int 15;
        connect-int 15;
        fencing resource-and-stonith;
    }
#               }
connection
    {
        path
        {
            host memverge address 192.168.0.6:7900;
            host memverge2 address 192.168.0.8:7900;
        }
        path
        {
            host memverge address 1.1.1.6:7900;
            host memverge2 address 1.1.1.8:7900;
        }
net
    {
#        load-balance-paths      yes;
        transport tcp;
        protocol  C;
        sndbuf-size 10M;
        rcvbuf-size 10M;
        max-buffers 80K;
        max-epoch-size 20000;
        timeout 90;
        ping-timeout 10;
        ping-int 15;
        connect-int 15;
        fencing resource-and-stonith;
    }
    }

}
[root@memverge drbd.d]#
[root@memverge drbd.d]#

Anton

Ah hah! Just as I suspected.

So, you do have fencing enabled. Note the fencing resource-and-stonith; line in your DRBD configuration. In that case, everything is working as I would expect.
With fencing configured within DRBD, a lone DRBD node will come up with a disk state of Consistent. In this state you cannot promote the DRBD device to Primary without using --force. This is by design, to prevent a lone DRBD node from going Primary after a reboot and causing a split-brain. Once the node can reach its peers again and compare the data, it will fix up the disk states, assuming no split-brain has already occurred. Essentially, it is a safeguard against the "I started up isolated from the rest of the nodes, so I have no way to guarantee my copy of the data is the right one" scenario.
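For completeness, if you ever do need to bring a lone node up out of that Consistent state, and you are certain its data is the copy you want, the manual override would look roughly like this (use with care, since this is exactly the safeguard you would be bypassing):

# verify the local state first: the disk should show Consistent and the peer unreachable
drbdadm status ha-nfs

# force the promotion; the local data is then treated as UpToDate
drbdadm primary --force ha-nfs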

Please try commenting this line out of the configuration and re-running your test. I suspect things will behave as you expect them to in that case. At that point, you need to decide which behavior you prefer. Less risk of split-brain, or the ability for a single node to boot and run without admin intervention (i.e. --force)? Personally, with proper fencing within Pacemaker and a qdevice, I would feel "safe enough" even without fencing within DRBD.
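Concretely, that would be this change in each of the two .res files (sketched here for ha-nfs; apply the same to ha-iscsi), followed by something like drbdadm adjust all so the running resources pick it up:

net
    {
        ...
#       fencing resource-and-stonith;
    }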

Please try commenting this line out of the configuration and re-running your test. I suspect things will behave as you expect them to in that case. At that point, you need to decide which behavior you prefer. Less risk of split-brain, or the ability for a single node to boot and run without admin intervention (i.e. --force)? Personally, with proper fencing within Pacemaker and a qdevice, I would feel "safe enough" even without fencing within DRBD.

To repeat the same test, I will wait for the next available kernel update. I hope it will arrive in the next few days.

I had started thinking along the same lines yesterday when I opened the DRBD conf files.
But there is something that confuses me a lot and that I can't explain.

Please note that when the qdevice comes back online and the cluster has quorum again, the resource groups are still stopped, with the following status:

[root@memverge2 ~]# drbdadm status
ha-iscsi role:Primary
  volume:3 disk:UpToDate open:no
  volume:4 disk:UpToDate open:no
  memverge connection:Connecting

ha-nfs role:Secondary
  volume:1 disk:Consistent open:no
  volume:2 disk:Consistent open:no
  volume:5 disk:Consistent open:no
  memverge connection:Connecting

The two DRBD resources are in different states. Why is ha-iscsi in the Primary/UpToDate state, while ha-nfs is in the Secondary/Consistent state? I would expect the two DRBD resources to be in the same state even with the current configuration: both resources Primary/UpToDate or both Secondary/Consistent, but not in different states.

Or am I wrong?

Anton

Please note that when the qdevice comes back online and the cluster has quorum again, the resource groups are still stopped, with the following status:

I think it is important to understand that DRBD is configured with only two nodes here and has no concept of quorum at all. Pacemaker does, sure, and it is Pacemaker which manages and starts the resource groups. I strongly believe it is DRBD that is preventing Pacemaker from starting the resource groups in this case, though.
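(As an aside: DRBD can be given its own quorum, but only by adding a third node, typically a diskless tiebreaker, which is what the commented-out options and the qs stanza in your .res files hint at. Very roughly, for ha-nfs it would look something like the sketch below. The node name, address and node-id are taken from your commented-out config, and you would also need connections from both data nodes to the third node, so treat this purely as illustration:

options {
    auto-promote no;
    quorum majority;
    on-no-quorum io-error;
}

on qs {
    volume 1 { disk none; }
    volume 2 { disk none; }
    volume 5 { disk none; }
    address   10.72.14.156:7900;
    node-id   29;
})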

While, yes, 90% of the time the two DRBD resources are going to be in the same state, they are still very much two completely independent DRBD resources. They each have their own set of data and DRBD metadata. It is not impossible for them to be in two different states, as you are observing.

The Consistent disk state you observe on ha-nfs is most likely because that resource was restarted while Pacemaker was trying to recover from the ha-nfs_promote_0 on memverge2 'Error occurred' (1): call=133, status='Timed out' error.
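You should be able to confirm that from the logs on memverge2, e.g. with something along these lines:

grep -iE 'ha-nfs_(demote|stop|start)' /var/log/messages

which should show demote/stop/start operations on ha-nfs around the time of the failed promote.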

One thing I did notice in the DRBD configuration is the duplicate net {} stanzas. I would expect this to produce a parser error. I am not sure how this works in your environment, but I would advise removing or commenting out one of them.
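A quick way to see what the parser actually ends up using is to dump the effective configuration on each node, e.g.:

drbdadm dump ha-nfs
drbdadm dump ha-iscsi

That should make it clear whether both stanzas survive parsing and which net options actually end up on the connection.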

There were no parser errors; drbdadm adjust all worked without errors. Anyway, I have now commented out the duplicated net {} stanza for both resources.

I also commented out fencing resource-and-stonith for both resources and repeated the entire original test without a kernel update (none is available yet). Everything looks good, and I will repeat the test with a kernel update, as in the original test.

But now I am worried about what the behaviour will be with the default fencing dont-care when all DRBD replication links fail, or are simply physically unplugged, as in my test case.
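I suppose I could first simulate that in software before pulling any cables, e.g. by blocking the DRBD ports on one node and watching drbdadm status on both sides (assuming iptables, or the nft equivalent, is available), something like:

# block both replication ports, ha-nfs (7900) and ha-iscsi (7901), on one node
iptables -A INPUT  -p tcp --dport 7900:7901 -j DROP
iptables -A OUTPUT -p tcp --dport 7900:7901 -j DROP

# observe the behaviour, then remove the rules again
iptables -D INPUT  -p tcp --dport 7900:7901 -j DROP
iptables -D OUTPUT -p tcp --dport 7900:7901 -j DROP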

Anton