DRBD/Pacemaker failover issue

Hello

There is a two-node active/standby DRBD/Pacemaker cluster configured with a qdevice.
There are two resource groups (g-nfs, g-iscsi) configured to always run on the same node.

Could you please help me understand the following behavior: during a reboot of the active node memverge, ha-iscsi was successfully promoted on the standby node memverge2, while the ha-nfs promotion failed with a timeout. This blocked the start of the resource groups on memverge2.

Below is a more detailed description of the behaviour, along with logs and configs.

Cluster has quorum, qdevice is temporarily unavailable. Both nodes are up and running; the resource groups are running on memverge.

Performed an OS update (dnf update) on memverge and rebooted. The boot is blocked at the LUKS2 encryption passphrase prompt.

The cluster is without quorum; all resource groups are stopped.

Qdevice is back online and the cluster has quorum again, but the resource groups are still stopped, with the following status:

Cluster Summary:
  * Stack: corosync (Pacemaker is running)
  * Current DC: memverge2 (28) (version 3.0.0-5.1.el10_0-48413c8) - partition with quorum
  * Last updated: Sat Oct 18 09:12:43 2025 on memverge2
  * Last change:  Sat Oct 18 09:06:03 2025 by root via root on memverge2
  * 2 nodes configured
  * 22 resource instances configured

Node List:
  * Node memverge (27): OFFLINE
  * Node memverge2 (28): online, feature set 3.20.1

Full List of Resources:
  * ipmi-fence-memverge2        (stonith:fence_ipmilan):         Stopped
  * ipmi-fence-memverge (stonith:fence_ipmilan):         Started memverge2
  * Clone Set: ha-nfs-clone [ha-nfs] (promotable):
    * ha-nfs    (ocf:linbit:drbd):       Unpromoted memverge2
    * ha-nfs    (ocf:linbit:drbd):       Stopped
  * Resource Group: g-nfs:
    * pb_nfs    (ocf:heartbeat:portblock):       Stopped
    * ip0_nfs   (ocf:heartbeat:IPaddr2):         Stopped
    * fs_nfs_internal_info_HA   (ocf:heartbeat:Filesystem):      Stopped
    * fs_nfsshare_exports_HA    (ocf:heartbeat:Filesystem):      Stopped
    * nfsserver (ocf:heartbeat:nfsserver):       Stopped
    * expfs_nfsshare_exports_HA (ocf:heartbeat:exportfs):        Stopped
    * samba_service     (systemd:smb):   Stopped
    * fs_sambashare_exports_HA  (ocf:heartbeat:Filesystem):      Stopped
    * punb_nfs  (ocf:heartbeat:portblock):       Stopped
  * Clone Set: ha-iscsi-clone [ha-iscsi] (promotable):
    * ha-iscsi  (ocf:linbit:drbd):       Promoted memverge2
    * ha-iscsi  (ocf:linbit:drbd):       Stopped
  * Resource Group: g-iscsi:
    * pb_iscsi  (ocf:heartbeat:portblock):       Stopped
    * ip0_iscsi (ocf:heartbeat:IPaddr2):         Stopped
    * ip1_iscsi (ocf:heartbeat:IPaddr2):         Stopped
    * iscsi_target      (ocf:heartbeat:iSCSITarget):     Stopped
    * iscsi_lun_drbd3   (ocf:heartbeat:iSCSILogicalUnit):        Stopped
    * iscsi_lun_drbd4   (ocf:heartbeat:iSCSILogicalUnit):        Stopped
    * punb_iscsi        (ocf:heartbeat:portblock):       Stopped


Failed Resource Actions:
  * ha-nfs_promote_0 on memverge2 'Error occurred' (1): call=133, status='Timed out', exitreason='Resource agent did not complete within 1m30s', last-rc-change='Sat Oct 18 09:06:04 2025', queued=0ms, exec=90738ms
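For reference, the 1m30s above is the configured timeout of the promote operation on ha-nfs. A minimal sketch of inspecting and raising it, and then clearing the failed action so the scheduler retries (pcs syntax; the 300s value is only an illustrative assumption, not a recommendation):

# Show the resource and its operation timeouts as currently configured
pcs resource config ha-nfs-clone

# Raise the promote timeout on the DRBD resource (example value)
pcs resource update ha-nfs op promote timeout=300s

# Clear the failed promote action
pcs resource cleanup ha-nfs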



[root@memverge2 ~]# cat /var/log/messages|grep -i ha-nfs_promote
Oct 18 09:06:04 memverge2 pacemaker-controld[2939]: notice: Initiating promote operation ha-nfs_promote_0 locally on memverge2
Oct 18 09:07:35 memverge2 pacemaker-controld[2939]: notice: Transition 803 action 7 (ha-nfs_promote_0 on memverge2): expected 'OK' but got 'Error occurred'
[root@memverge2 ~]#
[root@memverge2 ~]# cat /var/log/messages|grep -i ha-iscsi_promote
Oct 18 09:06:03 memverge2 pacemaker-controld[2939]: notice: Initiating promote operation ha-iscsi_promote_0 locally on memverge2
Oct 18 09:11:48 memverge2 pacemaker-controld[2939]: notice: Initiating promote operation ha-iscsi_promote_0 locally on memverge2

[root@memverge2 ~]# cat /var/log/pacemaker/pacemaker.log|grep -i ha-nfs_promote
Oct 18 09:06:04.547 memverge2 pacemaker-controld  [2939] (execute_rsc_action)   notice: Initiating promote operation ha-nfs_promote_0 locally on memverge2 | action 7
Oct 18 09:06:04.547 memverge2 pacemaker-controld  [2939] (do_lrm_rsc_op)        notice: Requesting local execution of promote operation for ha-nfs on memverge2 | transition_key=7:803:0:b9d3697a-e33b-4b53-bb46-35e2e80d0309 op_key=ha-nfs_promote_0
Oct 18 09:06:04.547 memverge2 pacemaker-based     [2934] (cib_perform_op)       info: +  /cib/status/node_state[@id='28']/lrm[@id='28']/lrm_resources/lrm_resource[@id='ha-nfs']/lrm_rsc_op[@id='ha-nfs_last_0']:  @operation_key=ha-nfs_promote_0, @operation=promote, @crm-debug-origin=controld_update_resource_history, @transition-key=7:803:0:b9d3697a-e33b-4b53-bb46-35e2e80d0309, @transition-magic=-1:193;7:803:0:b9d3697a-e33b-4b53-bb46-35e2e80d0309, @call-id=-1, @rc-code=193, @op-status=-1, @last-rc-change=1760767564, @exec-time=0
Oct 18 09:07:35.285 memverge2 pacemaker-execd     [2936] (async_action_complete)        info: Resource agent ha-nfs_promote_0[3534142] timed out after 1m30s
Oct 18 09:07:35.285 memverge2 pacemaker-controld  [2939] (log_executor_event)   error: Result of promote operation for ha-nfs on memverge2: Timed out after 1m30s (Resource agent did not complete within 1m30s) | graph action confirmed; call=133 key=ha-nfs_promote_0
Oct 18 09:07:35.286 memverge2 pacemaker-based     [2934] (cib_perform_op)       info: ++ /cib/status/node_state[@id='28']/lrm[@id='28']/lrm_resources/lrm_resource[@id='ha-nfs']:  <lrm_rsc_op id="ha-nfs_last_failure_0" operation_key="ha-nfs_promote_0" operation="promote" crm-debug-origin="controld_update_resource_history" crm_feature_set="3.20.1" transition-key="7:803:0:b9d3697a-e33b-4b53-bb46-35e2e80d0309" transition-magic="2:1;7:803:0:b9d3697a-e33b-4b53-bb46-35e2e80d0309" exit-reason="Resource agent did not complete within 1m30s" on_node="memverge2" call-id="133" rc-code="1" op-status="2" interval="0" last-rc-change="1760767564" exec-time="90738" queue-time="0" op-digest="876800cf6419330beb37dc5fd73e30fb"/>
Oct 18 09:07:35.286 memverge2 pacemaker-controld  [2939] (abort_transition_graph)       info: Transition 803 aborted by operation ha-nfs_promote_0 'modify' on memverge2: Event failed | magic=2:1;7:803:0:b9d3697a-e33b-4b53-bb46-35e2e80d0309 cib=0.4055.28 source=process_graph_event:559 complete=false
Oct 18 09:07:35.286 memverge2 pacemaker-controld  [2939] (process_graph_event)  notice: Transition 803 action 7 (ha-nfs_promote_0 on memverge2): expected 'OK' but got 'Error occurred' | target-rc=0 rc=1 call-id=133
[root@memverge2 ~]#

[root@memverge2 ~]# cat /var/log/pacemaker/pacemaker.log|grep -i ha-iscsi_promote
Oct 18 09:06:03.193 memverge2 pacemaker-controld  [2939] (execute_rsc_action)   notice: Initiating promote operation ha-iscsi_promote_0 locally on memverge2 | action 71
Oct 18 09:06:03.193 memverge2 pacemaker-controld  [2939] (do_lrm_rsc_op)        notice: Requesting local execution of promote operation for ha-iscsi on memverge2 | transition_key=71:801:0:b9d3697a-e33b-4b53-bb46-35e2e80d0309 op_key=ha-iscsi_promote_0
Oct 18 09:06:03.194 memverge2 pacemaker-based     [2934] (cib_perform_op)       info: +  /cib/status/node_state[@id='28']/lrm[@id='28']/lrm_resources/lrm_resource[@id='ha-iscsi']/lrm_rsc_op[@id='ha-iscsi_last_0']:  @operation_key=ha-iscsi_promote_0, @operation=promote, @crm-debug-origin=controld_update_resource_history, @transition-key=71:801:0:b9d3697a-e33b-4b53-bb46-35e2e80d0309, @transition-magic=-1:193;71:801:0:b9d3697a-e33b-4b53-bb46-35e2e80d0309, @call-id=-1, @rc-code=193, @op-status=-1, @last-rc-change=1760767563, @exec-time=0
Oct 18 09:06:03.382 memverge2 pacemaker-controld  [2939] (log_executor_event)   notice: Result of promote operation for ha-iscsi on memverge2: OK | graph action confirmed; call=125 key=ha-iscsi_promote_0 rc=0
Oct 18 09:06:03.384 memverge2 pacemaker-controld  [2939] (process_graph_event)  info: Transition 801 action 71 (ha-iscsi_promote_0 on memverge2) confirmed: OK | rc=0 call-id=125
Oct 18 09:11:48.123 memverge2 pacemaker-controld  [2939] (execute_rsc_action)   notice: Initiating promote operation ha-iscsi_promote_0 locally on memverge2 | action 37
Oct 18 09:11:48.123 memverge2 pacemaker-controld  [2939] (do_lrm_rsc_op)        notice: Requesting local execution of promote operation for ha-iscsi on memverge2 | transition_key=37:807:0:b9d3697a-e33b-4b53-bb46-35e2e80d0309 op_key=ha-iscsi_promote_0
Oct 18 09:11:48.124 memverge2 pacemaker-based     [2934] (cib_perform_op)       info: +  /cib/status/node_state[@id='28']/lrm[@id='28']/lrm_resources/lrm_resource[@id='ha-iscsi']/lrm_rsc_op[@id='ha-iscsi_last_0']:  @operation_key=ha-iscsi_promote_0, @operation=promote, @transition-key=37:807:0:b9d3697a-e33b-4b53-bb46-35e2e80d0309, @transition-magic=-1:193;37:807:0:b9d3697a-e33b-4b53-bb46-35e2e80d0309, @call-id=-1, @rc-code=193, @op-status=-1, @last-rc-change=1760767908, @exec-time=0
Oct 18 09:11:48.154 memverge2 pacemaker-controld  [2939] (log_executor_event)   notice: Result of promote operation for ha-iscsi on memverge2: OK | graph action confirmed; call=156 key=ha-iscsi_promote_0 rc=0
Oct 18 09:11:48.155 memverge2 pacemaker-controld  [2939] (process_graph_event)  info: Transition 807 action 37 (ha-iscsi_promote_0 on memverge2) confirmed: OK | rc=0 call-id=156
[root@memverge2 ~]#
[root@memverge2 ~]# drbdadm status
ha-iscsi role:Primary
  volume:3 disk:UpToDate open:no
  volume:4 disk:UpToDate open:no
  memverge connection:Connecting

ha-nfs role:Secondary
  volume:1 disk:Consistent open:no
  volume:2 disk:Consistent open:no
  volume:5 disk:Consistent open:no
  memverge connection:Connecting

[root@memverge2 ~]#
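For completeness, the per-peer connection state can also be queried directly with the drbd-utils commands below; connection:Connecting just means the peer has not (re)joined yet:

drbdadm cstate ha-nfs
drbdadm cstate ha-iscsi

# More detail per volume and peer
drbdsetup status ha-nfs --verbose --statistics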

Location Constraints:
  resource 'ipmi-fence-memverge' avoids node 'memverge' with score INFINITY (id: location-ipmi-fence-memverge-memverge--INFINITY)
  resource 'ipmi-fence-memverge2' avoids node 'memverge2' with score INFINITY (id: location-ipmi-fence-memverge2-memverge2--INFINITY)
  resource 'ha-iscsi-clone' (id: drbd-fence-by-handler-ha-iscsi-ha-iscsi-clone)
    Rules:
      Rule: role=Promoted score=-INFINITY (id: drbd-fence-by-handler-ha-iscsi-rule-ha-iscsi-clone)
        Expression: #uname ne memverge2 (id: drbd-fence-by-handler-ha-iscsi-expr-28-ha-iscsi-clone)
Colocation Constraints:
  Started resource 'g-nfs' with Promoted resource 'ha-nfs-clone' (id: colocation-g-nfs-ha-nfs-clone-INFINITY)
    score=INFINITY
  Started resource 'g-iscsi' with Promoted resource 'ha-iscsi-clone' (id: colocation-g-iscsi-ha-iscsi-clone-INFINITY)
    score=INFINITY
  Started resource 'g-nfs' with Started resource 'g-iscsi' (id: colocation-g-nfs-g-iscsi-INFINITY)
    score=INFINITY
Order Constraints:
  promote resource 'ha-nfs-clone' then start resource 'g-nfs' (id: order-ha-nfs-clone-g-nfs-mandatory)
  promote resource 'ha-iscsi-clone' then start resource 'g-iscsi' (id: order-ha-iscsi-clone-g-iscsi-mandatory)
  start resource 'g-nfs' then start resource 'g-iscsi' (id: order-g-nfs-g-iscsi-mandatory)
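As far as I understand, the drbd-fence-by-handler-ha-iscsi-ha-iscsi-clone constraint above is the one the DRBD fence-peer handler (crm-fence-peer.9.sh) creates while the peer is outdated; it restricts promotion of ha-iscsi-clone to memverge2 and is normally removed again by the unfence handler once the peers reconnect. It can be listed and, if stale, removed by id:

# List all constraints with their ids
pcs constraint --full

# Remove a leftover handler constraint manually
# (only once the DRBD peers are connected and UpToDate again)
pcs constraint delete drbd-fence-by-handler-ha-iscsi-ha-iscsi-clone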

Anton

[root@memverge2 ~]# drbdadm status
ha-iscsi role:Primary
  volume:3 disk:UpToDate open:no
  volume:4 disk:UpToDate open:no
  memverge connection:Connecting

ha-nfs role:Secondary
  volume:1 disk:Consistent open:no
  volume:2 disk:Consistent open:no
  volume:5 disk:Consistent open:no
  memverge connection:Connecting

Your DRBD resources are not connected. I don't see the status of the resources from memverge, but I suspect it's likely StandAlone there. You've probably split-brained the resources at some point in all your testing (see the DRBD 9.0 User's Guide from LINBIT).
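For reference, a split brain would show up as connection:StandAlone in drbdadm status, and the manual recovery described in the DRBD 9 User's Guide boils down to choosing a split-brain victim and discarding its data (the resource name below is a placeholder):

# On the node whose changes you are willing to discard (the victim):
drbdadm disconnect <resource>
drbdadm secondary <resource>
drbdadm connect --discard-my-data <resource>

# On the surviving node, if its connection is StandAlone:
drbdadm connect <resource>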


Hmm… I just cross-checked it again. There were no split-brained resources at the beginning of the test. It's the same behaviour as the week before.

Cluster has quorum, qdevice is temporarily unavailable. Both nodes are up and running; the resource groups are running on memverge.

[root@memverge ~]# drbdadm status
ha-iscsi role:Primary
  volume:3 disk:UpToDate open:yes
  volume:4 disk:UpToDate open:yes
  memverge2 role:Secondary
    volume:3 peer-disk:UpToDate
    volume:4 peer-disk:UpToDate

ha-nfs role:Primary
  volume:1 disk:UpToDate open:yes
  volume:2 disk:UpToDate open:yes
  volume:5 disk:UpToDate open:yes
  memverge2 role:Secondary
    volume:1 peer-disk:UpToDate
    volume:2 peer-disk:UpToDate
    volume:5 peer-disk:UpToDate

[root@memverge ~]#

Performed an OS update (dnf update, kernel 6.12.0-55.39.1 → 6.12.0-55.40.1) on memverge and rebooted. The boot is blocked at the LUKS2 encryption passphrase prompt.

The cluster is without quorum; all resource groups are stopped.

Failed Resource Actions:
  * ha-nfs_promote_0 on memverge2 'Error occurred' (1): call=113, status='Timed out', exitreason='Resource agent did not complete within 1m30s', last-rc-change='Sun Oct 26 09:45:01 2025', queued=0ms, exec=90152ms

[root@memverge2 ~]# drbdadm status
# No currently configured DRBD found.
[root@memverge2 ~]#

Qdevice is back online and the cluster has quorum again, but the resource groups are still stopped, with the following status:

[root@memverge2 ~]# drbdadm status
ha-iscsi role:Primary
  volume:3 disk:UpToDate open:no
  volume:4 disk:UpToDate open:no
  memverge connection:Connecting

ha-nfs role:Secondary
  volume:1 disk:Consistent open:no
  volume:2 disk:Consistent open:no
  volume:5 disk:Consistent open:no
  memverge connection:Connecting

[root@memverge2 ~]#
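One observation, in case it is relevant: ha-nfs on memverge2 shows disk:Consistent rather than disk:UpToDate while memverge is unreachable, and as far as I understand DRBD 9 refuses to promote a merely Consistent disk on its own, which would explain the promote hanging until Pacemaker's 1m30s timeout while ha-iscsi (UpToDate) promotes fine. The effective quorum/fencing options can be checked with:

# Dump the effective configuration for ha-nfs,
# including any quorum / fencing / auto-promote settings
drbdsetup show ha-nfs

# Current disk and connection state per volume and peer
drbdadm status ha-nfs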

Anton