DRBD/Pacemaker integration question

Hello

There is a two-node active/standby cluster based on Rocky Linux 9.6 and kmod-drbd9x-9.2.14-1.el9_6.elrepo.
There are two services (ha-nfs, ha-iscsi) which always run together on the same cluster node.
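
For reference, the "run together" requirement is expressed with Pacemaker colocation and ordering constraints roughly like the following. This is only a sketch: the group names nfs-group and iscsi-group stand in for my actual service groups, ha-iscsi-clone is assumed to be named like ha-nfs-clone, and the role keyword may be "master" or "Promoted" depending on the pcs version.

# colocate each service group with the promoted DRBD clone
pcs constraint colocation add nfs-group with Promoted ha-nfs-clone INFINITY
pcs constraint colocation add iscsi-group with Promoted ha-iscsi-clone INFINITY
# keep both service groups on the same node
pcs constraint colocation add iscsi-group with nfs-group INFINITY
# start the services only after the DRBD clones are promoted
pcs constraint order promote ha-nfs-clone then start nfs-group
pcs constraint order promote ha-iscsi-clone then start iscsi-group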

If I reboot the active cluster node memverge2, the ha-nfs resource successfully fails over to the standby node memverge, but the ha-iscsi resource fails.

Looking for the difference in the logs:

Resource ha-nfs:

Jun 7 09:50:13 memverge pacemaker-controld[2780]: notice: Requesting local execution of notify operation for ha-nfs on memverge
Jun 7 09:50:13 memverge pacemaker-controld[2780]: notice: Result of notify operation for ha-nfs on memverge: ok
Jun 7 09:50:13 memverge kernel: drbd ha-nfs: Preparing remote state change 1671163414: 28->all role( Secondary )
Jun 7 09:50:13 memverge kernel: drbd ha-nfs memverge2: Committing remote state change 1671163414 (primary_nodes=0)
Jun 7 09:50:13 memverge kernel: drbd ha-nfs memverge2: peer( Primary → Secondary ) [remote]
Jun 7 09:50:13 memverge kernel: drbd ha-nfs/29 drbd1: Enabling local AL-updates
Jun 7 09:50:13 memverge kernel: drbd ha-nfs/30 drbd2: Enabling local AL-updates
Jun 7 09:50:13 memverge pacemaker-controld[2780]: notice: Requesting local execution of notify operation for ha-nfs on memverge
Jun 7 09:50:13 memverge pacemaker-controld[2780]: notice: Result of notify operation for ha-nfs on memverge: ok
Jun 7 09:50:13 memverge pacemaker-controld[2780]: notice: Requesting local execution of notify operation for ha-nfs on memverge
Jun 7 09:50:13 memverge pacemaker-controld[2780]: notice: Result of notify operation for ha-nfs on memverge: ok
Jun 7 09:50:13 memverge kernel: drbd ha-nfs: Preparing remote state change 374911319: 28->27 conn( Disconnecting )
Jun 7 09:50:13 memverge kernel: drbd ha-nfs memverge2: Committing remote state change 374911319 (primary_nodes=0)
Jun 7 09:50:13 memverge kernel: drbd ha-nfs memverge2: conn( Connected → TearDown ) peer( Secondary → Unknown ) [remote]
Jun 7 09:50:13 memverge kernel: drbd ha-nfs/29 drbd1 memverge2: pdsk( UpToDate → DUnknown ) repl( Established → Off ) [remote]
Jun 7 09:50:13 memverge kernel: drbd ha-nfs/30 drbd2 memverge2: pdsk( UpToDate → DUnknown ) repl( Established → Off ) [remote]
Jun 7 09:50:13 memverge kernel: drbd ha-nfs memverge2: Terminating sender thread
Jun 7 09:50:13 memverge kernel: drbd ha-nfs memverge2: Starting sender thread (peer-node-id 28)
Jun 7 09:50:13 memverge kernel: drbd ha-nfs memverge2: Connection closed
Jun 7 09:50:13 memverge kernel: drbd ha-nfs memverge2: helper command: /sbin/drbdadm disconnected
Jun 7 09:50:13 memverge kernel: drbd ha-nfs memverge2: helper command: /sbin/drbdadm disconnected exit code 0
Jun 7 09:50:13 memverge kernel: drbd ha-nfs memverge2: conn( TearDown → Unconnected ) [disconnected]
Jun 7 09:50:13 memverge kernel: drbd ha-nfs memverge2: Restarting receiver thread
Jun 7 09:50:13 memverge kernel: drbd ha-nfs memverge2: conn( Unconnected → Connecting ) [connecting]
Jun 7 09:50:13 memverge pacemaker-attrd[2777]: notice: Setting master-ha-nfs[memverge2] in instance_attributes: 10000 → (unset)
Jun 7 09:50:13 memverge pacemaker-controld[2780]: notice: Requesting local execution of notify operation for ha-nfs on memverge
Jun 7 09:50:13 memverge pacemaker-attrd[2777]: notice: Setting master-ha-nfs[memverge] in instance_attributes: 10000 → 1000
Jun 7 09:50:13 memverge pacemaker-controld[2780]: notice: Result of notify operation for ha-nfs on memverge: ok
Jun 7 09:50:13 memverge pacemaker-controld[2780]: notice: Requesting local execution of notify operation for ha-nfs on memverge
Jun 7 09:50:14 memverge pacemaker-controld[2780]: notice: Result of notify operation for ha-nfs on memverge: ok
Jun 7 09:50:14 memverge pacemaker-controld[2780]: notice: Requesting local execution of promote operation for ha-nfs on memverge
Jun 7 09:50:14 memverge kernel: drbd ha-nfs memverge2: helper command: /sbin/drbdadm fence-peer
Jun 7 09:50:14 memverge crm-fence-peer.9.sh[4985]: DRBD_BACKING_DEV_29=/dev/block_nfs_vg/ha_nfs_internal_lv DRBD_BACKING_DEV_30=/dev/block_nfs_vg/ha_nfs_exports_lv DRBD_CONF=/etc/drbd.conf DRBD_CSTATE=Connecting DRBD_LL_DISK=/dev/block_nfs_vg/ha_nfs_internal_lv\ /dev/block_nfs_vg/ha_nfs_exports_lv DRBD_MINOR=1\ 2 DRBD_MINOR_29=1 DRBD_MINOR_30=2 DRBD_MY_ADDRESS=192.168.0.6 DRBD_MY_AF=ipv4 DRBD_MY_NODE_ID=27 DRBD_NODE_ID_27=memverge DRBD_NODE_ID_28=memverge2 DRBD_PEER_ADDRESS=192.168.0.8 DRBD_PEER_AF=ipv4 DRBD_PEER_NODE_ID=28 DRBD_RESOURCE=ha-nfs DRBD_VOLUME=29\ 30 UP_TO_DATE_NODES=0x08000000 /usr/lib/drbd/crm-fence-peer.9.sh
Jun 7 09:50:14 memverge crm-fence-peer.9.sh[4985]: INFO peers are reachable, my disk is UpToDate UpToDate: placed constraint 'drbd-fence-by-handler-ha-nfs-ha-nfs-clone'
Jun 7 09:50:14 memverge kernel: drbd ha-nfs memverge2: helper command: /sbin/drbdadm fence-peer exit code 4 (0x400)
Jun 7 09:50:14 memverge kernel: drbd ha-nfs memverge2: fence-peer helper returned 4 (peer was fenced)
Jun 7 09:50:14 memverge kernel: drbd ha-nfs/29 drbd1 memverge2: pdsk( DUnknown → Outdated ) [primary]
Jun 7 09:50:14 memverge kernel: drbd ha-nfs/30 drbd2 memverge2: pdsk( DUnknown → Outdated ) [primary]
Jun 7 09:50:14 memverge kernel: drbd ha-nfs: Preparing cluster-wide state change 700209656: 27->all role( Primary )
Jun 7 09:50:14 memverge kernel: drbd ha-nfs: Committing cluster-wide state change 700209656 (0ms)
Jun 7 09:50:14 memverge kernel: drbd ha-nfs: role( Secondary → Primary ) [primary]

Resource ha-iscsi: it looks like Pacemaker waits until node memverge2 has finished booting and then promotes the resource there:

Jun 7 09:50:13 memverge pacemaker-controld[2780]: notice: Requesting local execution of notify operation for ha-iscsi on memverge
Jun 7 09:50:13 memverge pacemaker-controld[2780]: notice: Result of notify operation for ha-iscsi on memverge: ok
Jun 7 09:50:13 memverge kernel: drbd ha-iscsi: Preparing remote state change 155406647: 28->all role( Secondary )
Jun 7 09:50:13 memverge kernel: drbd ha-iscsi memverge2: Committing remote state change 155406647 (primary_nodes=0)
Jun 7 09:50:13 memverge kernel: drbd ha-iscsi memverge2: peer( Primary → Secondary ) [remote]
Jun 7 09:50:13 memverge kernel: drbd ha-iscsi/31 drbd3: Enabling local AL-updates
Jun 7 09:50:13 memverge pacemaker-controld[2780]: notice: Requesting local execution of notify operation for ha-iscsi on memverge
Jun 7 09:50:13 memverge pacemaker-controld[2780]: notice: Result of notify operation for ha-iscsi on memverge: ok
Jun 7 09:50:13 memverge pacemaker-controld[2780]: notice: Requesting local execution of notify operation for ha-iscsi on memverge
Jun 7 09:50:13 memverge pacemaker-controld[2780]: notice: Result of notify operation for ha-iscsi on memverge: ok
Jun 7 09:50:13 memverge kernel: drbd ha-iscsi: Preparing remote state change 2424298786: 28->27 conn( Disconnecting )
Jun 7 09:50:13 memverge kernel: drbd ha-iscsi memverge2: Committing remote state change 2424298786 (primary_nodes=0)
Jun 7 09:50:13 memverge kernel: drbd ha-iscsi memverge2: conn( Connected → TearDown ) peer( Secondary → Unknown ) [remote]
Jun 7 09:50:13 memverge kernel: drbd ha-iscsi/31 drbd3 memverge2: pdsk( UpToDate → DUnknown ) repl( Established → Off ) [remote]
Jun 7 09:50:13 memverge kernel: drbd ha-iscsi memverge2: Terminating sender thread
Jun 7 09:50:13 memverge kernel: drbd ha-iscsi memverge2: Starting sender thread (peer-node-id 28)
Jun 7 09:50:13 memverge kernel: drbd ha-iscsi memverge2: Connection closed
Jun 7 09:50:13 memverge kernel: drbd ha-iscsi memverge2: helper command: /sbin/drbdadm disconnected
Jun 7 09:50:13 memverge kernel: drbd ha-iscsi memverge2: helper command: /sbin/drbdadm disconnected exit code 0
Jun 7 09:50:13 memverge kernel: drbd ha-iscsi memverge2: conn( TearDown → Unconnected ) [disconnected]
Jun 7 09:50:13 memverge kernel: drbd ha-iscsi memverge2: Restarting receiver thread
Jun 7 09:50:13 memverge kernel: drbd ha-iscsi memverge2: conn( Unconnected → Connecting ) [connecting]
Jun 7 09:50:13 memverge pacemaker-attrd[2777]: notice: Setting master-ha-iscsi[memverge2] in instance_attributes: 10000 → (unset)
Jun 7 09:50:13 memverge pacemaker-controld[2780]: notice: Requesting local execution of notify operation for ha-iscsi on memverge
Jun 7 09:50:13 memverge pacemaker-attrd[2777]: notice: Setting master-ha-iscsi[memverge] in instance_attributes: 10000 → 1000
Jun 7 09:50:13 memverge pacemaker-controld[2780]: notice: Result of notify operation for ha-iscsi on memverge: ok
Jun 7 09:53:25 memverge pacemaker-schedulerd[2779]: notice: Actions: Start ha-iscsi:1 ( memverge2 )
Jun 7 09:53:25 memverge pacemaker-controld[2780]: notice: Initiating monitor operation ha-iscsi:1_monitor_0 on memverge2
Jun 7 09:53:25 memverge pacemaker-controld[2780]: notice: Initiating notify operation ha-iscsi_pre_notify_start_0 locally on memverge
Jun 7 09:53:25 memverge pacemaker-controld[2780]: notice: Requesting local execution of notify operation for ha-iscsi on memverge
Jun 7 09:53:25 memverge pacemaker-controld[2780]: notice: Result of notify operation for ha-iscsi on memverge: ok
Jun 7 09:53:25 memverge pacemaker-controld[2780]: notice: Initiating start operation ha-iscsi:1_start_0 on memverge2
Jun 7 09:53:27 memverge kernel: drbd ha-iscsi memverge2: Handshake to peer 28 successful: Agreed network protocol version 122
Jun 7 09:53:27 memverge kernel: drbd ha-iscsi memverge2: Feature flags enabled on protocol level: 0x7f TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES RESYNC_DAGTAG
Jun 7 09:53:27 memverge kernel: drbd ha-iscsi: Preparing cluster-wide state change 1709502368: 27->28 role( Secondary ) conn( Connected )
Jun 7 09:53:27 memverge kernel: drbd ha-iscsi/31 drbd3 memverge2: drbd_sync_handshake:
Jun 7 09:53:27 memverge kernel: drbd ha-iscsi/31 drbd3 memverge2: self A0B1026CDF591CD6:0000000000000000:29A475E429B9542C:5C9280E42D30ABB6 bits:0 flags:120
Jun 7 09:53:27 memverge kernel: drbd ha-iscsi/31 drbd3 memverge2: peer A0B1026CDF591CD6:0000000000000000:66B01940CA59D348:CB1BE80494B0304E bits:0 flags:1020
Jun 7 09:53:27 memverge kernel: drbd ha-iscsi/31 drbd3 memverge2: uuid_compare()=no-sync by rule=lost-quorum
Jun 7 09:53:27 memverge kernel: drbd ha-iscsi: State change 1709502368: primary_nodes=0, weak_nodes=0
Jun 7 09:53:27 memverge kernel: drbd ha-iscsi: Committing cluster-wide state change 1709502368 (14ms)
Jun 7 09:53:27 memverge kernel: drbd ha-iscsi memverge2: conn( Connecting → Connected ) peer( Unknown → Secondary ) [connected]
Jun 7 09:53:27 memverge kernel: drbd ha-iscsi/31 drbd3 memverge2: pdsk( DUnknown → Consistent ) repl( Off → Established ) [connected]
Jun 7 09:53:27 memverge kernel: drbd ha-iscsi/31 drbd3 memverge2: cleared bm UUID and bitmap A0B1026CDF591CD6:0000000000000000:29A475E429B9542C:5C9280E42D30ABB6
Jun 7 09:53:27 memverge kernel: drbd ha-iscsi/31 drbd3 memverge2: pdsk( Consistent → UpToDate ) [peer-state]
Jun 7 09:53:27 memverge kernel: drbd ha-iscsi memverge2: helper command: /sbin/drbdadm unfence-peer
Jun 7 09:53:27 memverge kernel: drbd ha-iscsi memverge2: helper command: /sbin/drbdadm unfence-peer exit code 0
Jun 7 09:53:27 memverge pacemaker-attrd[2777]: notice: Setting master-ha-iscsi[memverge2] in instance_attributes: (unset) → 10000
Jun 7 09:53:27 memverge pacemaker-controld[2780]: notice: Transition 2 aborted by status-28-master-ha-iscsi doing create master-ha-iscsi=10000: Transient attribute change
Jun 7 09:53:27 memverge pacemaker-controld[2780]: notice: Initiating notify operation ha-iscsi_post_notify_start_0 locally on memverge
Jun 7 09:53:27 memverge pacemaker-controld[2780]: notice: Requesting local execution of notify operation for ha-iscsi on memverge
Jun 7 09:53:27 memverge pacemaker-controld[2780]: notice: Initiating notify operation ha-iscsi:1_post_notify_start_0 on memverge2
Jun 7 09:53:27 memverge pacemaker-controld[2780]: notice: Result of notify operation for ha-iscsi on memverge: ok
Jun 7 09:53:49 memverge pacemaker-schedulerd[2779]: notice: Actions: Promote ha-iscsi:0 ( Unpromoted → Promoted memverge2 )

Finally, drbdadm status shows:

[root@memverge anton]# drbdadm status
ha-iscsi role:Secondary
  volume:31 disk:UpToDate
  memverge2 role:Primary
    volume:31 peer-disk:UpToDate

ha-nfs role:Primary
  volume:29 disk:UpToDate
  volume:30 disk:UpToDate
  memverge2 role:Secondary
    volume:29 peer-disk:UpToDate
    volume:30 peer-disk:UpToDate

As a result, the ha-iscsi resource failed because it can’t be started on the same cluster node as the ha-nfs resource.

Any ideas why there is this difference in behavior between the ha-nfs and ha-iscsi resources?

Anton

I had to copy and paste the log lines out of this message and into a text editor to read them. Please use code blocks (“preformatted text”, the “</>” option in the toolbar) for logs and shell output in the future.

I suspect the reason the ha-iscsi resources didn’t fail over is going to be logged, but just isn’t in the snippets included. Was the reboot of memverge2 graceful? If so, did all the services stop cleanly? If so (or if it was a hard reboot), then there should be a place in the logs where memverge attempts to start ha-iscsi, but possibly fails.
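
For example, something along these lines on memverge should show whether a start or promote of ha-iscsi was ever attempted and how it ended (assuming rsyslog writes to /var/log/messages; the systemd journal works just as well):

grep -E '(start|promote) operation.*ha-iscsi' /var/log/messages
journalctl -u pacemaker | grep -E '(start|promote).*ha-iscsi'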

I suspect the reason here is going to be more Pacemaker related than DRBD related. As such, I would be paying closer attention to the Pacemaker logs. It might also be related to DRBD’s resource-level fencing (which interfaces with Pacemaker), so pay attention to the crm-fence-peer.9.sh log lines as well.
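
If resource-level fencing is involved, the thing to look for is the location constraint that crm-fence-peer.9.sh places in the CIB (your log already shows 'drbd-fence-by-handler-ha-nfs-ha-nfs-clone' being created). A rough way to inspect it, assuming pcs is in use:

# list all constraints with their IDs; fence constraints show up as drbd-fence-by-handler-*
pcs constraint --full
# or query the CIB directly
cibadmin --query --scope constraints | grep drbd-fence-by-handler
# crm-unfence-peer.9.sh normally removes the constraint again after resync;
# a leftover one can block promotion and may need removing by hand, e.g.
pcs constraint remove drbd-fence-by-handler-ha-nfs-ha-nfs-clone   # example id from the log above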


I had to copy and paste the log lines out of this message and into a text editor to read them.
Please use code blocks (“preformatted text”, the “</>” option in the toolbar) for logs and shell output in the future.

Ok, sorry about that.

Was the reboot of memverge2 graceful? If so, did all the services stop cleanly?
If so (or if it was a hard reboot), then there should be a place in the logs where memverge attempts to start ha-iscsi, but possibly fails.

I just typed the command “reboot” and pressed Enter.

For the ha-nfs resource there is this record in the logs:

memverge pacemaker-controld[2780]: notice: Requesting local execution of promote operation for ha-nfs on memverge

However, there is no such record for the ha-iscsi resource.

I decided to update the cluster to Rocky Linux 10.0 and Pacemaker 3.0.

If the issue persists, I’ll let you know.

Which is better: keeping two DRBD resource files (one for iSCSI and one for NFS), or keeping all DRBD resources in a single file?

Anton

Are you implying that the upgrade has resolved this issue?

It should make no difference, but the standard practice is a separate file for each resource.
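
For what it’s worth, the packaged /etc/drbd.conf already assumes that layout; a minimal sketch of the conventional structure (file names are just examples):

# /etc/drbd.conf as shipped by drbd-utils
include "drbd.d/global_common.conf";
include "drbd.d/*.res";

# then one file per resource, e.g.
#   /etc/drbd.d/ha-nfs.res    containing  resource ha-nfs   { ... }
#   /etc/drbd.d/ha-iscsi.res  containing  resource ha-iscsi { ... }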