I would first double-check that all nodes are listed as “online” according to Pacemaker. You should see a node list toward the start of the pcs status output, similar to the following:
Node List:
* Online: [ mds-01 mds-02 mds-03 ]
Pacemaker resources listed as stopped do not necessarily indicate an error; it just means Pacemaker is not running those resources at the moment.
Full List of Resources:
* Clone Set: p_drbd_lustre_meta-clone [p_drbd_lustre_meta] (promotable):
* Stopped: [ mds-01 mds-02 mds-03 ]
If there are errors, you should see messages under “failed actions” at the bottom of the pcs status output. Is there anything suspicious in the full output of pcs status?
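If nothing obvious shows up under failed actions, these commands can also surface fail counts and inactive resources (the resource name below is only an example, substitute your own):
# crm_mon -1rf
# pcs resource failcount show p_drbd_lustre_meta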
Thanks. Since I used the cluster as a dev platform, I decided to redo everything from scratch to make sure the entire setup is clean. But indeed, I should have posted all lines of the status; all three nodes were online. If I can reproduce this problem, I will follow up with complete information.
The plan is to build a ring of 3+ Lustre metadata servers with their local disks kept in sync. I am on the DigitalOcean cloud, where there is no storage that can be shared between multiple nodes unless you run your own NFS server. But NFS is slow and the EBS-style block storage volumes are slow, so I need to use instances with NVMe disks attached. And since there is only a single disk, I need to partition it to get a blank block device.
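For the partitioning step, the idea is something like the following (device path, partition number, and size are placeholders for illustration):
# lsblk
# sgdisk -n 2:0:+64G -t 2:8300 /dev/nvme0n1
# partprobe /dev/nvme0n1
That carves an unformatted partition out of the free space on the NVMe disk, which DRBD can then use as its backing device.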
My second attempt to build that hit the same problem:
[root@mds-01 ~]# pcs status
Cluster name: testfs
Cluster Summary:
* Stack: corosync (Pacemaker is running)
* Current DC: mds-03 (version 2.1.9-1.2.el9_6-49aab9983) - partition with quorum
* Last updated: Thu Jul 17 21:25:12 2025 on mds-01
* Last change: Thu Jul 17 21:22:16 2025 by root via root on mds-01
* 3 nodes configured
* 3 resource instances configured
Node List:
* Online: [ mds-01 mds-02 mds-03 ]
Full List of Resources:
* Clone Set: p_drbd_lustre_meta-clone [p_drbd_lustre_meta] (promotable):
* Stopped: [ mds-01 mds-02 mds-03 ]
Daemon Status:
corosync: active/disabled
pacemaker: active/disabled
pcsd: active/enabled
I tried a vanilla setup on the same Rocky Linux 9.4 LVM image, with the default kernel (no Lustre setup), following the basic “Clusters from Scratch” tutorial for 2 nodes, and I hit the same problem. The very basic IP address resource is not starting:
pcs status
Cluster name: mycluster
Cluster Summary:
* Stack: corosync (Pacemaker is running)
* Current DC: mds-02 (version 2.1.9-1.2.el9_6-49aab9983) - partition with quorum
* Last updated: Sat Jul 19 21:19:27 2025 on mds-01
* Last change: Sat Jul 19 21:16:34 2025 by root via root on mds-01
* 2 nodes configured
* 1 resource instance configured
Node List:
* Online: [ mds-01 mds-02 ]
Full List of Resources:
* ClusterIP (ocf:heartbeat:IPaddr2): Stopped
Daemon Status:
corosync: active/disabled
pacemaker: active/disabled
pcsd: active/enabled
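For reference, the ClusterIP resource was created as in the tutorial, roughly like this (the IP address and netmask are placeholders, not the real values):
# pcs resource create ClusterIP ocf:heartbeat:IPaddr2 ip=192.168.0.120 cidr_netmask=24 op monitor interval=30s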
Merely a guess, but do you perhaps still have STONITH enabled? A cluster with STONITH enabled will not start any resources until after the STONITH resources have started, and if none are configured, nothing will ever start. If you run the command below, do you see output like the following?
# crm_verify -L -V
error: unpack_resources: Resource start-up disabled since no STONITH resources have been defined
error: unpack_resources: Either configure some or disable STONITH with the stonith-enabled option
error: unpack_resources: NOTE: Clusters with shared data need STONITH to ensure data integrity
Errors found during check: config not valid
If STONITH is in fact disabled and things are still refusing to start, please feel free to paste the pcs config of your super simple cluster. It might be an issue with ordering or location constraints.
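For a throwaway test cluster with no fencing devices, you can check the property and temporarily disable it (do not do this on anything holding real data):
# pcs property config stonith-enabled
(older pcs versions use: pcs property show stonith-enabled)
# pcs property set stonith-enabled=false
# crm_verify -L -V
After that, crm_verify should come back clean and the cluster should start scheduling resources.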
Thanks. It was indeed a config line that, when removed, unlocked the first step - at least the cluster started attempting to run things. I believe it was the line ignoring lack of quorum, or something like that.
Once this got unlocked I was able to debug further and disable SELinux, which was blocking DRBD from running.
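In case it helps someone else, the SELinux workaround boils down to something like this (permissive mode is the quick fix; a proper policy module would be cleaner):
# ausearch -m avc -ts recent | grep -i drbd
# setenforce 0
# sed -i 's/^SELINUX=enforcing/SELINUX=permissive/' /etc/selinux/config
The ausearch line confirms the AVC denials against DRBD, setenforce changes the mode at runtime, and the config edit makes it persistent across reboots.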
So it finally worked, and I now have a functional HA Lustre setup.