I would first double-check that all nodes are listed as “online” according to Pacemaker. You should see a node list toward the start of the pcs status output, similar to the following:
Node List:
* Online: [ mds-01 mds-02 mds-03 ]
Pacemaker resources listed as stopped do not necessarily indicate an error; it just means Pacemaker is not running those resources at the moment.
Full List of Resources:
* Clone Set: p_drbd_lustre_meta-clone [p_drbd_lustre_meta] (promotable):
* Stopped: [ mds-01 mds-02 mds-03 ]
If there are errors, you should see messages under “failed actions” at the bottom of the pcs status output. Is there anything suspicious in the full output of pcs status?
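If nothing obvious shows up under failed actions, these commands can also surface fail counts and inactive resources (the resource name below is only an example, substitute your own):
# crm_mon -1rf
# pcs resource failcount show p_drbd_lustre_meta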
Thanks. Since I used the cluster as a dev platform, I decided to redo everything from scratch to make sure the entire setup is clean. But indeed, I should have posted all lines of the status; all three nodes were online. If I can reproduce this problem, I will follow up with complete information.
The plan is to build a ring of 3+ Lustre metadata servers with their local disks kept in sync. I am on the DigitalOcean cloud, where there is no storage that can be shared between multiple nodes unless you run your own NFS server. But NFS is slow and the EBS-style block storage volumes are slow, so I need to use instances with NVMe disks attached. And since there is only a single disk, I need to partition it to get a blank block device.
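For the partitioning step, the idea is something like the following (device path, partition number, and size are placeholders for illustration):
# lsblk
# sgdisk -n 2:0:+64G -t 2:8300 /dev/nvme0n1
# partprobe /dev/nvme0n1
That carves an unformatted partition out of the free space on the NVMe disk, which DRBD can then use as its backing device.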
My second attempt to build that hit the same problem:
[root@mds-01 ~]# pcs status
Cluster name: testfs
Cluster Summary:
* Stack: corosync (Pacemaker is running)
* Current DC: mds-03 (version 2.1.9-1.2.el9_6-49aab9983) - partition with quorum
* Last updated: Thu Jul 17 21:25:12 2025 on mds-01
* Last change: Thu Jul 17 21:22:16 2025 by root via root on mds-01
* 3 nodes configured
* 3 resource instances configured
Node List:
* Online: [ mds-01 mds-02 mds-03 ]
Full List of Resources:
* Clone Set: p_drbd_lustre_meta-clone [p_drbd_lustre_meta] (promotable):
* Stopped: [ mds-01 mds-02 mds-03 ]
Daemon Status:
corosync: active/disabled
pacemaker: active/disabled
pcsd: active/enabled
I tried a vanilla setup on the same Rocky Linux 9.4 LVM image, with the default kernel (no Lustre setup), following the basic “Clusters from Scratch” tutorial for 2 nodes, and I hit the same problem. The very basic IP address resource is not starting:
pcs status
Cluster name: mycluster
Cluster Summary:
* Stack: corosync (Pacemaker is running)
* Current DC: mds-02 (version 2.1.9-1.2.el9_6-49aab9983) - partition with quorum
* Last updated: Sat Jul 19 21:19:27 2025 on mds-01
* Last change: Sat Jul 19 21:16:34 2025 by root via root on mds-01
* 2 nodes configured
* 1 resource instance configured
Node List:
* Online: [ mds-01 mds-02 ]
Full List of Resources:
* ClusterIP (ocf:heartbeat:IPaddr2): Stopped
Daemon Status:
corosync: active/disabled
pacemaker: active/disabled
pcsd: active/enabled
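For reference, the ClusterIP resource was created as in the tutorial, roughly like this (the IP address and netmask are placeholders, not the real values):
# pcs resource create ClusterIP ocf:heartbeat:IPaddr2 ip=192.168.0.120 cidr_netmask=24 op monitor interval=30s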
Merely a guess, but do you perhaps still have STONITH enabled? A cluster with STONITH enabled will not start any resources until after the STONITH resources have started, and if none are configured, nothing will ever start. If you run the command below, do you see output like the following?
# crm_verify -L -V
error: unpack_resources: Resource start-up disabled since no STONITH resources have been defined
error: unpack_resources: Either configure some or disable STONITH with the stonith-enabled option
error: unpack_resources: NOTE: Clusters with shared data need STONITH to ensure data integrity
Errors found during check: config not valid
If STONITH is in fact disabled and things are still refusing to start, please feel free to paste the pcs config of your super simple cluster. It might be an issue with ordering or location constraints.
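For a throwaway test cluster with no fencing devices, you can check the property and temporarily disable it (do not do this on anything holding real data):
# pcs property config stonith-enabled
(older pcs versions use: pcs property show stonith-enabled)
# pcs property set stonith-enabled=false
# crm_verify -L -V
After that, crm_verify should come back clean and the cluster should start scheduling resources.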
Thanks. It was indeed a config line that, when removed, unlocked the first step - at least the cluster started attempting to run things. I believe it was the line ignoring lack of quorum, or something like that.
Once this got unlocked I was able to debug further and disable SELinux, which was blocking DRBD from running.
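In case it helps someone else, the SELinux workaround boils down to something like this (permissive mode is the quick fix; a proper policy module would be cleaner):
# ausearch -m avc -ts recent | grep -i drbd
# setenforce 0
# sed -i 's/^SELINUX=enforcing/SELINUX=permissive/' /etc/selinux/config
The ausearch line confirms the AVC denials against DRBD, setenforce changes the mode at runtime, and the config edit makes it persistent across reboots.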
So it finally worked, and I now have a functional HA Lustre setup.