When we use Linstor (Piraeus) as storage for Kubevirt, we are seeing a weakness specifically when cloning PVCs. If multiple new VMs are created that each perform a CSI-assisted clone of a golden image PVC, quorum-related messages start appearing in the Linstor controller: quorum status flaps between lost and regained for the duration of the cloning operations. If too many clones are triggered at once, quorum is lost permanently, and the entire storage cluster crashes and cannot be recovered.
We don’t see this problem when starting large numbers of VMs for which the VM disk PVCs already exist, so this doesn’t look like a general IOPS or performance issue. It only happens when performing cloning operations.
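For reference, the clones are plain CSI volume clones: each new VM disk is requested as a PVC that uses the golden image PVC as its dataSource, roughly like this (the names, storage class, and size here are just illustrative):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vm-disk-42              # illustrative name for the new VM disk
spec:
  storageClassName: linstor-r3  # illustrative LINSTOR storage class
  dataSource:
    kind: PersistentVolumeClaim
    name: golden-image          # the golden image PVC being cloned
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi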
Is there something that can be done about this?
By losing quorum, I assume you mean the DRBD resources are losing quorum? That would indicate some kind of overload of the cluster’s resources, i.e. one node, probably the one holding the “master” volume, is so busy that it can’t serve DRBD requests in time, so it appears to time out from the other nodes’ point of view.
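To confirm it really is DRBD quorum, you can watch the resource state while the clones are running, for example (the satellite pod name below is just a placeholder):

# LINSTOR view of all resources and their current state
kubectl exec deploy/linstor-controller -- linstor resource list

# DRBD-level view from one node's satellite pod; quorum changes are also
# logged by the kernel (dmesg) as quorum lost/regained messages
kubectl exec <linstor-satellite-pod> -- drbdsetup status --verbose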
For specific recommendations, we would need to look at some logs. One simple thing would be to decrease the number of clones created in parallel. You can tune the settings on the “csi-snapshotter” container with something like
apiVersion: piraeus.io/v1
kind: LinstorCluster
...
spec:
  csiController:
    podTemplate:
      spec:
        containers:
          - name: csi-snapshotter
            args:
              - --timeout=1m
              - --csi-address=$(ADDRESS)
              - --leader-election=true
              - --leader-election-namespace=$(NAMESPACE)
              - --worker-threads=10 # change this
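After applying this, you can check that the argument was actually picked up, e.g. (the namespace and deployment names below are the operator defaults, adjust if yours differ):

kubectl -n piraeus-datastore get deploy linstor-csi-controller \
  -o jsonpath='{.spec.template.spec.containers[?(@.name=="csi-snapshotter")].args}'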
Ok, do you know how many worker threads are created by default?
For reference, performing the same action on Portworx doesn’t cause any problems, which suggests it protects itself better against overload.
I can definitely recreate the scenario and provide you with logs, please tell me which logs you would like.
10 is the current default, which is actually rather low, so I am surprised that would cause any issues.
If you can reliably reproduce the issue, please create a LINSTOR SOS report: kubectl linstor sos-report download if you are using kubectl-linstor (GitHub - piraeusdatastore/kubectl-linstor: A plugin to control a LINSTOR cluster using kubectl), otherwise kubectl exec deploy/linstor-controller -- linstor sos-report create and then kubectl cp the file from the pod.
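Roughly, without the kubectl-linstor plugin, the sequence looks like this (the pod name and report path are placeholders; linstor sos-report create prints the actual path):

# create the report inside the controller pod; the command prints the
# path of the generated tarball
kubectl exec deploy/linstor-controller -- linstor sos-report create

# copy it out of the pod; substitute your controller pod name and the
# path printed above
kubectl cp <linstor-controller-pod>:<printed-report-path> ./sos-report.tar.gz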
Ok, yeah, I noticed that with Windows images in particular I could only clone 2-3 at a time without causing operational disruption. The quorum state would continuously flip between in-quorum and not-in-quorum, but at least never to the point that things permanently failed. With smaller Linux images I could clone about 5-7 at a time.
I will generate some logs when it’s busy but not fully failing, as well as when I ask it to clone 120 VMs (with the default number of worker threads), which causes the whole storage cluster to fail.