When we use Linstor (Piraeus) as storage for Kubevirt, we see a weakness specifically when cloning PVCs. If multiple new VMs are created that each perform a CSI-assisted clone of a golden image PVC, quorum messages start appearing in the Linstor controller. Quorum status flaps between lost and regained during the cloning operations. If too many clones are triggered at once, quorum is lost permanently, and the entire storage cluster crashes and cannot be recovered.
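For context, a CSI-assisted clone boils down to a new PVC whose `dataSource` points at the golden image PVC. Below is a minimal sketch of the kind of request involved. The helper `build_clone_pvc` and the names `vm-disk-N`, `golden-image`, and `linstor-csi`, as well as the size, are hypothetical placeholders and not our actual manifests.

```python
import json


def build_clone_pvc(name, source_pvc, namespace="default",
                    storage_class="linstor-csi", size="20Gi"):
    """Build a PVC manifest asking the CSI driver to clone source_pvc.

    All default values are illustrative assumptions. Kubernetes requires
    the source PVC to live in the same namespace and to use the same
    storage class as the clone.
    """
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "storageClassName": storage_class,
            "accessModes": ["ReadWriteOnce"],
            "resources": {"requests": {"storage": size}},
            # The dataSource reference is what makes this a clone request
            # instead of a request for an empty volume.
            "dataSource": {
                "kind": "PersistentVolumeClaim",
                "name": source_pvc,
            },
        },
    }


if __name__ == "__main__":
    # Several VMs created together means several of these requests
    # reach the CSI driver at roughly the same time.
    for i in range(3):
        manifest = build_clone_pvc(f"vm-disk-{i}", "golden-image")
        print(json.dumps(manifest, indent=2))
```

Creating many VMs at once results in many such requests against the same source PVC at the same time, which is the situation in which we see quorum flapping.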
We don’t see this problem when starting large numbers of VMs (for which the VM disk PVCs already exist), so this doesn’t appear to be related to IOPS performance. It only happens when performing cloning operations.
Is there something that can be done about this?