Quick DR & HA through new x-replicas-on-different feature

I was thrilled to learn about the upcoming x-replicas-on-different feature in LINSTOR. For those who might be unfamiliar:

About a year ago I opened an issue on the Piraeus operator repo in which I wrote:

I’ve been exploring multi-availability zone setups offered by various cloud providers, aiming to architect a robust DR solution for potential datacenter failures. With the Kubernetes control plane being HA in most configurations and resilient to a datacenter outage, I’m keen on ensuring a similar resilience for my storage layer.

My primary objective is to operate predominantly in one zone (datacenter) while maintaining an additional asynchronous replica for each resource definition in another zone. This setup would act as a safety net, enabling a swift switch to the standby zone with minimal RTO and RPO should the primary zone encounter issues. While the latency between AZs is generally low, I’m specifically looking for an asynchronous solution to ensure maximum performance in the primary zone without being impacted by any inter-zone communication delays. Additionally, even with low latency, the asynchronous setup provides a buffer against any unforeseen network anomalies between zones.

While the piraeus-ha-controller has been instrumental for quick failovers, its quorum-based scheduling poses challenges. Specifically, achieving quorum becomes problematic if the primary zone goes offline. Additionally, the current placement parameters make it challenging, if not impossible, to schedule X replicas in zone A and Y replicas in zone B.

I’ve come across setups using DRBD with Pacemaker and Booth for similar requirements. It got me wondering if we could have something akin to that but tailored for a single Kubernetes cluster environment. Perhaps an additional controller that could manage this.

My questions:

  1. With the new x-replicas-on-different feature it’s now possible to have two replicas in one datacenter (DC) and a third in a different DC. The title of the blog post by @kermat mentions “High Availability & Disaster Recovery”, but it doesn’t fully explain what happens in the event of a datacenter failure. Would we still need three datacenters to maintain quorum if one goes down?

  2. Can we configure asynchronous communication between datacenters based on placement parameters? As mentioned in my issue, I prefer to avoid synchronous communication between datacenters to prevent performance impacts due to inter-zone delays.

  3. Overall, is the scenario I described now achievable with this new feature? :innocent:

I think I just found the answer to the second question. It seems asynchronous communication between datacenters can be configured by defining a LinstorNodeConnection. For this to work properly I guess DRBD proxy would be required?
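Something like this is what I have in mind, if I am reading the CRD correctly. This is an untested sketch: the zone label, the NotSame selector operator and the DrbdOptions/Net/protocol property are my assumptions, so please check the LinstorNodeConnection reference before relying on it.

apiVersion: piraeus.io/v1
kind: LinstorNodeConnection
metadata:
  name: cross-datacenter-async
spec:
  selector:
    # Only match connections between nodes whose zone labels differ.
    - matchLabels:
        - key: topology.kubernetes.io/zone
          op: NotSame
  properties:
    # Ask DRBD to replicate asynchronously (protocol A) on those connections.
    - name: DrbdOptions/Net/protocol
      value: "A"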

I also just discovered this video on Geo-clustering with Pacemaker & DRBD Proxy by @Ryan. This is basically the vision I’m pursuing :star_struck:, but I would like to have it work with K8s & LINSTOR.

Hello, let me address one point after the other:

  1. " what happens in the event of a datacenter failure. Would we still need three datacenters to maintain quorum if one goes down?"

I am not sure if we are on the same page here. A datacenter is not a replica. With --x-replicas-on-different datacenter 2 you tell LINSTOR to place at most two replicas per datacenter. For example, if you have 3 nodes per datacenter, let us call them A1, A2 and A3 for datacenter A, B1, B2 and B3 for datacenter B, and so on, then spawning a new resource with --x-replicas-on-different datacenter 2 --place-count 3 will, for example, choose A1, A2 and B2 as the nodes that each hold a replica of your data.

Coming back to your question, “what happens in the event of a datacenter failure”: this depends on which datacenter fails. If datacenter B fails, you still have replicas A1 and A2 up and running, and therefore you still have quorum.
If datacenter A fails, 2 out of 3 peers would be offline / unavailable, which means you do not have quorum and therefore also cannot access the data (without manual intervention).

If we modify the example by increasing the peer count to 4 with --place-count 4, and we further assume the new, 4th replica is created on B3, you will additionally get a tiebreaking diskless resource (which we do not call a replica, since it is diskless) on, let us say, C1.
Now if either datacenter A or datacenter B goes offline, you would still have 3 out of 5 peers online, so you would keep quorum, since diskless resources also count in quorum voting.
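In client commands this could roughly look like the following (node and resource-group names are just examples, and “datacenter” is an auxiliary property you define yourself; double-check the exact option syntax with linstor resource-group create --help):

# Tag each satellite node with an auxiliary "datacenter" property:
linstor node set-property A1 Aux/datacenter A
linstor node set-property A2 Aux/datacenter A
linstor node set-property A3 Aux/datacenter A
linstor node set-property B1 Aux/datacenter B
linstor node set-property B2 Aux/datacenter B
linstor node set-property B3 Aux/datacenter B

# At most two replicas may share the same "datacenter" value, three replicas in total:
linstor resource-group create dr-rg --place-count 3 --x-replicas-on-different datacenter 2
linstor volume-group create dr-rg
linstor resource-group spawn-resources dr-rg res1 10G

# With --place-count 4 instead, you get two diskful replicas in each of the two chosen
# datacenters, plus the diskless tiebreaker described above in a third datacenter.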

  2. “Can we configure asynchronous communication between datacenters based on placement parameters?”

Yes, using NodeConnections you can achieve what you are looking for. We are also planning other, easier ways to configure this, but that requires some more development.
Although you can, you do not need DRBD Proxy for this. DRBD itself (without Proxy) can be configured with the synchronous protocol “C” or the asynchronous protocol “A” (there is also “B”, but that is more of a special case and usually not what you are looking for). You can configure LINSTOR to tell DRBD to use protocol “A” between cross-datacenter nodes.
Proxy, on the other hand, would give you larger buffers, but for a first attempt I would just try protocol A and see if it already fits your needs.

I would also like to point out here that DRBD Proxy is a proprietary product; you will need a subscription to use it. The different protocols, on the other hand, are included in DRBD itself, which is another reason for me to suggest testing with protocol “A” first. If that is not good enough for you, feel free to get in contact with us about the details of using DRBD Proxy.

  3. “Overall, is the scenario I described now achievable with this new feature?”

Well, yes, but there are some things to consider here. Both protocol A and, especially, Proxy will not guarantee that the remote replica has received (and applied) the latest writes from the application.

Thank you for the detailed response, @ghernadi. Your comments seem to align with my initial assumptions.

Well, yes, but there are some things to consider here. Both protocol A and, especially, Proxy will not guarantee that the remote replica has received (and applied) the latest writes from the application.

For our use case, this is acceptable. As long as the data remains uncorrupted—meaning, for instance, that a MySQL database can start without issues—a small amount of data loss is preferable to extended downtime. Additionally, we plan for the failover to Datacenter B to be a manual operation.


I’ve sketched a simple diagram of what I think --x-replicas-on-different datacenter 2 --place-count 4 would look like.

In the diagram above I’ve modelled three datacenters:

  • Datacenter A: This is the primary datacenter where all workloads run during normal operations.
  • Datacenter B: This serves as the disaster recovery site if Datacenter A goes down.
  • Datacenter C: This acts solely as an arbitrator to maintain quorum in case either Datacenter A or B fails.

My plan is to have all nodes in Datacenters B and C either tainted or cordoned to prevent workloads from being scheduled there during normal operations. In the event that Datacenter A goes down, we can uncordon the nodes in Datacenter B, allowing workloads to migrate there. This was the first approach that came to mind, but I’m interested to know if you or others have alternative suggestions on how to ensure that Datacenter A functions as the primary during normal operations.
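For example (a rough sketch, with node names and the taint key invented purely for illustration):

# Keep ordinary workloads out of the standby and arbitrator datacenters:
kubectl taint nodes kube-b1 kube-b2 kube-b3 example.com/dr-standby=true:NoSchedule
kubectl taint nodes kube-c1 example.com/dr-standby=true:NoSchedule

# During a failover, remove the taint from the Datacenter B nodes so workloads can move there:
kubectl taint nodes kube-b1 kube-b2 kube-b3 example.com/dr-standby-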

Based on the above design, some follow-up questions come to mind:

  1. Is it possible to use Datacenter C exclusively as an arbitrator, ensuring that the diskful replicas are always placed in Datacenters A and B?
  2. Is it correct that the tiebreaker or diskless resource doesn’t actually receive any data and is used only for maintaining quorum? If so, does the higher latency to Datacenter C have any effect on its function?

The blog was written generically for LINSTOR, and I’ve not personally attempted to stretch a Kubernetes cluster like this (yet), so do keep that in mind :sweat_smile:

That sounds like a solid plan to me. You can add/override tolerations to the operator-deployed podTemplates as mentioned in the docs here.
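For the satellites that could roughly look like this (a sketch using the taint key from your sketch above; the satellite pods must keep running on the tainted standby/arbitrator nodes so replication and quorum voting continue):

apiVersion: piraeus.io/v1
kind: LinstorSatelliteConfiguration
metadata:
  name: tolerate-dr-standby
spec:
  podTemplate:
    spec:
      tolerations:
        # Hypothetical taint key from the plan above.
        - key: example.com/dr-standby
          operator: Exists
          effect: NoSchedule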

Yes, you could do this, but I think you’ll need to do some “manual” configurations from within the LINSTOR controller pod. Essentially, you’ll only create the diskful storage-pool on the nodes in DC A and B, then create a new (non-default) diskless storage pool in DC C, and then define that diskless storage-pool in the StorageClass definition.

That sounds straightforward, but the CSI driver doesn’t support all the parameters needed yet. So currently it’s a mix of command line configured options and StorageClass defined options. I’ve opened an issue internally to add the xReplicasOnDifferent StorageClass parameter.

LINSTOR ==> storage-pool create diskless kube-0 arbitrator-pool
SUCCESS:
Description:
    New storage pool 'arbitrator-pool' on node 'kube-0' registered.
Details:
    Storage pool 'arbitrator-pool' on node 'kube-0' UUID is: 494e6cff-458b-4b79-b538-6b39b3d14173
SUCCESS:
    (kube-0) Changes applied to storage pool 'arbitrator-pool'

and

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: "linstor-csi-lvm-thin-r2"
provisioner: linstor.csi.linbit.com
parameters:
  autoPlace: "2"
  storagePool: "lvm-thin"
  disklessStoragePool: "arbitrator-pool"
reclaimPolicy: Delete
allowVolumeExpansion: true

But then you also need to set the aux props on the nodes and set the --x-replicas-on-different value on the resource-group. Another issue: the resource-group doesn’t get created until the first LINSTOR volume is provisioned from it.
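Roughly, from within the LINSTOR controller pod, something like this (a sketch: the resource-group name is whatever the CSI driver created for your StorageClass, so check linstor resource-group list first, and verify the exact option name with linstor resource-group modify --help):

# Tag the nodes with a "datacenter" aux property (kube-0 is the arbitrator-only node):
linstor node set-property kube-1 Aux/datacenter A
linstor node set-property kube-2 Aux/datacenter B
linstor node set-property kube-0 Aux/datacenter C

# After the first PVC is provisioned, adjust the resource-group the CSI driver created:
linstor resource-group modify --x-replicas-on-different datacenter 2 <csi-resource-group>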

So kind of messy… something I’ll look at more closely next week, I’m sure.

That’s correct.


Thank you for your input, @kermat. If I’m not mistaken, it’s currently not possible to create a diskful storage pool exclusively for Datacenters A and B using Kubernetes manifests, right?

Implementing such a design on a production cluster might be premature at this stage. However, it’s a setup we would be eager to adopt if it becomes feasible in the future.

Looking forward to any updates or suggestions you might have!

You can do that, actually, following the example in the Piraeus GitHub repo.

Label your nodes:

root@kube-0:~# kubectl get nodes
NAME     STATUS   ROLES           AGE    VERSION
kube-0   Ready    <none>          7d5h   v1.28.14
kube-1   Ready    <none>          7d5h   v1.28.14
kube-2   Ready    <none>          7d5h   v1.28.14

root@kube-0:~# kubectl label node kube-1 piraeus.io/storageNode=yes
node/kube-1 labeled

root@kube-0:~# kubectl label node kube-2 piraeus.io/storageNode=yes
node/kube-2 labeled

Then use that label as a nodeSelector in your LinstorSatelliteConfiguration:

apiVersion: piraeus.io/v1
kind: LinstorSatelliteConfiguration
metadata:
  name: diskful-storage-satellites
spec:
  nodeSelector:
    piraeus.io/storageNode: "yes" 
  storagePools:
    - name: diskful-pool
      lvmThinPool:
        volumeGroup: drbdpool
        thinPool: diskful

I had created the thin LVM pool on my nodes already, but you could just as easily list some physical devices to have LINSTOR do that for you.
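If the storage pool source fields work the way I expect, something along these lines should also let LINSTOR prepare the backing device itself instead of you pre-creating the thin pool (untested sketch; /dev/vdb is just a placeholder device):

apiVersion: piraeus.io/v1
kind: LinstorSatelliteConfiguration
metadata:
  name: diskful-storage-satellites
spec:
  nodeSelector:
    piraeus.io/storageNode: "yes"
  storagePools:
    - name: diskful-pool
      lvmThinPool: {}
      source:
        hostDevices:
          # LINSTOR creates the volume group and thin pool on this device.
          - /dev/vdb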