Quick DR & HA through new x-replicas-on-different feature

I was thrilled to learn about the upcoming x-replicas-on-different feature in LINSTOR. For those who might be unfamiliar:

About a year ago I opened an issue on the Piraeus operator repo in which I wrote:

I’ve been exploring multi-availability zone setups offered by various cloud providers, aiming to architect a robust DR solution for potential datacenter failures. With the Kubernetes control plane being HA in most configurations and resilient to a datacenter outage, I’m keen on ensuring a similar resilience for my storage layer.

My primary objective is to operate predominantly in one zone (datacenter) while maintaining an additional asynchronous replica for each resource definition in another zone. This setup would act as a safety net, enabling a swift switch to the standby zone with minimal RTO and RPO should the primary zone encounter issues. While the latency between AZs is generally low, I’m specifically looking for an asynchronous solution to ensure maximum performance in the primary zone without being impacted by any inter-zone communication delays. Additionally, even with low latency, the asynchronous setup provides a buffer against any unforeseen network anomalies between zones.

While the piraeus-ha-controller has been instrumental for quick failovers, its quorum-based scheduling poses challenges. Specifically, achieving quorum becomes problematic if the primary zone goes offline. Additionally, the current placement parameters make it challenging, if not impossible, to schedule X replicas in zone A and Y replicas in zone B.

I’ve come across setups using DRBD with Pacemaker and Booth for similar requirements. It got me wondering if we could have something akin to that but tailored for a single Kubernetes cluster environment. Perhaps an additional controller that could manage this.

My questions:

  1. With the new x-replicas-on-different feature it’s now possible to have two replicas in one datacenter (DC) and a third in a different DC. The blog post’s title by @kermat mentions “High Availability & Disaster Recovery”, but it doesn’t fully explain what happens in the event of a datacenter failure. Would we still need three datacenters to maintain quorum if one goes down?

  2. Can we configure asynchronous communication between datacenters based on placement parameters? As mentioned in my issue, I prefer to avoid synchronous communication between datacenters to prevent performance impacts due to inter-zone delays.

  3. Overall, is the scenario I described now achievable with this new feature? :innocent:

I think I just found the answer to the second question. It seems asynchronous communication between datacenters can be configured by defining a LinstorNodeConnection. I guess DRBD Proxy would be required for this to work properly?

I also just discovered this video on Geo-clustering with Pacemaker & DRBD Proxy by @Ryan. This is basically the vision I’m pursuing :star_struck:, but I would like to have it work with K8s & LINSTOR I guess.

Hello, let me address one point after the other:

  1. " what happens in the event of a datacenter failure. Would we still need three datacenters to maintain quorum if one goes down?"

I am not sure if we are on the same page here. A datacenter is not a replica. With --x-replicas-on-different datacenter 2 you tell LINSTOR to place at most two replicas per datacenter. For example, if you have 3 nodes per datacenter (let us call them A1, A2 and A3 for datacenter A, B1, B2 and B3 for datacenter B, etc.), spawning a new resource with --x-replicas-on-different datacenter 2 --place-count 3 will for example choose A1, A2 and B2 as the nodes where each node has a replica of your data.
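In CLI terms, a minimal sketch of that example could look like the following. The resource-group name and volume size are made up, and the --x-replicas-on-different syntax is mirrored from this discussion, so double-check it against your LINSTOR client version:

# nodes are assumed to already carry an auxiliary property "datacenter" (A, B, ...)
# place 3 replicas in total, at most 2 of them within the same datacenter
linstor resource-group create geo-rg --place-count 3 --x-replicas-on-different datacenter 2

# spawn a resource (and its volume) from that resource group
linstor resource-group spawn-resources geo-rg demo-res 10G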

Coming back to your question “what happens in the event of a datacenter failure”. This depends on what datacenter fails. If datacenter B fails, you still have replicas A1 and A2 up and running, and therefore you still have quorum.
If datacenter A fails, 2 out of 3 peers would be offline / unavailable, which means you do not have quorum and therefore cannot access the data (without manual intervention).

If we modify the example by increasing the peers to 4 with --place-count 4, and we further assume the new, 4th replica is created on B3, you will additionally get a tiebreaking diskless resource (which we do not call a replica, since it is diskless) on, let’s say, C1.
Now if either datacenter A or datacenter B goes offline, you would still have 3 out of 5 peers online, so you would keep quorum since diskless resources also count in quorum-voting.

  2. “Can we configure asynchronous communication between datacenters based on placement parameters?”

Yes, using NodeConnections you can achieve what you are looking for. We are also planning other, easier ways to configure this, but that requires some more development.
You can use DRBD Proxy for this, but you do not need to. DRBD itself (without Proxy) can be configured with the synchronous protocol “C” or the asynchronous protocol “A” (there is also “B”, but that is more of a special case and usually not what you are looking for). You can configure LINSTOR to tell DRBD to use protocol “A” between cross-datacenter nodes.
Proxy, on the other hand, would give you larger buffers, but for a first attempt I would just try protocol A and see if it already fits your needs.
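As a rough sketch, that would be a property on the node connection between cross-DC nodes (node names from the example above; the subcommand and property key here are from memory, so please verify against your LINSTOR version):

# force asynchronous protocol A on the connection between two cross-DC nodes
linstor node-connection set-property A1 B2 DrbdOptions/Net/protocol A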

I would also like to point out here that DRBD Proxy is a proprietary product. You will need a subscription to use it. The different protocols, on the other hand, are included in DRBD itself, which is another reason for me to suggest first testing with protocol “A”. If that is not good enough for you, feel free to get in contact with us about the details of using DRBD Proxy.

  3. “Overall, is the scenario I described now achievable with this new feature?”

Well, yes, but there are some things to consider here. Both protocol A and especially Proxy will not guarantee that your peers have received (and applied) the latest writes from the application.

Thank you for the detailed response, @ghernadi. Your comments seem to align with my initial assumptions.

“Well, yes, but there are some things to consider here. Both protocol A and especially Proxy will not guarantee that your peers have received (and applied) the latest writes from the application.”

For our use case, this is acceptable. As long as the data remains uncorrupted—meaning, for instance, that a MySQL database can start without issues—a small amount of data loss is preferable to extended downtime. Additionally, we plan for the failover to Datacenter B to be a manual operation.


I’ve sketched a simple diagram of what I think --x-replicas-on-different datacenter 2 --place-count 4 would look like.

In the diagram above I’ve modelled three datacenters:

  • Datacenter A: This is the primary datacenter where all workloads run during normal operations.
  • Datacenter B: This serves as the disaster recovery site if Datacenter A goes down.
  • Datacenter C: This acts solely as an arbitrator to maintain quorum in case either Datacenter A or B fails.

My plan is to have all nodes in Datacenters B and C either tainted or cordoned to prevent workloads from being scheduled there during normal operations. In the event that Datacenter A goes down, we can uncordon the nodes in Datacenter B, allowing workloads to migrate there. This was the first approach that came to mind, but I’m interested to know if you or others have alternative suggestions on how to ensure that Datacenter A functions as the primary during normal operations.
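For completeness, the cordon/taint approach I have in mind would be something like this (node and taint names are hypothetical):

# keep workloads off the DR and arbitrator nodes during normal operations
kubectl taint nodes dc-b-node-1 datacenter=standby:NoSchedule
kubectl cordon dc-b-node-1

# after Datacenter A fails, allow scheduling in Datacenter B again
kubectl uncordon dc-b-node-1
kubectl taint nodes dc-b-node-1 datacenter=standby:NoSchedule-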

Based on the above design, some follow-up questions come to mind:

  1. Is it possible to use Datacenter C exclusively as an arbitrator, ensuring that the diskful replicas are always placed in Datacenters A and B?
  2. Is it correct that the tiebreaker or diskless resource doesn’t actually receive any data and is used only for maintaining quorum? If so, does the higher latency to Datacenter C have any effect on its function?

The blog was written generically for LINSTOR, and I’ve not personally attempted to stretch a Kubernetes cluster like this (yet), so do keep that in mind :sweat_smile:

That sounds like a solid plan to me. You can add/override tolerations in the operator-deployed podTemplates as mentioned in the docs here.
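Something like this should work as a starting point, assuming a taint such as datacenter=standby:NoSchedule on the standby nodes (the values here are hypothetical; see the docs for the exact podTemplate semantics):

apiVersion: piraeus.io/v1
kind: LinstorSatelliteConfiguration
metadata:
  name: satellite-tolerations
spec:
  podTemplate:
    spec:
      tolerations:
        # let satellite pods keep running on the tainted standby nodes
        - key: datacenter
          operator: Equal
          value: standby
          effect: NoSchedule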

Yes, you could do this, but I think you’ll need to do some “manual” configurations from within the LINSTOR controller pod. Essentially, you’ll only create the diskful storage-pool on the nodes in DC A and B, then create a new (non-default) diskless storage pool in DC C, and then define that diskless storage-pool in the StorageClass definition.

That sounds straightforward, but the CSI driver doesn’t support all the parameters needed yet. So currently it’s a mix of command line configured options and StorageClass defined options. I’ve opened an issue internally to add the xReplicasOnDifferent StorageClass parameter.

LINSTOR ==> storage-pool create diskless kube-0 arbitrator-pool
SUCCESS:
Description:
    New storage pool 'arbitrator-pool' on node 'kube-0' registered.
Details:
    Storage pool 'arbitrator-pool' on node 'kube-0' UUID is: 494e6cff-458b-4b79-b538-6b39b3d14173
SUCCESS:
    (kube-0) Changes applied to storage pool 'arbitrator-pool'

and

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: "linstor-csi-lvm-thin-r2"
provisioner: linstor.csi.linbit.com
parameters:
  autoPlace: "2"
  storagePool: "lvm-thin"
  disklessStoragePool: "arbitrator-pool"
reclaimPolicy: Delete
allowVolumeExpansion: true

but then you also need to set the aux props on the nodes and set the --x-replicas-on-different value on the resource-group. Another issue: the resource-group doesn’t get created until the first LINSTOR volume is provisioned from it.
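Roughly, those manual steps from inside the controller pod would look like this (the aux property name "datacenter" is just an example, and I’m assuming resource-group modify accepts the same --x-replicas-on-different flag):

# tag each satellite node with its datacenter
linstor node set-property --aux kube-1 datacenter A
linstor node set-property --aux kube-2 datacenter B

# constrain the resource group once it exists
linstor resource-group modify lvm-thin --x-replicas-on-different datacenter 2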

So kind of messy… something I’ll look at closer next week I’m sure.

That’s correct.

1 Like

Thank you for your input, @kermat. If I’m not mistaken, it’s currently not possible to create a diskful storage pool exclusively for Datacenters A and B using Kubernetes manifests, right?

Implementing such a design on a production cluster might be premature at this stage. However, it’s a setup we would be eager to adopt if it becomes feasible in the future.

Looking forward to any updates or suggestions you might have!

You can do that, actually, following the example in the Piraeus GitHub repo.

Label your nodes:

root@kube-0:~# kubectl get nodes
NAME     STATUS   ROLES           AGE    VERSION
kube-0   Ready    <none>          7d5h   v1.28.14
kube-1   Ready    <none>          7d5h   v1.28.14
kube-2   Ready    <none>          7d5h   v1.28.14

root@kube-0:~# kubectl label node kube-1 piraeus.io/storageNode=yes
node/kube-1 labeled

root@kube-0:~# kubectl label node kube-2 piraeus.io/storageNode=yes
node/kube-2 labeled

Then use that label as a nodeSelector in your LinstorSatelliteConfiguration:

apiVersion: piraeus.io/v1
kind: LinstorSatelliteConfiguration
metadata:
  name: diskful-storage-satellites
spec:
  nodeSelector:
    piraeus.io/storageNode: "yes" 
  storagePools:
    - name: diskful-pool
      lvmThinPool:
        volumeGroup: drbdpool
        thinPool: diskful

I had created the thin LVM pool on my nodes already, but you could just as easily list some physical devices to have LINSTOR do that for you.
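The device-based variant would look roughly like this; here /dev/sdb is a placeholder for a blank device you want LINSTOR to prepare:

apiVersion: piraeus.io/v1
kind: LinstorSatelliteConfiguration
metadata:
  name: diskful-storage-satellites
spec:
  nodeSelector:
    piraeus.io/storageNode: "yes"
  storagePools:
    - name: diskful-pool
      lvmThinPool: {}
      # hand LINSTOR a raw device instead of a pre-created VG/thin pool
      source:
        hostDevices:
          - /dev/sdb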

1 Like

Now that the xReplicasOnDifferent feature is out in the latest operator release, I figured it was time to revisit this multi-datacenter DRBD setup and share my findings. I hope others experimenting with this feature can also find these insights useful!

Test Environment

  • Cluster Topology: A “stretched” cluster across two main datacenters (DC A & DC B) plus a third “tiebreaker” datacenter (DC C).

  • Hardware: Five CCX33 Hetzner cloud instances.

    • CPU: 8 Dedicated cores
    • RAM: 32GB
    • Network: ~7.05 Gbit/s between nodes (tested with iPerf3)
  • Storage Backend: LVM Thin

  • Workload:

    • A single MariaDB:11.4 Pod, with an attached DRBD-backed PVC.
    • Sysbench (oltp_write_only) running in a separate Job, always scheduled on the same node as the database pod.

Benchmark command

# parameters
sysbench \
 --table-size=100000 \
 --tables=20 \
 --threads=64 \
 --events=100000 \
 --time=5000 \
 oltp_write_only run

Time to start testing! :test_tube: :lab_coat:

First Test: Baseline (2 Replicas, Same DC)

Goal: Get a baseline measurement by having only one extra replica in the same datacenter.

StorageClass:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: lvm-thin
provisioner: linstor.csi.linbit.com
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer
parameters:
  linstor.csi.linbit.com/storagePool: vg1-thin #LVM Thin
  linstor.csi.linbit.com/placementCount: "2"
  linstor.csi.linbit.com/resourceGroup: "lvm-thin"
  linstor.csi.linbit.com/allowRemoteVolumeAccess: "false"

Results:

General statistics:
    total time:                          19.8495s
    total number of events:              100000

Latency (ms):
         min:                                    1.25
         avg:                                   12.69
         max:                                  851.76
         95th percentile:                       33.72
         sum:                              1268680.55

Performance: Around 19.85s total time. This serves as the baseline for comparison. I ran the benchmark multiple times, and the results were consistent with little variance.

Second Test: 4 Replicas (2 in Each DC) - Protocol C (Sync)

Goal: Measure the impact of fully synchronous replication (Protocol C) across two datacenters. This setup also includes a diskless tiebreaker node in the third DC.

StorageClass:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: high-available
provisioner: linstor.csi.linbit.com
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer
parameters:
  linstor.csi.linbit.com/storagePool: vg1-thin #LVM Thin
  linstor.csi.linbit.com/placementCount: "4"
  linstor.csi.linbit.com/resourceGroup: "high-available"
  linstor.csi.linbit.com/allowRemoteVolumeAccess: "false"
  linstor.csi.linbit.com/xReplicasOnDifferent: |
    topology.kubernetes.io/zone: 2

Results:

General statistics:
    total time:                          31.4133s
    total number of events:              100000

Latency (ms):
         min:                                    4.38
         avg:                                   20.08
         max:                                 1227.20
         95th percentile:                       50.11
         sum:                              2008262.40

Performance: About 31.41s total time, roughly 58% slower than the baseline.

Third Test: 4 Replicas (2 in Each DC) - Protocol A (Async)

Goal: Same setup as the previous test, but this time comparing performance using asynchronous replication between the two primary datacenters. I applied the following LinstorNodeConnection to achieve this:

apiVersion: piraeus.io/v1
kind: LinstorNodeConnection
metadata:
  name: selector
spec:
  selector:
    - matchLabels:
        - key: topology.kubernetes.io/region
          op: NotSame
  properties:
    - name: DrbdOptions/Net/protocol
      value: A

Results:

General statistics:
    total time:                          27.3795s
    total number of events:              100000

Latency (ms):
         min:                                    1.29
         avg:                                   17.50
         max:                                 1155.95
         95th percentile:                       49.21
         sum:                              1750079.59

Performance: ~27.38s total time—around 38% slower than the baseline.

Tweaking DRBD options like --al-extents=65534 --sndbuf-size=0 didn’t significantly shift the results.

Thoughts & Questions

I was hoping to see performance closer to the baseline when using asynchronous replication between the DCs, but there’s still a noticeable slowdown. I guess some overhead is expected when adding more replicas, but I’m curious whether there are any tuning parameters or best practices that could help close this gap further.

DRBD Proxy?
Could DRBD Proxy bring performance closer to the baseline, or is some performance hit inevitable as soon as multiple replicas (even asynchronous ones) are involved?

Methodology & Setup
I tried to keep each test consistent: same Sysbench parameters, same node scheduling, and so on. But if you see anything that should be adjusted, let me know. If there’s any additional data or context you’d like me to provide, feel free to ask!

Final Thoughts
In the end, I’m feeling a tiny bit disappointed. Though the performance drop isn’t catastrophic, I was hoping to have all my cake and eat it too! :birthday:

@kermat, @ghernadi, I’d love to hear your insights on whether I might be missing any optimizations or if this is just the expected trade-off. Also, any guidance on diagnosing where the real bottleneck lies would be super helpful!

Either way, this was a lot of fun to test, and I’m looking forward to hearing your thoughts!

1 Like

Very cool, thank you for testing and reporting!

Unfortunately, DRBD tuning cannot change the speed of light :sweat_smile:

Synchronous replication requires replicated writes to be sent to all peers, and must wait for acknowledgments of those replicated writes to come back before the write is considered complete. You’re very likely up against the round trip time between your nodes.

Asynchronous replication only requires that a write be placed into the local TCP send buffer to consider the write complete, but there are some operations that DRBD must send to peers and wait for the ack (like swapping a hot activity log extent) before it can continue. This makes DRBD’s asynchronous mode sometimes synchronous, and will happen more often under random write workloads.

DRBD Proxy should help with this, because it allows DRBD to have more buffer space to place replicated writes while we wait for the synchronous operations to be acknowledged.

1 Like

I agree with Matt, nice testing :slight_smile:

One thing that came to my mind is that your 1st test, which you take as a baseline, has all replicas within the same DC. However, none of the other tests share this property, so I would argue that a “1 replica per DC” setup would be more accurate as a baseline.
This could also be split into 2 tests, one with protocol A and one with protocol C, mostly so that we can also see the impact of scaling the replica count on performance.

1 Like

If I understand you correctly, with DRBD Proxy those operations that DRBD must send to peers and wait to be acknowledged no longer stall new writes?

@ghernadi, I agree with your suggestion; testing with one replica per DC as a more accurate baseline makes sense. So, I ran a new set of benchmarks focusing on different replication configurations within the same datacenter, just to remove inter-DC latency from the equation.

4 Replicas (Same DC) - All Sync

This setup ensures full synchronous replication within the same datacenter.

Run 1: 20.37s
Run 2: 19.32s
Run 3: 23.95s
-------------
Avg:  ~21.21s

4 Replicas (Same DC) - All Async

Same as above, but with asynchronous replication.

Run 1: 19.56s
Run 2: 21.69s
Run 3: 20.98s
-------------
Avg:  ~20.74s

4 Replicas (Same DC) - 2 Sync, 2 Async

A hybrid setup: 2 replicas using synchronous replication, 2 using asynchronous.

Run 1: 22.00s
Run 2: 24.22s
Run 3: 25.81s
-------------
AVG:  ~24.01s

Observations & Next Steps

  1. Sync vs Async (Same DC): The difference is relatively small (~2-5%), suggesting that synchronous replication within the same datacenter is fairly efficient.
  2. Hybrid Sync/Async Overhead: The mixed setup (2 sync, 2 async) had slightly worse results (~24.01s avg), but the difference isn’t dramatic. It could be due to coordination overhead, or it might just be normal test variance.

Given these results, I’m really curious about how much DRBD Proxy could help in cross-DC setups. Is there any way to trial DRBD Proxy in my setup to measure the difference? I’d love to see how buffering affects performance compared to my current async tests.

Looking forward to your thoughts!

DRBD Proxy maintains two streams between peers, which allows for prioritization of traffic.

You can use this contact form to request a DRBD Proxy trial: Contact Us - LINBIT (add a link to this forum post if you can).

If you have a recent enough kernel/distro (RHEL9 or Ubuntu 24.04), you might be one of the first non-LINBITers to test DRBD Proxy v4 :slight_smile:

2 Likes

Turns out that DRBD Proxy is not supported by Piraeus. Given that, I’m planning to explore some potential alternative optimizations, especially since I have a fairly fast connection with relatively low latency between my two datacenters.

1. Adjusting the default kernel network stack config

# Increase default and maximum buffer sizes
net.core.rmem_default = 16777216  # 16 MB
net.core.wmem_default = 16777216  # 16 MB
net.core.rmem_max = 33554432      # 32 MB
net.core.wmem_max = 33554432      # 32 MB

# TCP read and write memory buffers (min, default, max)
net.ipv4.tcp_rmem = 131072 16777216 33554432  # Min 128 KB, Default 16 MB, Max 32 MB
net.ipv4.tcp_wmem = 131072 16777216 33554432  # Min 128 KB, Default 16 MB, Max 32 MB

# Global TCP memory allocation (in pages, assuming 4 KB page size)
# Limits total TCP memory usage to around 5 GB
net.ipv4.tcp_mem = 131072 655360 1310720  # ~512 MB, ~2.5 GB, ~5 GB

If I understand correctly, this should allow for more buffer space during high I/O bursts, potentially smoothing out temporary spikes in write activity.
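To make these settings persistent, I’d put them in a sysctl.d drop-in and reload (the file name is arbitrary):

# settings above saved to /etc/sysctl.d/99-drbd-net.conf, then:
sudo sysctl --system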

2. Configuring congestion policies and suspended replication

According to the DRBD documentation, congestion policies (which apply to asynchronous replication, i.e. protocol A) can help in cases where the connection between DC A & DC B degrades. This could prevent unnecessary replication slowdowns when there are temporary network hiccups.

Example configuration:

resource <resource> {
  net {
    on-congestion pull-ahead;
    congestion-fill 3G;
    congestion-extents 2000;
    ...
  }
  ...
}
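Since LINSTOR manages the DRBD configuration in my cluster, I assume I’d set these through a LinstorNodeConnection like before, rather than editing resource files by hand. A sketch, assuming the DrbdOptions/Net/* property keys map one-to-one onto the net options above:

apiVersion: piraeus.io/v1
kind: LinstorNodeConnection
metadata:
  name: cross-dc-congestion
spec:
  selector:
    - matchLabels:
        - key: topology.kubernetes.io/region
          op: NotSame
  properties:
    - name: DrbdOptions/Net/protocol
      value: A
    # assumed property names mirroring the net options above
    - name: DrbdOptions/Net/on-congestion
      value: pull-ahead
    - name: DrbdOptions/Net/congestion-fill
      value: "3G"
    - name: DrbdOptions/Net/congestion-extents
      value: "2000"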

Does this seem like a reasonable approach, and do the values I’ve set make sense, or would you recommend any adjustments? I’ll post another update once I’ve completed testing and have some results to share.

1 Like