We are a cloud computing provider currently using OpenStack with a Ceph backend. To achieve higher throughput and lower latency for our block storage, we are seriously evaluating Linstor + DRBD as a high-performance backend for our OpenStack Cinder service.
Our environment is large. Each OpenStack cluster has over 200 compute nodes. Our proposed architecture is to run an HA Linstor Controller and configure all 200+ compute nodes as Linstor Satellites.
Main Question
We saw a notice and a recommendation in the documentation stating that a cluster “must be 5 nodes.” This has caused some confusion for us.
What does this 5-node limit refer to? Does it perhaps refer to the Linstor Controller’s database quorum (e.g., if using an external etcd cluster)?
Is there a hard limit on the number of Satellite nodes that can connect to a single Linstor Controller?
Can a single Linstor cluster (with an HA controller) effectively and performantly manage 200+ Satellite nodes?
Other Considerations
Given this scale (200+ nodes), what other critical points, potential bottlenecks, or best practices should we consider for this deployment?
For example:
Performance impacts on the Linstor Controller(s)?
Specific network design considerations for DRBD replication traffic at this scale?
Any known scalability limitations with the Linstor Cinder driver itself?
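For reference, the kind of backend stanza we have in mind for cinder.conf is sketched below. The driver class is the in-tree LINSTOR Cinder driver; the backend name and controller URI are placeholders for our environment, and the option names should be double-checked against the driver version in use:

```
[DEFAULT]
enabled_backends = linstor-drbd

[linstor-drbd]
volume_backend_name = linstor-drbd
volume_driver = cinder.volume.drivers.linstordrv.LinstorDrbdDriver
# Hypothetical endpoint; would point at our (HA) LINSTOR Controller.
linstor_default_uri = linstor://linstor-controller.example.com
```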
I see that the example cluster in the OpenStack chapter of the LINSTOR User Guide is a 5-node cluster, but I don’t see a stated hard requirement that it “must be 5 nodes”.
It is Monday though. Maybe you can rub my nose in it?
Barring anything related to platform constraints, the minimum DRBD quorum requirement would be three nodes.
I mean this part of the following DRBD documentation:
> **1.5.7. Maximum number of nodes accessing a resource**
> There is a limit of 32 nodes that can access the same DRBD resource concurrently. In practice, clusters of more than five nodes are not recommended. (DRBD 9.0 User's Guide, LINBIT)
There is no “5-node” limit. In fact, the lower limit is technically one node, but that won’t be a particularly useful “cluster”. A 3-node minimum is the recommendation, as you’ll gain quorum for each replicated resource.
200-300 nodes is certainly feasible for a single LINSTOR cluster, but please keep in mind it is a single-controller architecture. Even the “HA LINSTOR Controller” deployment is active/passive and will likely show its limits approaching 1000+ nodes.
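As a sketch of what that active/passive setup involves: the controller is typically failed over by DRBD Reactor's promoter plugin, with the controller database on a small dedicated DRBD resource. The `linstor_db` resource name below follows the user guide's example; treat the exact file as a sketch, not a drop-in config:

```
# /etc/drbd-reactor.toml (sketch) -- promote whichever node holds the
# "linstor_db" DRBD resource and start the controller service there.
[[promoter]]
[promoter.resources.linstor_db]
start = ["var-lib-linstor.mount", "linstor-controller.service"]
```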
Yes.
The 32-node limitation is the replica count for a single resource. You would never set this higher than 3 in a setup where each virtual machine disk image is mapped 1:1 to a LINSTOR (DRBD) resource. The default replica count is 2. The idea is to survive a node failure and still have your data intact.
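Concretely, that 1:1 mapping with a fixed replica count is usually expressed through a LINSTOR resource group, which volumes then inherit their placement from. A sketch with placeholder names (`vm-disks`, `nvme_pool`):

```shell
# Resource group placing 3 replicas in storage pool "nvme_pool"
# (group and pool names are placeholders).
linstor resource-group create vm-disks --storage-pool nvme_pool --place-count 3
linstor volume-group create vm-disks

# Spawn a 20 GiB resource; LINSTOR selects 3 satellites automatically.
linstor resource-group spawn-resources vm-disks vm-101-disk-0 20G
```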
Larger clusters tend to rely more on diskless attachment, but the same principles apply to 200+ node clusters as they would to 3-node clusters: individual volumes are replicated amongst the satellites with local storage, you need a network capable of handling the replication I/O overhead, you need storage that can handle the I/O, and so on.
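For illustration, a diskless attachment on a compute node that holds no local replica is a single command (node and resource names are placeholders); reads and writes are then shipped over the DRBD transport to the diskful peers:

```shell
# "compute-042" gains access to the volume without storing another copy.
linstor resource create compute-042 vm-101-disk-0 --diskless
```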
Thanks @Ryan , that was a great help. Another question I forgot to ask earlier: is there any solution for on-disk data replication, such as DRBD, if we want to use SPDK + NVMe-oF as a backend?
I mean this part of the documentation:
> **3.5.1. NVMe-oF/NVMe-TCP LINSTOR Layer**
> NVMe-oF/NVMe-TCP allows LINSTOR to connect diskless resources to a node with the same resource where the data is stored over NVMe fabrics. This leads to the advantage that resources can be mounted without using local storage by accessing the data over the network. LINSTOR is not using DRBD in this case, and therefore NVMe resources provisioned by LINSTOR are not replicated; the data is stored on one node.
The “NVMe-oF/NVMe-TCP LINSTOR Layer” forgoes DRBD replication, which is (usually) the whole point of LINSTOR. I will take a wild guess that this is probably not what you want. I wouldn’t be surprised if this is one of the least used features in LINSTOR.
DRBD can perform diskless attachment for nodes that do not have local storage. This is likely what you want, and RDMA transport is also supported.
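To illustrate the RDMA option: in a hand-written DRBD 9 configuration, the transport is selected in the `net` section, roughly as below. LINSTOR generates these resource files itself, so this is only to show the knob involved; the resource name is a placeholder, and the RDMA transport kernel module must be installed on the nodes:

```
resource vm-101-disk-0 {
  net {
    transport rdma;   # default is tcp
  }
  # on/connection sections with node addresses follow here
}
```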