Linstor DRBD on Proxmox Backup Server

Hello everyone.

I have a “small” question about Linstor’s DRBD installation.

A few years ago, I built a 3-node PVE cluster using 2 PVE nodes as the virtual machine hosts and 1 PVE node for backups.

The backup node is the Linstor controller (diskless) and the other 2 are Linstor satellites (with SSDs).

Recently, I’ve been looking at Proxmox Backup Server and was wondering if it could replace the PVE backup node.

Is it possible to properly install Linstor DRBD and build a Linstor cluster if I’m using PBS on the 3rd node? It would be declared as “diskless” and used only for the quorum vote.

Also, I should mention that running PVE and PBS together on the 3rd node isn’t possible; it has to be PBS alone.

Thank you in advance for your time and help.

If you need more information, I will do my best to provide it.

Kind regards,

Linstor doesn’t care what other software is running on the satellite nodes, and if you configure the third node without any storage pools, it will only be used as a diskless node.

Proxmox VE probably doesn’t care that there’s a third Linstor node. I guess it would see it if it asks the Linstor API for a list of all nodes. But it’s never going to try to place a VM there, since it’s not a Proxmox node.

So I don’t see any problem with running a satellite on a PBS server, just for the tiebreaker role.

Hi.

First of all, thank you for your fast reply.

Before reading your answer, I thought about it a little more. What could possibly prevent the PBS node from being the Linstor controller?

For me, it just comes down to adding LINBIT’s repository and installing the required packages on the nodes, right? And maybe the PBS “controller” node doesn’t even need the drbd-dkms/drbd-utils packages.

So I tried it, filtering out some of my older commands that were originally meant for a PVE node, and it worked.

I suppose that as long as you have a Debian base that matches the version of the added LINBIT repository, it should work almost as intended. (PBS’s Debian version probably must match the “proxmox-$PVERS” variable used when adding the repository; right now it’s PVERS=8 for Debian 12, and PBS 3 is based on Debian 12 too.)
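
For reference, here is roughly the sequence I used, adapted from LINBIT’s public documentation. Treat it as a sketch rather than the exact procedure: the repository line, component names and package list may differ depending on your LINBIT subscription and on which role the node plays.

PVERS=8   # Debian 12 based PVE 8 / PBS 3
wget -O /tmp/linbit-pubkey.asc https://packages.linbit.com/package-signing-pubkey.asc
gpg --yes -o /etc/apt/trusted.gpg.d/linbit-keyring.gpg --dearmor /tmp/linbit-pubkey.asc
echo "deb [signed-by=/etc/apt/trusted.gpg.d/linbit-keyring.gpg] http://packages.linbit.com/public/ proxmox-$PVERS drbd-9" \
    > /etc/apt/sources.list.d/linbit.list
apt update
# headers so drbd-dkms can build the kernel module, plus the satellite/client bits
apt install -y pve-headers-$(uname -r) drbd-dkms drbd-utils linstor-satellite linstor-client
# only on the node that should run the controller:
apt install -y linstor-controller
# the PVE nodes additionally need the linstor-proxmox storage plugin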

If I notice anything weird happening later, I’ll post an update here.

Kind regards,

I think those packages are required, because the satellite needs to create /dev/drbdXXXX devices which communicate with those on the other two nodes - even though they don’t directly manage any local disks - if you want the tiebreaker (anti split-brain) functionality.

Whether it’s a “controller” or not is another issue. You could run the controller on any of the three nodes; its node type would be set as “combined” rather than “satellite” in linstor node create.
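
For example, registering such a node could look like this (name and address are placeholders):

linstor node create backup 172.16.0.1 --node-type Combined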

Your statement seems to be absolutely right.

I just checked my older “backup node”, a PVE host acting as the Linstor “controller”, and it does have /var/lib/linstor.d/.backup/ populated with resource files from the other 2 nodes, even though it is a diskless member of the Linstor cluster.

I will now try to just add the required packages on the PBS node and see if a delete/add of the test resource will create these files.

I need to recreate the Linstor cluster properly (linstor node create…) because it seems that the node’s capabilities were saved when it was created:

Details:
    Supported storage providers: [diskless, lvm, lvm_thin, zfs, zfs_thin, file, file_thin, remote_spdk, ebs_init, ebs_target]
    Supported resource layers  : [nvme, writecache, cache, storage]
    Unsupported storage providers:
        SPDK: IO exception occured when running 'rpc.py spdk_get_version': Cannot run program "rpc.py": error=2, No such file or directory
        EXOS: IO exception occured when running 'lsscsi --version': Cannot run program "lsscsi": error=2, No such file or directory
        STORAGE_SPACES: This tool does not exist on the Linux platform.
        STORAGE_SPACES_THIN: This tool does not exist on the Linux platform.

    Unsupported resource layers:
        DRBD: DRBD version has to be >= 9. Current DRBD version: 8.4.11
        LUKS: IO exception occured when running 'cryptsetup --version': Cannot run program "cryptsetup": error=2, No such file or directory
        BCACHE: IO exception occured when running 'make-bcache -h': Cannot run program "make-bcache": error=2, No such file or directory

Now I have another problem, which could be because I installed things in the “wrong” order?

The DRBD version on the PBS node is detected as 8.4.11, even though I did install the right version:

drbd-dkms/unknown,now 9.2.14-1 all [installed]
drbd-utils/unknown,now 9.31.0-1 amd64 [installed]

For now, I need to find out why, because the resources won’t be created on the PBS node if it isn’t registered as [DRBD] “capable”.

/var/lib/linstor.d/.backup/

Interesting, I didn’t know about that until now :)

The DRBD version on the PBS node is detected as 8.4.11, even though I did install the right version :(

Try cat /sys/module/drbd/version. If it’s wrong, then rmmod drbd and modprobe drbd. If that fixes the version then restart the satellite. Recheck linstor node info -f.

If necessary linstor node reconnect ..., or at worst, linstor node delete and linstor node create to re-add the node.
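
Putting those steps together, the check/reload sequence would look roughly like this (assuming the satellite runs as the usual linstor-satellite systemd service, and that no DRBD resources are currently up on the node):

cat /sys/module/drbd/version          # what the kernel has loaded right now
rmmod drbd && modprobe drbd           # unload the in-tree 8.4 module, load the DKMS 9.x one
cat /sys/module/drbd/version          # should now report 9.x
systemctl restart linstor-satellite   # let the satellite re-detect its capabilities
linstor node info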

I knew about rmmod drbd and modprobe drbd but it didn’t work this time, weird.

I don’t know if this is exactly why, but I ran apt install pve-headers-$(uname -r) on the PBS, then apt -y install --reinstall drbd-dkms drbd-utils, and finally another rmmod drbd ; modprobe drbd, and it worked:
version: 9.2.14 (api:2/proto:118-123)
GIT-hash: a1e7c10e591a844b327da120d169df7da7c933b7 build by root@backup, 2025-07-29 13:38:40
Transports (api:21): tcp (9.2.14)

I will probably reinstall the nodes cleanly later, as these are just tests to get everything working as intended.
I’m a bit worried that the PBS is showing a “pve” kernel now:
uname -r -> 6.8.12-12-pve. Or was that always the case?

After checking the kernel on my other PBS node, which is 100% separate from any PVE environment, it turns out that one also runs a “pve” kernel:
6.8.12-11-pve
I’m relieved that this is normal and that they simply ship “-pve”-named kernels for PBS.

Regarding the Linstor cluster: after deleting everything old with linstor commands (nodes, resource-definitions, resource-groups…), I created the cluster again, and now the PBS backup node is DRBD-capable and is in fact acting as TieBreaker for my test resources:

╭────────────────────────────────────────────────────────────────────────╮
┊ Node   ┊ DRBD ┊ LUKS ┊ NVMe ┊ Cache ┊ BCache ┊ WriteCache ┊ Storage ┊
╞════════════════════════════════════════════════════════════════════════╡
┊ backup ┊ +    ┊ -    ┊ +    ┊ +     ┊ -      ┊ +          ┊ +       ┊
┊ srv1   ┊ +    ┊ -    ┊ +    ┊ +     ┊ -      ┊ +          ┊ +       ┊
┊ srv2   ┊ +    ┊ -    ┊ +    ┊ +     ┊ -      ┊ +          ┊ +       ┊
╰────────────────────────────────────────────────────────────────────────╯
root@backup:~# linstor r l
╭─────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName ┊ Node   ┊ Layers       ┊ Usage  ┊ Conns ┊      State ┊ CreatedOn           ┊
╞═════════════════════════════════════════════════════════════════════════════════════════════╡
┊ pm-61dae041  ┊ backup ┊ DRBD,STORAGE ┊ Unused ┊ Ok    ┊ TieBreaker ┊ 2025-07-29 14:32:32 ┊
┊ pm-61dae041  ┊ srv1   ┊ DRBD,STORAGE ┊ Unused ┊ Ok    ┊   UpToDate ┊ 2025-07-29 14:32:32 ┊
┊ pm-61dae041  ┊ srv2   ┊ DRBD,STORAGE ┊ Unused ┊ Ok    ┊   UpToDate ┊ 2025-07-29 14:32:33 ┊
┊ pm-d2da2116  ┊ backup ┊ DRBD,STORAGE ┊ Unused ┊ Ok    ┊ TieBreaker ┊ 2025-07-29 14:32:26 ┊
┊ pm-d2da2116  ┊ srv1   ┊ DRBD,STORAGE ┊ Unused ┊ Ok    ┊   UpToDate ┊ 2025-07-29 14:32:26 ┊
┊ pm-d2da2116  ┊ srv2   ┊ DRBD,STORAGE ┊ Unused ┊ Ok    ┊   UpToDate ┊ 2025-07-29 14:32:27 ┊
┊ pm-fa6753cf  ┊ backup ┊ DRBD,STORAGE ┊ Unused ┊ Ok    ┊ TieBreaker ┊ 2025-07-29 14:32:29 ┊
┊ pm-fa6753cf  ┊ srv1   ┊ DRBD,STORAGE ┊ Unused ┊ Ok    ┊   UpToDate ┊ 2025-07-29 14:32:29 ┊
┊ pm-fa6753cf  ┊ srv2   ┊ DRBD,STORAGE ┊ Unused ┊ Ok    ┊   UpToDate ┊ 2025-07-29 14:32:30 ┊
╰─────────────────────────────────────────────────────────────────────────────────────────────╯

As we can see, the backup node also created the resource files (1 resource per Windows VM disk):

root@backup:~# ls -la /var/lib/linstor.d/
total 24
drwxr-xr-x  3 root root 4096 Jul 29 14:32 .
drwxr-xr-x 33 root root 4096 Jul 29 12:15 ..
drwxr-xr-x  2 root root 4096 Jul 29 14:32 .backup
-rw-r--r--  1 root root    0 Jul 29 14:32 loop_device_mapping
-rw-r--r--  1 root root 2010 Jul 29 14:32 pm-61dae041.res
-rw-r--r--  1 root root 2010 Jul 29 14:32 pm-d2da2116.res
-rw-r--r--  1 root root 2010 Jul 29 14:32 pm-fa6753cf.res

root@backup:~# ls -la /var/lib/linstor.d/.backup
total 20
drwxr-xr-x 2 root root 4096 Jul 29 14:32 .
drwxr-xr-x 3 root root 4096 Jul 29 14:32 ..
-rw-r--r-- 1 root root 2010 Jul 29 14:32 pm-61dae041.res
-rw-r--r-- 1 root root 2010 Jul 29 14:32 pm-d2da2116.res
-rw-r--r-- 1 root root 2010 Jul 29 14:32 pm-fa6753cf.res

Welcome to the forums.

Looks like you’re on the right track. LINSTOR needs at least two diskful nodes with storage; the 3rd node can be exclusively diskless. That node is still a LINSTOR satellite, and DRBD is required on it even though there is no local storage, if you want the tiebreaker functionality. This 3rd node is also a good candidate for running the LINSTOR controller, as a combined LINSTOR node.

Your cluster configuration is exactly what we refer to as the “minimal” LINSTOR deployment pattern; you just happen to be using PBS on your 3rd node.
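
For anyone finding this thread later, a minimal sketch of that pattern could look like the following. Node names, addresses, the LVM thin pool and the resource-group settings are only placeholders here; adjust them to your own environment:

linstor node create srv1 172.16.0.2 --node-type Satellite
linstor node create srv2 172.16.0.3 --node-type Satellite
linstor node create backup 172.16.0.1 --node-type Combined   # diskless tiebreaker, also runs the controller
linstor storage-pool create lvmthin srv1 pve-sp vg_ssd/thinpool
linstor storage-pool create lvmthin srv2 pve-sp vg_ssd/thinpool
# no storage pool on "backup", so it can only ever hold diskless resources
linstor resource-group create pve-rg --storage-pool pve-sp --place-count 2
linstor volume-group create pve-rg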

One thing to mention about kernels: it is always recommended to run the same kernel throughout a cluster if possible (not just the same DRBD kernel module version). If the “PVE kernel” is easily installable on the PBS node, even if you have to add a Proxmox repo to make the same kernel available on PBS, that’s what I would recommend.

Hi, and thank you Ryan :)

I will gladly remember the kernel advice.

I’m now going to try to get the SSL/TLS encryption working for satellite communication, and also look into NIC failover.

For example: all 3 of my nodes have 2 NICs, 1 WAN and 1 LAN. If the LAN NIC goes down, I want the cluster to stay alive by switching the controller/satellite communications that were on the LAN NIC over to the WAN NIC. Even if the WAN NIC is several times slower, that should be better than risking the cluster going down.

To keep things clear, I will create another post about it and, if the forum rules permit it, link it in this reply for “post continuity”.

Kind regards,

Hello everyone again.

I have another question regarding my setup.
I’m actively trying to grasp what the worst-case scenario could be here and how things could go wrong (data corruption, for example), in order to understand it better and protect myself as much as possible.

First, here is a summary of my setup:

  • 3 servers in total → 2 PVE servers + 1 PBS server as a QDevice.

Each server has 1 LAN IPv4 address and 2 WAN IP addresses (1 IPv4 + 1 IPv6).

On the Proxmox cluster side

Given that the QDevice can only use one IP address, I use the LAN IP for the PBS. For the 2 PVE nodes, I specified 3 links for the pve-cluster “just in case” (see the sketch after this list):

  • LAN IP → link 0
  • WANv4 → link 1
  • WANv6 → link 2
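
To illustrate, the nodelist in /etc/pve/corosync.conf ends up looking roughly like this (the addresses below are placeholders for my real ones):

nodelist {
  node {
    name: gx-srv1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 172.16.0.2      # LAN / vRACK  -> link 0
    ring1_addr: 141.XX.XX.XX    # WAN IPv4     -> link 1
    ring2_addr: 2001:db8::2     # WAN IPv6     -> link 2
  }
  # ... same pattern for gx-srv2
}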

I don’t know if I’m already being too paranoid, as it should be extremely rare for the IPv4/IPv6 stack of the WAN NIC to go down. But if it does go down, is it only for that one NIC in particular, or for the whole server?

Question 1. I am not sure whether it’s useful to specify both NICs (3 IPs) for the pve-cluster, or whether it could make things worse Proxmox-HA-wise.
Wouldn’t it be better to only use the LAN NIC for the 2 PVE nodes, since the PBS’s QDevice can only use the LAN NIC?
The way I understand it, with this setup, if one node’s corosync stops sending packets to the other nodes, that server will reboot itself after 2-3 minutes and the VMs will follow the HA settings. (For me, it’s the default, so “conditional”.)

But with multiple net links (as “failover”), this won’t happen, right?

On the Linstor DRBD cluster side

I created the Linstor cluster using each node’s LAN IP address, and after following the documentation, I set up the linstor-controller HA resource (on the PVE nodes), since the PBS is a diskless TieBreaker.

I tested the shared storage by creating a test VM and it worked: the resource got created on all nodes and the PBS is acting as TieBreaker:

root@gx-srv1:~# linstor r l
╭─────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName ┊ Node      ┊ Layers       ┊ Usage  ┊ Conns ┊      State ┊ CreatedOn           ┊
╞═════════════════════════════════════════════════════════════════════════════════════════════╡
┊ linstor_db   ┊ gx-backup ┊ DRBD,STORAGE ┊ Unused ┊ Ok    ┊ TieBreaker ┊ 2025-08-06 16:12:31 ┊
┊ linstor_db   ┊ gx-srv1   ┊ DRBD,STORAGE ┊ InUse  ┊ Ok    ┊   UpToDate ┊ 2025-08-06 16:12:31 ┊
┊ linstor_db   ┊ gx-srv2   ┊ DRBD,STORAGE ┊ Unused ┊ Ok    ┊   UpToDate ┊ 2025-08-06 16:12:31 ┊
┊ pm-f99db4df  ┊ gx-backup ┊ DRBD,STORAGE ┊ Unused ┊ Ok    ┊ TieBreaker ┊ 2025-08-11 10:02:29 ┊
┊ pm-f99db4df  ┊ gx-srv1   ┊ DRBD,STORAGE ┊ InUse  ┊ Ok    ┊   UpToDate ┊ 2025-08-11 10:02:28 ┊
┊ pm-f99db4df  ┊ gx-srv2   ┊ DRBD,STORAGE ┊ Unused ┊ Ok    ┊   UpToDate ┊ 2025-08-11 10:02:29 ┊
╰─────────────────────────────────────────────────────────────────────────────────────────────╯

Linstor DRBD tests
Then I wanted to see what would happen if I simulated a failure by switching off the LAN NIC on the “srv1” node using ip link set dev lanNIC down.

I will try to explain the situation; there will most likely be some misunderstandings on my part about how this works, so feel free to correct me and I will take the opportunity to learn.

After executing the command on srv1, the linstor-controller is now working on the srv2 node as intended.

A few minutes later, the PVE GUI shows that the VM is still on the “srv1” node:

The drbdstorage information isn’t available from either node.
srv1 → it’s timing out, which is normal because the LAN NIC is down. (Both PVE nodes’ LAN IP addresses are specified in /etc/pve/storage.cfg in the controller argument.)
srv2 → Usage N/A.

Linstor DRBD-wise, the nodes look like this:

root@gx-srv2:~# linstor n l
╭────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ Node      ┊ NodeType ┊ Addresses                ┊ State                                        ┊
╞════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ gx-backup ┊ COMBINED ┊ 172.16.0.1:3366 (PLAIN) ┊ Online                                       ┊
┊ gx-srv1   ┊ COMBINED ┊ 172.16.0.2:3366 (PLAIN) ┊ OFFLINE (Auto-eviction: 2025-08-11 11:31:05) ┊
┊ gx-srv2   ┊ COMBINED ┊ 172.16.0.3:3366 (PLAIN) ┊ Online                                       ┊
╰────────────────────────────────────────────────────────────────────────────────────────────────╯

This is normal because the srv1 node can no longer communicate with the others (and vice versa), so it will be evicted automatically.

Now for the resources:

root@gx-srv2:~# linstor r l
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName ┊ Node      ┊ Layers       ┊ Usage  ┊ Conns               ┊      State ┊ CreatedOn           ┊
╞═══════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ linstor_db   ┊ gx-backup ┊ DRBD,STORAGE ┊ Unused ┊ Connecting(gx-srv1) ┊ TieBreaker ┊ 2025-08-06 16:12:31 ┊
┊ linstor_db   ┊ gx-srv1   ┊ DRBD,STORAGE ┊        ┊                     ┊    Unknown ┊ 2025-08-06 16:12:31 ┊
┊ linstor_db   ┊ gx-srv2   ┊ DRBD,STORAGE ┊ InUse  ┊ Connecting(gx-srv1) ┊   UpToDate ┊ 2025-08-06 16:12:31 ┊
┊ pm-f99db4df  ┊ gx-backup ┊ DRBD,STORAGE ┊ Unused ┊ Connecting(gx-srv1) ┊ TieBreaker ┊ 2025-08-11 10:02:29 ┊
┊ pm-f99db4df  ┊ gx-srv1   ┊ DRBD,STORAGE ┊        ┊                     ┊    Unknown ┊ 2025-08-11 10:02:28 ┊
┊ pm-f99db4df  ┊ gx-srv2   ┊ DRBD,STORAGE ┊ Unused ┊ Connecting(gx-srv1) ┊   UpToDate ┊ 2025-08-11 10:02:29 ┊
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────╯
root@gx-srv2:~# linstor v l
╭─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ Resource    ┊ Node      ┊ StoragePool          ┊ VolNr ┊ MinorNr ┊ DeviceName    ┊ Allocated ┊ InUse  ┊      State ┊ Repl           ┊
╞═════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ linstor_db  ┊ gx-backup ┊ DfltDisklessStorPool ┊     0 ┊    1000 ┊ /dev/drbd1000 ┊           ┊ Unused ┊ TieBreaker ┊ Established(1) ┊
┊ linstor_db  ┊ gx-srv1   ┊ pve-sp               ┊     0 ┊    1000 ┊ None          ┊           ┊        ┊    Unknown ┊                ┊
┊ linstor_db  ┊ gx-srv2   ┊ pve-sp               ┊     0 ┊    1000 ┊ /dev/drbd1000 ┊ 23.50 MiB ┊ InUse  ┊   UpToDate ┊ Established(1) ┊
┊ pm-f99db4df ┊ gx-backup ┊ DfltDisklessStorPool ┊     0 ┊    1004 ┊ /dev/drbd1004 ┊           ┊ Unused ┊ TieBreaker ┊ Established(1) ┊
┊ pm-f99db4df ┊ gx-srv1   ┊ pve-sp               ┊     0 ┊    1004 ┊ None          ┊           ┊        ┊    Unknown ┊                ┊
┊ pm-f99db4df ┊ gx-srv2   ┊ pve-sp               ┊     0 ┊    1004 ┊ /dev/drbd1004 ┊  8.51 GiB ┊ Unused ┊   UpToDate ┊ Established(1) ┊
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯ 

drbdadm status from the 2 PVE nodes:

root@gx-srv1:~# drbdadm status
linstor_db role:Secondary
  disk:UpToDate quorum:no open:no
  gx-backup connection:Connecting
  gx-srv2 connection:Connecting

pm-f99db4df role:Primary
  disk:UpToDate quorum:no open:yes
  gx-backup connection:Connecting
  gx-srv2 connection:Connecting
root@gx-srv2:~# drbdadm status
linstor_db role:Primary
  disk:UpToDate open:yes
  gx-backup role:Secondary
    peer-disk:Diskless
  gx-srv1 connection:Connecting

pm-f99db4df role:Secondary
  disk:UpToDate open:no
  gx-backup role:Secondary
    peer-disk:Diskless
  gx-srv1 connection:Connecting

For now, the test VM is still functional on the srv1 node. It can’t be accessed from the internet because the VM gets its IP through the PVE host’s LAN NIC; that in itself is not a problem, it’s just the way it works with our server provider.

Still, the VM is considered down from my point of view because, for example, the websites aren’t accessible anymore.

Here is where I would like a big clarification:

As I understand it, PVE HA doesn’t care whether the VM has internet access or not; it only cares about the VM status (is it started, stopped…).
In this case, the VM is started and the pve-cluster has quorum (srv1 is not entirely offline because the WAN link is still working), so srv1 is not “marked” as offline (probably because of the multiple net links) and HA will not try to move the VM to the other node.

From the VM’s point of view, the drbdstorage is working.

On the Linstor DRBD side, srv1’s storage will continue to work even though it can’t communicate with the other nodes, but the other nodes won’t receive new data because srv1’s LAN NIC is down.

Question 2. If srv1’s LAN NIC is down and srv1 cannot communicate/send new data blocks to the other node, will there be some data loss? I think so, because there is bound to be data still being written even though the VM is no longer accessible from the internet.
Also, if srv1 suddenly stops while the LAN NIC is down, RAM will be cleared by the shutdown/reboot and all that unreplicated data will probably be lost.
Finally, as long as srv1’s LAN NIC is not working again and srv1 is not restarted/shut down in any way, will the VM never be moved to the other node?

I hope all of this is clear enough. If not, tell me and I will try to clarify further.

Thank you again for your help and advice.

Kind regards,

Edit: I have read up again and again on PVE’s fencing. From my tests, it seems that if corosync.conf only has one net link per node and I shut that link down, the softdog/ha-manager will initiate a reboot of the node and the resources will be moved to the other node.

If I declare another net link (link0 and link1) and shut down link0 again, nothing happens HA-wise.

It seems that for my use case, either the pve-cluster uses only 1 net link and Linstor DRBD is set up the same way, or I find a way to use multiple net links in both the pve-cluster and the linstor cluster.

I hope it will not add too much complexity and will work well. I’ll take a look at https://linbit.com/drbd-user-guide/linstor-guide-1_0-en/#s-managing_network_interface_cards .

It sounds like you’re in a situation where you have multiple network paths used by Proxmox, but you are simulating a storage layer network failure.

Yes, the VM continues to run on the Proxmox node, and the DRBD resource (pm-f99db4df) remains in a primary state and shows a loss of quorum, just as expected.

In this case, from a Proxmox VM management perspective, nothing is wrong. The VM continues to run. From a LINSTOR management perspective you need to fix your network issue, or make a decision to stop and migrate the VM elsewhere if needed (yes, you would lose what’s in memory and disk on the current VM in this case).

If you simply fix your networking issue (bring the link back up), the DRBD resource would sync any recent changes (writes) back to the peers, quorum would be reestablished on the primary, and everything would be back to normal.

In a true “failover scenario”: if you stopped and started the VM on your other Proxmox node, that resource would now be primary there, and since that side had quorum the entire time, I believe that when the former primary reconnects it will discard its data (the differences between the two primaries), sync the changes from the new primary, and obviously be demoted to secondary.

If you have multiple NICs between servers I suggest looking into using network bonding.
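
As a rough sketch (interface names and the address are just examples, and this of course assumes both ports can reach the same network), an active-backup bond on a Proxmox/Debian host would look something like this in /etc/network/interfaces:

auto bond0
iface bond0 inet manual
        bond-slaves eno1 eno2
        bond-miimon 100
        bond-mode active-backup
        bond-primary eno1

auto vmbr1
iface vmbr1 inet static
        address 172.16.0.2/24
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0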

Thank you Ryan for your answer.

Could you please correct me if I’m wrong (and I probably will be):

  • There is no “simple”, straightforward way to specify a failover NIC/IP address in Linstor DRBD, like there is with corosync when creating the PVE cluster?
    The idea would be to create the Linstor nodes, then specify for each node a “secondary” NIC/IP to use in case the “default” link is down.

Concerning your suggestion:
I actually don’t know if it’s doable with my setup, because the servers are from the hosting provider OVH, so I don’t have control over the network. The “NICs” are actually Linux bridges, vmbr0 for WAN and vmbr1 for LAN. I will actively look into this.

  • By using Linux bonding, as I understand it, each node would effectively have a “virtual”/“floating” IP which is backed by the 2 NICs (in my case, the WAN and the LAN).

I would then create the Linstor cluster with those IPs and they would handle the failover/routing themselves.
If it works that way: if a NIC goes down for whatever reason, the second one is still available, and of course if both NICs are down, the server will probably reboot because of the softdog and the other PVE node will switch to primary.

  • Finally, if I’m not wrong about all of this, could bonding be used to compensate for the fact that you can only set up one IP address with the QDevice?

Again, thank you so much for your time and help. I appreciate it and learn every day.

Kind regards.

The problem with this approach is that LINSTOR (DRBD, really) will not handle a link-layer failure (unplugging a cable from a NIC, for example), but network bonding will handle that event just fine.

By using network bonding you still have a “normal” IP address, just redundant links. The routing never changes if you lose a link, the networking still functions the same way as before, you’re just using bonding.

Bonding can be a way to skip the need for configuring redundant links in Corosync. If a qdevice truly only supports one IP address, bonding can be used to make that link more resilient, yes.

That seems convenient.

I’ll gladly look into link bonding and post an update. I hope it can be done without having control over the network.

Thanks !

Hello, I’m back.

I looked into Linux Bonding and from my understanding, I cannot use it with my setup.
The NICs are in totally different networks so it’s just not feasible.

As an example, for one of my servers:

  • eth0 (vmbr0) is WAN, with a public IP: 141.XX.XX.XX
  • eth1 (vmbr1) is LAN (OVH’s private network named vRACK), with a private IP in the 172.16.0.0/12 subnet.

I then looked into a weird workaround, trying to set up WireGuard on both srv1 and srv2, which was supposed to use both vmbr0 and vmbr1, but that isn’t working either. (It was a test to give each node, srv1 and srv2, 1 IP address that can pass through both NICs.)

After that, I checked keepalived and it doesn’t seem to do what I want (or just isn’t meant to, because my original idea is weird).

It may be completely unrealistic with this setup, but what I want/need is:

Each server in my 3-node cluster has 2 NICs (I don’t count the PBS server, as it will not host VMs).
srv1 and srv2 both have vmbr0, which carries their WAN IP, and vmbr1, which carries their vRACK LAN IP.

The vRACK network is around 25 Gbit/s and the WAN is 3 Gbit/s.

For the Proxmox cluster and the Linstor cluster/DRBD, I want the servers to use the fastest link, i.e. the vRACK one. If the vRACK link (vmbr1) goes down, I want the Proxmox cluster and DRBD to notice it and fall back to the WAN link (vmbr0).
The idea is that yes, this link is slower, but at least the clusters and the nodes are still alive.

For the Proxmox cluster, the PBS QDevice can only use one TCP/IP link, so it’s using the vmbr1 one. If that goes down, it’s down; it cannot use vmbr0 as a failover link the way corosync can.

As I’m writing this, I’m thinking that maybe my whole idea is just impossible: anything more would need additional NICs, or control over the network, to resolve this.

I will probably just use vmbr1 for everything (Proxmox’s quorum and the Linstor cluster), and if it goes down, the node will reboot and the second node will be promoted to primary, and vice versa.

Maybe you have an idea or something else to propose; otherwise, that is what I will do :-/

Either way, thank you very much for your time and help.

Kind regards,

This is likely the best approach moving forward. If you’re deployed inside a colocation environment without true control over your networking configuration, and only one link has access to the WAN, then this sounds more than acceptable.

Sometimes the simpler approach is better. From what you’ve mentioned, it sounds like your vRACK network should be reliable for both LINSTOR and Proxmox.

Hi,

Thank you, your comment comes at the right time and reinforces my decision.
I shouldn’t worry any longer about something I have no control over.

I’ve looked into node-connection and resource-connection for a more future-proof setup, in case we ever need beefier servers with more than 2 NICs. I’ve also seen that “linstor path-mesh” is a feature coming later that should simplify this implementation a lot.
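
For future reference, here is a sketch of what I understand that would look like with node-connection paths. The extra interface name, the path names and the addresses are hypothetical, and I still need to check that my LINSTOR version supports node-connection paths (otherwise a resource-connection path create per resource would be the fallback):

linstor node interface create gx-srv1 wan0 141.XX.XX.XX
linstor node interface create gx-srv2 wan0 141.YY.YY.YY
# "default" is the netif each node was registered with (the vRACK LAN IP)
linstor node-connection path create gx-srv1 gx-srv2 lan-path default default
linstor node-connection path create gx-srv1 gx-srv2 wan-path wan0 wan0

As far as I understand, with DRBD’s TCP transport only one path is used at a time, so this would give failover between paths rather than combined bandwidth.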

Kind regards,
