Linstor on Proxmox utilising NVMe-oF with RDMA over RoCE

Hi Everyone,

This is my first post to the forum after successfully configuring Linstor on a 2-node Proxmox cluster.

I would now like to optimise the networking by implementing NVMe-oF with RDMA over RoCE, utilising Mellanox ConnectX-4 Lx NICs, but have not been able to find any official documentation or Google search results that outline exactly what needs to be done.

Could anyone assist and/or point me to a website/guide/documentation on how to enable this?

Thanks

For the NIC: https://enterprise-support.nvidia.com/s/article/howto-configure-nvme-over-fabrics

For Linstor: https://linbit.com/drbd-user-guide/linstor-guide-1_0-en/#s-nvme-layer

But you’d be losing the main benefit (IMO) of Linstor, which is DRBD replication.

NVMe-oF/NVMe-TCP allows LINSTOR to connect diskless resources to a node with the same resource where the data is stored over NVMe fabrics. This leads to the advantage that resources can be mounted without using local storage by accessing the data over the network. LINSTOR is not using DRBD in this case, and therefore NVMe resources provisioned by LINSTOR are not replicated, the data is stored on one node.

Linstor then just becomes a volume manager: e.g. it can create an LVM volume on node A, and you can access it from node B using NVMe-oF or NVMe-TCP.
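As a rough, untested sketch of what that might look like (the storage pool and resource names here are placeholders, not from any real setup), you would create a resource group whose layer list uses nvme instead of drbd, and then spawn resources from it:

linstor resource-group create --storage-pool my-thin-pool --layer-list nvme,storage nvme-rg
linstor volume-group create nvme-rg
linstor resource-group spawn-resources nvme-rg vm-disk-1 20G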

(Note: I’ve never run Linstor in this way, this is just from reading the documentation)


This tech guide and this blog offer an alternative approach to using LINSTOR with Proxmox.

Neither use the LINSTOR driver to integrate directly with Proxmox, but instead create HA NVMe-oF clusters that you can attach Proxmox to.

Did you ever get this working? This is something I'm currently interested in, but it's difficult to understand how to use DRBD with NVMe/RoCE.

DRBD is just a block device, just like any other block device. You can export the DRBD device as an NVMe-oF target just as you would any other block device. This guide explains one way to build an HA NVMe-oF target using DRBD and this blog describes how you would connect it to Proxmox VE.
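To make that concrete, here is an entirely manual, untested sketch of exporting a DRBD device through the kernel's nvmet configfs interface over RDMA - the device path, NQN, and IP address below are placeholders:

modprobe nvmet
modprobe nvmet-rdma
cd /sys/kernel/config/nvmet
mkdir subsystems/linbit:nvme:manual-example
echo 1 > subsystems/linbit:nvme:manual-example/attr_allow_any_host
mkdir subsystems/linbit:nvme:manual-example/namespaces/1
echo /dev/drbd1000 > subsystems/linbit:nvme:manual-example/namespaces/1/device_path
echo 1 > subsystems/linbit:nvme:manual-example/namespaces/1/enable
mkdir ports/1
echo 192.168.222.19 > ports/1/addr_traddr
echo rdma > ports/1/addr_trtype
echo 4420 > ports/1/addr_trsvcid
echo ipv4 > ports/1/addr_adrfam
ln -s /sys/kernel/config/nvmet/subsystems/linbit:nvme:manual-example ports/1/subsystems/

The guide and blog above wrap the equivalent steps in resource agents so that failover is handled for you.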

If a gap exists between those two resources let us know what specific question you have and we’ll try to answer it here.

So I've been doing some more work on this. I'm most definitely unfamiliar with a lot of it, but I'm learning. I've read through both your guide and the blog and they were great! The only thing I was unsure about is how they work together with drbd-reactor. The vibe I was getting is that drbd-reactor seems to be the new preferred way to handle resources/quorum with LINSTOR/DRBD - please don't take that statement as fact, this is just my very, very poor understanding. I wasn't sure how to bridge the gap between Pacemaker and drbd-reactor, or what the pros/cons of each are (I may still take a more in-depth look at Pacemaker).

I did eventually learn about linstor-gateway, but had issues changing the transport either during or after creation. I think I'm getting fairly close to pulling everything together; I just need to figure out how to specify RDMA instead of TCP… let me know if you have any insight on this!

That's the vibe we're going for, without being too opinionated. We have a long history of using and supporting Pacemaker at LINBIT - it's a great project and we still use and support it - but for simpler HA clusters, or when users don't have fencing devices but do have DRBD's quorum available, DRBD Reactor is our preferred cluster resource manager.

If you’re trying to configure RDMA as the transport for DRBD’s replication, you would configure this on the LINSTOR resource-group you’re using to create your NVMe-oF targets:

linstor resource-group drbd-options --transport rdma <resource-group-name>
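If you want to confirm the option actually took effect, one way (just a sketch, not a required step) is to check what LINSTOR wrote into the DRBD resource files on a satellite node, or what drbdadm parses from them:

grep -i transport /var/lib/linstor.d/*.res
drbdadm dump <resource> | grep -i transport

Here <resource> stands in for whatever resource name gets created from that resource group.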

If you're trying to configure RDMA as the transport for NVMe-oF initiators accessing the targets, then you'll have to manually edit the DRBD Reactor configurations that LINSTOR Gateway creates, because I don't think LINSTOR Gateway allows you to configure transport types (yet).

For example, say you've created an NVMe-oF target named example using LINSTOR Gateway, like this:

linstor-gateway nvme create linbit:nvme:example 192.168.222.19/24 2G

On the DRBD Reactor hosts, you will see TOML configurations that look like this:

[root@linstor-sat-1 ~]# cat /etc/drbd-reactor.d/linstor-gateway-nvmeof-example.toml
# Generated by LINSTOR Gateway at 2025-09-19 20:41:33.927884708 +0000 UTC m=+298.631107825
# DO NOT MODIFY!

[[promoter]]

  [promoter.metadata]
    linstor-gateway-schema-version = 1

  [promoter.resources]

    [promoter.resources.example]
      on-drbd-demote-failure = "reboot-immediate"
      runner = "systemd"
      start = [
        "ocf:heartbeat:portblock portblock action=block ip=192.168.222.19 portno=4420 protocol=tcp",
        "ocf:heartbeat:Filesystem fs_cluster_private device=/dev/drbd/by-res/example/0 directory=/srv/ha/internal/example fstype=ext4 run_fsck=no",
        "ocf:heartbeat:IPaddr2 service_ip cidr_netmask=24 ip=192.168.222.19",
        "ocf:heartbeat:nvmet-subsystem subsys nqn=linbit:nvme:example serial=b407226372444776",
        "ocf:heartbeat:nvmet-namespace ns_1 backing_path=/dev/drbd/by-res/example/1 namespace_id=1 nguid=eebdf3eb-546a-55a6-bb63-b24a7d5aabaa nqn=linbit:nvme:example uuid=eebdf3eb-546a-55a6-bb63-b24a7d5aabaa",
        "ocf:heartbeat:nvmet-port port addr=192.168.222.19 nqns=linbit:nvme:example type=tcp",
        "ocf:heartbeat:portblock portunblock action=unblock ip=192.168.222.19 portno=4420 protocol=tcp tickle_dir=/srv/ha/internal/example",
      ]
      stop-services-on-exit = true
      target-as = "Requires"

Editing those configuration files on the DRBD Reactor nodes - which may or may not be a great idea (I've not tried to edit a LINSTOR Gateway created DRBD Reactor configuration before) - so that the nvmet-port resource uses type=rdma and the portblock resources reference UDP/4791 (for RoCEv2) might work for you here.
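For illustration only, here is an untested sketch of how the transport-related entries in that start list might look after editing (everything else would stay as generated):

        "ocf:heartbeat:portblock portblock action=block ip=192.168.222.19 portno=4791 protocol=udp",
        "ocf:heartbeat:nvmet-port port addr=192.168.222.19 nqns=linbit:nvme:example type=rdma",
        "ocf:heartbeat:portblock portunblock action=unblock ip=192.168.222.19 portno=4791 protocol=udp tickle_dir=/srv/ha/internal/example",

Note that portblock's tickle_dir trick is TCP-specific, so it probably won't do anything useful once the protocol is UDP.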

It would be interesting to hear whether you can get this stack working, or whether you'll need to stand up an HA target using more manual configuration. Either way, a lot of the manual configuration you might need for the latter can be deduced from what LINSTOR Gateway has configured for you.

Ahh, I tried to connect to the DRBD block device not long ago, but I definitely wasn't using the right port.
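For my own notes, I think the initiator side should look roughly like this - untested, and the address/NQN are just the ones from the example above. My understanding is that NVMe/RDMA still uses 4420 as the service ID even though RoCEv2 itself runs over UDP/4791:

modprobe nvme-rdma
nvme discover -t rdma -a 192.168.222.19 -s 4420
nvme connect -t rdma -n linbit:nvme:example -a 192.168.222.19 -s 4420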

I'll have to give this a try! linstor-gateway was definitely a good look at what's possible with drbd-reactor! Manual configs can definitely be a bit of a pain, as there's a lot I don't understand; it might take me another week of messing around in my spare time to get everything working, but I'm excited to get DRBD working. I think a mirrored NVMe SAN using DRBD would be a great storage backend to rebuild my homelab on!


Alright… so I think I may have gotten it working with just the config change via drbd-reactorctl; I also changed portblock to UDP/4791. At least I finally have a Kali VM set up and running on a LINSTOR LVM! I was able to get it to migrate to another node as well! I will need to go back and check whether these changes persist across reboots, but I was able to connect almost all the nodes in my cluster to the LVM. I've still got a few kinks to work out, like making the config persist across reboots and an issue with one of my nodes not connecting, but as a proof of concept for my homelab, this looks like it might work!

I'll need to do some benchmarking as well, and I may need to look into some other method of changing the drbd-reactor config. Maybe when a node gets promoted to primary I can also have it copy over the backup config and run drbd-reactorctl disable and drbd-reactorctl enable.
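Something along these lines is what I have in mind - completely untested, and the file names are made up. As far as I can tell drbd-reactor should also pick up the change on a reload:

cfg=linstor-gateway-nvmeof-example
cp /root/${cfg}.toml.rdma /etc/drbd-reactor.d/${cfg}.toml
systemctl reload drbd-reactor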


I had one more question. Do you know of anyone having issues with TCP tiebreakers? I've been trying to pin this issue down for three days or so while creating my Ansible script to set up my LINSTOR/DRBD again.

I have 5 nodes: pve1, pve2, and pve3, plus pvs1 and pvs2.

pvs1 and pvs2 are diskful. They are both running Proxmox with LINSTOR installed on top of it, because Proxmox is a supported OS (and it was easier).

pvs3 is a diskless Proxmox VM running as a guest on my pve3 node. (I plan to run a separate, very small Ceph cluster between my 3 pve nodes for HA.)

So I realized that I probably wasn't using RDMA as the replication transport between the pvs1 and pvs2 nodes, so I tried to correct that, because with synchronous replication I assumed it would be a bottleneck. I used variations of commands like:

linstor node-connection drbd-peer-options --transport rdma pvs1 pvs2

#linstor resource-connection drbd-peer-options --transport rdma pvs1 pvs2 rg-pve0

I've even tried all the resource-connection variations as well:
#linstor node-connection drbd-peer-options --transport tcp pvs1 pvs3
#linstor node-connection drbd-peer-options --transport tcp pvs2 pvs3
#linstor node-connection drbd-peer-options --transport rdma pvs1 pvs2

but when I reboot one of my diskful nodes, they can connect to each other but not to the diskless (pvs3) node.

I'm still using linstor + linstor-gateway to set up my nvmet target and handle failover to the other mirrored node.

I suspect that changing the transport type of pvs1 and pvs2 might be the culprit, but I have no good way to determine this, other than that when I don't try to change the transport to RDMA, the rebooted nodes seem to be able to connect with pvs3.
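For reference, these are the only checks I've come up with so far (the resource name is a placeholder):

drbdadm status                              # which peers are Connected vs Connecting
drbdsetup show <resource>                   # live connection/net options
grep -i transport /var/lib/linstor.d/*.res  # what LINSTOR wrote for each connection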

This forum post was a good reference for me, but they don't seem to have had my issue.