Poor performance with possibly stupid setup

I'm new to Linstor and trying to make a 2-node Proxmox setup with as much redundancy as I can within my cluster size constraints.

Both nodes have 2x mirrored NVMe drives with LVM on top, which is then used by DRBD.

The nodes have a direct 25Gb link between them for DRBD replication, but the servers also have a 1Gb interface (management and Proxmox quorum) and a 10Gb interface (NAS, internet, and VM migration).

I would like to use the 10Gb interface as a failover in case the direct link goes down for some reason, but it should not usually be used by DRBD. I couldn't find a way to do this properly with DRBD networks. So, I've created a primary/backup bond in Linux and use the bond interface for DRBD. That way Linux handles all failover logic.
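
For reference, the bond is just a standard Linux active-backup bond, something along these lines in /etc/network/interfaces (interface names and the address are placeholders):

# DRBD replication bond: 25Gb link as primary, 10Gb link as backup (placeholder names)
auto bond0
iface bond0 inet static
        address 10.0.1.240/24
        bond-slaves enp25s0 enp10s0
        bond-mode active-backup
        bond-primary enp25s0
        bond-miimon 100
        mtu 9000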

On my NAS (TrueNAS) I have a VM that will be a diskless witness. This VM has a loopback interface with an IP on the DRBD network, but uses static routes to route that traffic over either the 1Gb interface or the 10Gb interface. This way it's also protected from a single link failure.
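
Roughly speaking, the witness VM has something like this (all addresses, interface names and metrics below are placeholders for my actual setup):

# the DRBD-facing address lives on the loopback
ip addr add 10.0.1.250/32 dev lo
# prefer the 10Gb path; keep a higher-metric route via the 1Gb path as backup
ip route add 10.0.1.0/24 via 192.168.10.1 dev eth10g metric 100
ip route add 10.0.1.0/24 via 192.168.1.1 dev eth1g metric 200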

My problem is that when trying to move a VM disk over to the DRBD storage for testing, the performance is horrible. Looking at the network interfaces, it starts out at around 3Gb/s, but soon drops to around 1Gb/s or lower. An iperf3 test gives 24Gb/s (with MTU 9000), so it's not a network problem. I also have the same issue if I remove the witness, so that's not the cause either.

Is it just my whole implementation that's stupid? Which config files or logs would be most useful for debugging this?

Moving a disk from where to where? From local disk to DRBD? From TrueNAS to DRBD? Something else?

You say "looking at the network interfaces". Which network interfaces are you talking about - the 25G DRBD network?

By default, DRBD does rate-limiting of its initial sync (to avoid taking up too much disk and network resource), and it's possible you are hitting that, although I'm not sure it should affect "normal" application I/O such as copying data into the disk.

From the drbd.conf man page:

The c-max-rate parameter limits the maximum bandwidth used by dynamically controlled resyncs. Setting this to zero removes the limitation (since DRBD 9.0.28). It should be set to either the bandwidth available between the DRBD hosts and the machines hosting DRBD-proxy, or to the available disk bandwidth. The default value of c-max-rate is 102400, in units of KiB/s.

102400 KiB/s is 100 MiB/s, which works out to about 839 Mbps by default.

It's possible to tweak this. Experiment with these resource group properties (here targeting 20Gbps):

# linstor rg lp blah
╭──────────────────────────────────────────────────────────╮
┊ Key                                             ┊ Value   ┊
╞══════════════════════════════════════════════════════════╡
┊ DrbdOptions/Net/protocol                        ┊ C       ┊
┊ DrbdOptions/PeerDevice/c-max-rate               ┊ 2400000 ┊
┊ DrbdOptions/PeerDevice/c-plan-ahead             ┊ 0       ┊
┊ DrbdOptions/PeerDevice/resync-rate              ┊ 2400000 ┊
╰──────────────────────────────────────────────────────────╯
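
For example, something like the following should set them on the resource group (the group name blah is just a placeholder; treat this as a sketch and check linstor resource-group set-property --help or the LINSTOR user guide for the exact syntax on your client version):

linstor resource-group set-property blah DrbdOptions/Net/protocol C
linstor resource-group set-property blah DrbdOptions/PeerDevice/c-max-rate 2400000
linstor resource-group set-property blah DrbdOptions/PeerDevice/c-plan-ahead 0
linstor resource-group set-property blah DrbdOptions/PeerDevice/resync-rate 2400000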

Use drbdadm dump to check they are applied to your resources (you might need to reboot):

...
    connection {
        host vm11         address         ipv4 10.0.1.241:7032;
        host vm10         address         ipv4 10.0.1.240:7032;
        net {
            allow-two-primaries yes;
            max-buffers  9000;
            max-epoch-size 10000;
            protocol       C;
        }
        disk {
            c-max-rate   2400000;
            c-plan-ahead   0;
            resync-rate  2400000;
        }
    }
...

Also, you could try creating a new DRBD disk on your VM, waiting for it to sync, and then testing its performance using dd oflag=direct, or a tool like bonnie++, iozone etc. This may be more realistic for real-world performance than the initial import.
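
For example, something like this against the new (and still empty) DRBD device, where the device path is a placeholder and the write test destroys whatever is on it:

# write test, bypassing the page cache
dd if=/dev/zero of=/dev/drbd1000 bs=1M count=4096 oflag=direct status=progress
# read test
dd if=/dev/drbd1000 of=/dev/null bs=1M count=4096 iflag=direct status=progress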

Apart from that, your setup is complicated. If that doesn't solve your issue, I think you might first want to simplify it to bare bones for performance debugging. Log files aren't going to help you here.

Good luck!
