IO delay spikes on various nodes in the Proxmox cluster causing noticeable performance impacts on the running VMs

The Issue: As the title says, IO delay periodically spikes across multiple nodes, sometimes exceeding 10–12%. During Veeam backup operations it climbs to ~50% and stays high until the backups finish. These spikes degrade VM performance even when general CPU load is low.
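For anyone who wants to compare numbers outside the GUI, here is a minimal sketch of how the figure can be sampled on a node. It assumes that the IO delay shown in Proxmox corresponds to the iowait counter on the aggregate `cpu` line of /proc/stat; the function names and the 5-second interval are my own choices for illustration.

```python
import time

def read_cpu_times(stat_text):
    """Return the aggregate CPU counters from /proc/stat content as ints."""
    for line in stat_text.splitlines():
        if line.startswith("cpu "):
            return [int(x) for x in line.split()[1:]]
    raise ValueError("no aggregate 'cpu' line found")

def iowait_percent(before, after):
    """Share of elapsed CPU time spent in iowait between two samples."""
    # Field order: user nice system idle iowait irq softirq steal guest guest_nice
    deltas = [b - a for a, b in zip(before, after)]
    total = sum(deltas[:8])  # guest time is already counted inside user/nice
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total

def sample(interval=5.0, path="/proc/stat"):
    """Take two readings `interval` seconds apart and return iowait in percent."""
    with open(path) as f:
        first = read_cpu_times(f.read())
    time.sleep(interval)
    with open(path) as f:
        second = read_cpu_times(f.read())
    return iowait_percent(first, second)

# Example (assumption: run directly on a Proxmox node):
#   print(f"iowait {sample():.1f}%")
```

Logging this every few seconds on each node, with timestamps, should make it possible to line the spikes up against the backup schedule.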

The Setup: 4-node Proxmox cluster, all nodes connected through a dedicated LINSTOR/DRBD network

  1. 3 × Dell PowerEdge R6615
    • Storage: 3 × 20TB HDD (7200 RPM, SATA, 6Gbps)

    • ZFS: RAIDZ1

    • 384 GB DDR5

    • LINSTOR backend: DISKLESS → replicated via DRBD to the storage nodes

  2. 1 × Dell R340
    • Lower RAM

    • No local storage (fully diskless in the pool)

    • Used mostly as a testing box; no production VMs run here

The VMs run multiple web services, and their load times increase noticeably during these spikes.

VMs are backed up by a Veeam appliance.
When backups run, IO delay across cluster nodes rises sharply and stays high until backups complete.

It seems the backup workload is pushing the cluster beyond its expected IO capacity, but I don't know how to prove this, and it's not my main focus.
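One idea for gathering evidence is to compare per-disk load on the storage nodes before and during a backup window. The sketch below is only a rough outline of that idea: it parses two snapshots of /proc/diskstats and derives utilization, average wait per IO, and IOPS for each device. The function names are hypothetical, and which device names belong to the ZFS pool is something I would still have to look up per node.

```python
def parse_diskstats(text):
    """Extract the counters needed for utilization and latency per device."""
    stats = {}
    for line in text.splitlines():
        f = line.split()
        if len(f) < 14:
            continue  # skip malformed or very old-format lines
        stats[f[2]] = {
            "ios": int(f[3]) + int(f[7]),     # reads + writes completed
            "io_ms": int(f[6]) + int(f[10]),  # ms spent reading + writing
            "busy_ms": int(f[12]),            # ms the device had IO in flight
        }
    return stats

def disk_load(before, after, interval_s):
    """Return {device: (util_percent, await_ms, iops)} between two snapshots."""
    report = {}
    for name, b in after.items():
        a = before.get(name)
        if a is None:
            continue  # device appeared between samples
        ios = b["ios"] - a["ios"]
        util = 100.0 * (b["busy_ms"] - a["busy_ms"]) / (interval_s * 1000.0)
        await_ms = (b["io_ms"] - a["io_ms"]) / ios if ios else 0.0
        report[name] = (round(util, 1), round(await_ms, 1), ios / interval_s)
    return report

# Example (assumption: snapshots read from /proc/diskstats 10 seconds apart):
#   first = parse_diskstats(open("/proc/diskstats").read())
#   ... wait 10 seconds ...
#   second = parse_diskstats(open("/proc/diskstats").read())
#   print(disk_load(first, second, 10))
```

If the pool disks sit near 100% utilization with high wait times while backups run, and are comfortably lower otherwise, that would point at the HDDs rather than the network or DRBD.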

The Questions: I’m hoping to get guidance on the following:

  1. How can I reduce the IO spikes?

  2. Where is the most likely bottleneck?

    • Are the 20TB 7200RPM disks too slow for LINSTOR + ZFS… + Veeam?

    • Is RAIDZ1 too slow for heavy read/write workloads?

    • Could DRBD protocol or sync settings be the limiter?

    • Should I switch to mirrors or add SLOG devices?

    • Are diskless clients adding extra overhead?

  3. What tuning or architectural changes would the community recommend?

Any suggestions for:

  • DRBD protocol tuning

  • LINSTOR storage pool configuration

  • ZFS optimizations

  • Recommended layouts for mixed diskless + diskful nodes

  • Best practices when pairing LINSTOR with Veeam backup loads

DRBD sync is not running during these events—only normal background activity.

I’m happy to provide logs, DRBD configs, LINSTOR resource definitions, ZFS stats, iostat, or anything else helpful.

Thank you in advance for any insight.
I’ve been digging into this for a while and would really value advice from people with more LINSTOR/DRBD tuning experience.