LINSTOR slow writes with ZFS NVMe pool

Hello.

I have a rather simple Kubernetes cluster on bare metal, with 3x control-plane and 3x worker nodes.

The whole cluster has a 2x40G bond, directly connected to an Arista MLAG pair.

Worker nodes are equipped with 2x Xeon Platinum 8173M and 256G of RAM.

For the datastore, I use 2x Samsung 990 Pro in a ZFS mirror pool:

  pool: drbd-volumes
 state: ONLINE
  scan: scrub repaired 0B in 00:00:00 with 0 errors on Sun Dec 14 00:24:01 2025
config:

	NAME          STATE     READ WRITE CKSUM
	drbd-volumes  ONLINE       0     0     0
	  mirror-0    ONLINE       0     0     0
	    nvme0n1   ONLINE       0     0     0
	    nvme1n1   ONLINE       0     0     0

errors: No known data errors

A manually created zvol has really nice write performance.

All tests are with fio.
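For reference, the 4K jobs were along these lines (a reconstructed sketch: block size, sync behavior, and runtime match the results below; ioengine, depth, file name, and size are assumptions):

```ini
; hypothetical fio job reconstructed from the results below
[global]
ioengine=libaio
direct=1
runtime=60
time_based
group_reporting

[4k-sync-write]
rw=randwrite
bs=4k              ; bs=4m for the large-block runs
iodepth=1
fsync=1            ; drop this line (or fsync=0) for the no-sync runs
filename=/mnt/test/fio.dat
size=16G
```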

Raw zvol test (ext4 filesystem, mounted):

4K - fsync=1
  write: IOPS=18.3k, BW=71.6MiB/s (75.1MB/s)(4297MiB/60001msec); 0 zone resets
  write: IOPS=18.4k, BW=71.7MiB/s (75.2MB/s)(4304MiB/60001msec); 0 zone resets
  write: IOPS=18.6k, BW=72.7MiB/s (76.2MB/s)(4363MiB/60001msec); 0 zone resets
  write: IOPS=18.7k, BW=72.9MiB/s (76.4MB/s)(4374MiB/60001msec); 0 zone resets
4K - fsync=0
  write: IOPS=54.8k, BW=214MiB/s (225MB/s)(12.6GiB/60001msec); 0 zone resets
  write: IOPS=54.9k, BW=214MiB/s (225MB/s)(12.6GiB/60001msec); 0 zone resets
  write: IOPS=54.9k, BW=215MiB/s (225MB/s)(12.6GiB/60001msec); 0 zone resets
  write: IOPS=54.7k, BW=214MiB/s (224MB/s)(12.5GiB/60001msec); 0 zone resets
4M - fsync=1
  write: IOPS=815, BW=3261MiB/s (3419MB/s)(191GiB/60001msec); 0 zone resets
  write: IOPS=816, BW=3266MiB/s (3424MB/s)(191GiB/60001msec); 0 zone resets
4M - fsync=0
  write: IOPS=1039, BW=4157MiB/s (4358MB/s)(244GiB/60002msec); 0 zone resets
  write: IOPS=1021, BW=4085MiB/s (4284MB/s)(239GiB/60002msec); 0 zone resets

LINSTOR with no replica is significantly worse (this fio test is from a KubeVirt VM with a PVC on LINSTOR), especially at 4K, which I find crucial for my workload:

4K - fsync=1
write: IOPS=748, BW=2996KiB/s (3068kB/s)(176MiB/60002msec); 0 zone resets
write: IOPS=749, BW=2996KiB/s (3068kB/s)(176MiB/60003msec); 0 zone resets
write: IOPS=741, BW=2965KiB/s (3036kB/s)(174MiB/60001msec); 0 zone resets
write: IOPS=747, BW=2988KiB/s (3060kB/s)(175MiB/60001msec); 0 zone resets
4K - fsync=0
write: IOPS=18.3k, BW=71.6MiB/s (75.1MB/s)(4298MiB/60001msec); 0 zone resets
write: IOPS=20.2k, BW=79.1MiB/s (82.9MB/s)(4746MiB/60001msec); 0 zone resets
write: IOPS=19.4k, BW=75.8MiB/s (79.5MB/s)(4547MiB/60001msec); 0 zone resets
write: IOPS=21.2k, BW=82.7MiB/s (86.7MB/s)(4962MiB/60001msec); 0 zone resets
4M - fsync=1
write: IOPS=552, BW=2209MiB/s (2316MB/s)(129GiB/60002msec); 0 zone resets
write: IOPS=554, BW=2219MiB/s (2327MB/s)(130GiB/60003msec); 0 zone resets
4M - fsync=0
write: IOPS=750, BW=3000MiB/s (3146MB/s)(176GiB/60004msec); 0 zone resets
write: IOPS=755, BW=3022MiB/s (3169MB/s)(177GiB/60004msec); 0 zone resets

With replicas it gets even worse (which makes sense, given the network overhead).

3x replica:

4K - fsync=1
write: IOPS=571, BW=2285KiB/s (2340kB/s)(134MiB/60002msec); 0 zone resets
write: IOPS=572, BW=2288KiB/s (2343kB/s)(134MiB/60002msec); 0 zone resets
write: IOPS=571, BW=2287KiB/s (2342kB/s)(134MiB/60002msec); 0 zone resets
write: IOPS=572, BW=2291KiB/s (2345kB/s)(134MiB/60002msec); 0 zone resets
4K - fsync=0
write: IOPS=12.3k, BW=48.0MiB/s (50.3MB/s)(2879MiB/60001msec); 0 zone resets
write: IOPS=12.2k, BW=47.8MiB/s (50.1MB/s)(2868MiB/60001msec); 0 zone resets
write: IOPS=12.3k, BW=47.9MiB/s (50.3MB/s)(2876MiB/60001msec); 0 zone resets
write: IOPS=12.1k, BW=47.4MiB/s (49.7MB/s)(2843MiB/60001msec); 0 zone resets
4M - fsync=1
write: IOPS=62, BW=249MiB/s (262MB/s)(14.6GiB/60001msec); 0 zone resets
write: IOPS=62, BW=250MiB/s (262MB/s)(14.6GiB/60006msec); 0 zone resets
4M - fsync=0
write: IOPS=108, BW=435MiB/s (456MB/s)(25.5GiB/60056msec); 0 zone resets
write: IOPS=108, BW=435MiB/s (456MB/s)(25.5GiB/60053msec); 0 zone resets

I understand that fsync doesn't mix well with network overhead, so I don't care that much about those numbers, but even with fsync=0 it's still pretty bad compared to the raw ZFS zvol.
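To put the 4K fsync=1 numbers in perspective, a quick back-of-the-envelope conversion of IOPS into per-write latency (assuming the runs were effectively queue depth 1, which is my reading of the job setup):

```python
# Back-of-the-envelope: at queue depth 1, average latency per write ≈ 1 / IOPS.
# IOPS figures taken from the 4K fsync=1 runs above.

def latency_us(iops: float) -> float:
    """Average time per write in microseconds, assuming iodepth=1."""
    return 1_000_000 / iops

raw_zvol = latency_us(18_300)  # raw zvol: ~55 µs per fsynced 4K write
linstor1 = latency_us(748)     # LINSTOR, no replica, inside VM
linstor3 = latency_us(571)     # LINSTOR, 3x replica, inside VM

print(f"raw zvol:     {raw_zvol:7.0f} µs")
print(f"linstor, 1x:  {linstor1:7.0f} µs  (+{linstor1 - raw_zvol:.0f} µs)")
print(f"linstor, 3x:  {linstor3:7.0f} µs  (+{linstor3 - raw_zvol:.0f} µs)")
```

So every fsynced 4K write picks up well over a millisecond of extra latency somewhere on the LINSTOR path, even before replication enters the picture.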

Here's my StorageClass, where some tuning options are already in place:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: linstor-nvme-replicated
provisioner: linstor.csi.linbit.com
parameters:
  csi.storage.k8s.io/fstype: "ext4"
  linstor.csi.linbit.com/storagePool: "linstor-nvme"
  linstor.csi.linbit.com/allowRemoteVolumeAccess: "false"
  linstor.csi.linbit.com/autoPlace: "3"
  DrbdOptions/Disk/disk-flushes: "no"
  DrbdOptions/Disk/md-flushes: "no"
  DrbdOptions/Net/max-buffers: "10000"
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true

I tried all 3 protocols; the tests I've posted are with protocol B.

Protocol A does not make much of a difference.

I also tried enlarging the TCP buffers on the worker nodes; it did not help.
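The buffer tuning was sysctl-based, something along these lines (illustrative values, not the exact ones I used):

```ini
# /etc/sysctl.d/99-net-buffers.conf — illustrative values only
net.core.rmem_max = 67108864
net.core.wmem_max = 67108864
net.ipv4.tcp_rmem = 4096 87380 67108864
net.ipv4.tcp_wmem = 4096 65536 67108864
```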

Am I missing something, or is it because of the consumer-grade NVMes?

I created a DRBD volume, 3x replicated, and attached it directly to a worker node as /dev/drbd1000, to make sure that QEMU is not the bottleneck:

4K - fsync=1
write: IOPS=8072, BW=31.5MiB/s (33.1MB/s)(1024MiB/32473msec); 0 zone resets
4K - fsync=0
write: IOPS=8651, BW=33.8MiB/s (35.4MB/s)(1024MiB/30300msec); 0 zone resets
4M - fsync=1
write: IOPS=121, BW=485MiB/s (508MB/s)(1024MiB/2113msec); 0 zone resets
4M - fsync=0
write: IOPS=127, BW=508MiB/s (533MB/s)(1024MiB/2014msec); 0 zone resets

looks like it ain’t

Another question: could my old ConnectX-3 NICs be the bottleneck here?

What do your fio test parameters look like?

What do your PVC manifests look like?

Any chance to put DRBD metadata on some other device?