Hello.
I have a rather simple Kubernetes cluster on bare metal, with 3x control-plane and 3x worker nodes.
The whole cluster has 2x40G bonds, directly connected to an Arista MLAG pair.
Worker nodes are equipped with 2x Xeon Platinum 8173M and 256 GB of RAM.
For the datastore, I use 2x Samsung 990 Pro in a ZFS mirror pool:
  pool: drbd-volumes
 state: ONLINE
  scan: scrub repaired 0B in 00:00:00 with 0 errors on Sun Dec 14 00:24:01 2025
config:

        NAME            STATE     READ WRITE CKSUM
        drbd-volumes    ONLINE       0     0     0
          mirror-0      ONLINE       0     0     0
            nvme0n1     ONLINE       0     0     0
            nvme1n1     ONLINE       0     0     0

errors: No known data errors
A manually created zvol has really nice write performance.
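For reference, the zvol for the raw test was created by hand, roughly along these lines (the size and volblocksize here are only illustrative):

zfs create -V 100G -o volblocksize=16k drbd-volumes/fio-test   # example size/volblocksize
mkfs.ext4 /dev/zvol/drbd-volumes/fio-test
mount /dev/zvol/drbd-volumes/fio-test /mnt/fio-test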
All tests were done with fio.
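The fio invocation was along these lines; block size and the fsync flag were varied per run, while the numjobs/iodepth shown here are only illustrative:

fio --name=writetest \
    --filename=/mnt/fio-test/fio.bin \
    --rw=randwrite --bs=4k --size=4G \
    --ioengine=libaio --direct=1 \
    --iodepth=1 --numjobs=1 \
    --fsync=1 \
    --time_based --runtime=60 \
    --group_reporting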
Raw zvol test (ext4 FS, mounted):
4K - fsync=1
write: IOPS=18.3k, BW=71.6MiB/s (75.1MB/s)(4297MiB/60001msec); 0 zone resets
write: IOPS=18.4k, BW=71.7MiB/s (75.2MB/s)(4304MiB/60001msec); 0 zone resets
write: IOPS=18.6k, BW=72.7MiB/s (76.2MB/s)(4363MiB/60001msec); 0 zone resets
write: IOPS=18.7k, BW=72.9MiB/s (76.4MB/s)(4374MiB/60001msec); 0 zone resets
4K - fsync=0
write: IOPS=54.8k, BW=214MiB/s (225MB/s)(12.6GiB/60001msec); 0 zone resets
write: IOPS=54.9k, BW=214MiB/s (225MB/s)(12.6GiB/60001msec); 0 zone resets
write: IOPS=54.9k, BW=215MiB/s (225MB/s)(12.6GiB/60001msec); 0 zone resets
write: IOPS=54.7k, BW=214MiB/s (224MB/s)(12.5GiB/60001msec); 0 zone resets
4M - fsync=1
write: IOPS=815, BW=3261MiB/s (3419MB/s)(191GiB/60001msec); 0 zone resets
write: IOPS=816, BW=3266MiB/s (3424MB/s)(191GiB/60001msec); 0 zone resets
4M - fsync=0
write: IOPS=1039, BW=4157MiB/s (4358MB/s)(244GiB/60002msec); 0 zone resets
write: IOPS=1021, BW=4085MiB/s (4284MB/s)(239GiB/60002msec); 0 zone resets
LINSTOR with no replica is significantly worse (this fio test is from a KubeVirt VM with its PVC on LINSTOR), especially at 4K, which I find crucial for my workload.
4K - fsync=1
write: IOPS=748, BW=2996KiB/s (3068kB/s)(176MiB/60002msec); 0 zone resets
write: IOPS=749, BW=2996KiB/s (3068kB/s)(176MiB/60003msec); 0 zone resets
write: IOPS=741, BW=2965KiB/s (3036kB/s)(174MiB/60001msec); 0 zone resets
write: IOPS=747, BW=2988KiB/s (3060kB/s)(175MiB/60001msec); 0 zone resets
4K - fsync=0
write: IOPS=18.3k, BW=71.6MiB/s (75.1MB/s)(4298MiB/60001msec); 0 zone resets
write: IOPS=20.2k, BW=79.1MiB/s (82.9MB/s)(4746MiB/60001msec); 0 zone resets
write: IOPS=19.4k, BW=75.8MiB/s (79.5MB/s)(4547MiB/60001msec); 0 zone resets
write: IOPS=21.2k, BW=82.7MiB/s (86.7MB/s)(4962MiB/60001msec); 0 zone resets
4M - fsync=1
write: IOPS=552, BW=2209MiB/s (2316MB/s)(129GiB/60002msec); 0 zone resets
write: IOPS=554, BW=2219MiB/s (2327MB/s)(130GiB/60003msec); 0 zone resets
4M - fsync=0
write: IOPS=750, BW=3000MiB/s (3146MB/s)(176GiB/60004msec); 0 zone resets
write: IOPS=755, BW=3022MiB/s (3169MB/s)(177GiB/60004msec); 0 zone resets
With replicas it gets even worse (logically, due to the network overhead).
3x replica
4K - fsync=1
write: IOPS=571, BW=2285KiB/s (2340kB/s)(134MiB/60002msec); 0 zone resets
write: IOPS=572, BW=2288KiB/s (2343kB/s)(134MiB/60002msec); 0 zone resets
write: IOPS=571, BW=2287KiB/s (2342kB/s)(134MiB/60002msec); 0 zone resets
write: IOPS=572, BW=2291KiB/s (2345kB/s)(134MiB/60002msec); 0 zone resets
4K - fsync=0
write: IOPS=12.3k, BW=48.0MiB/s (50.3MB/s)(2879MiB/60001msec); 0 zone resets
write: IOPS=12.2k, BW=47.8MiB/s (50.1MB/s)(2868MiB/60001msec); 0 zone resets
write: IOPS=12.3k, BW=47.9MiB/s (50.3MB/s)(2876MiB/60001msec); 0 zone resets
write: IOPS=12.1k, BW=47.4MiB/s (49.7MB/s)(2843MiB/60001msec); 0 zone resets
4M - fsync=1
write: IOPS=62, BW=249MiB/s (262MB/s)(14.6GiB/60001msec); 0 zone resets
write: IOPS=62, BW=250MiB/s (262MB/s)(14.6GiB/60006msec); 0 zone resets
4M - fsync=0
write: IOPS=108, BW=435MiB/s (456MB/s)(25.5GiB/60056msec); 0 zone resets
write: IOPS=108, BW=435MiB/s (456MB/s)(25.5GiB/60053msec); 0 zone resets
I understand that fsync doesn't play well with the network overhead, so I don't care that much about those numbers, but even with fsync=0 it's still pretty bad compared to the raw ZFS zvol.
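Just to put the 4K fsync=1 numbers in perspective (assuming these runs are effectively queue depth 1, so per-write latency ≈ 1/IOPS):

raw zvol:            1 / 18300 ≈ 0.055 ms per fsynced 4K write
LINSTOR, no replica: 1 / 748   ≈ 1.34  ms per fsynced 4K write
LINSTOR, 3 replicas: 1 / 571   ≈ 1.75  ms per fsynced 4K write

And with fsync=0, 4K drops from ~55k IOPS on the raw zvol to ~20k with a single replica and ~12k with three.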
Here's my StorageClass, where some tuning options are already in place:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: linstor-nvme-replicated
provisioner: linstor.csi.linbit.com
parameters:
  csi.storage.k8s.io/fstype: "ext4"
  linstor.csi.linbit.com/storagePool: "linstor-nvme"
  linstor.csi.linbit.com/allowRemoteVolumeAccess: "false"
  linstor.csi.linbit.com/autoPlace: "3"
  DrbdOptions/Disk/disk-flushes: "no"
  DrbdOptions/Disk/md-flushes: "no"
  DrbdOptions/Net/max-buffers: "10000"
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
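For completeness, this is roughly how the effective DRBD settings can be inspected on a node to make sure the options actually land on the resources (the pvc-<uuid> resource name is a placeholder):

linstor resource list          # find the LINSTOR resource backing the PVC
drbdsetup show pvc-<uuid>      # dump the effective DRBD config (flush settings, max-buffers, protocol)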
I tried all 3 DRBD protocols; the tests I've posted are with protocol B.
Protocol A does not make much of a difference.
I also tried enlarging the TCP buffers on the worker nodes, but it did not help.
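The TCP buffer change was along these lines (the exact values here are illustrative):

sysctl -w net.core.rmem_max=67108864
sysctl -w net.core.wmem_max=67108864
sysctl -w net.ipv4.tcp_rmem="4096 87380 67108864"
sysctl -w net.ipv4.tcp_wmem="4096 65536 67108864"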
Am I missing something, or is it down to the consumer-grade NVMe drives?