Question about 2.15. Disk Error Handling Strategies (from the User Guide)

Sorry about the long gap in communication. I really appreciate the input, but I got tied up. The problem I’m encountering with ZFS below DRBD relates to storage consumption. When I set up the system the canonical way (ZFS below DRBD), disk usage is much higher. For example, in the following output, zpool0/site271_ds is just a dataset (32K recordsize), whereas zpool0/site271_zvol_00000 is a DRBD resource on a zvol (default volblocksize). Using zvols under DRBD consumes far more storage with the exact same data in both.

[root@store11a ~]# zfs list
NAME                       USED  AVAIL  REFER  MOUNTPOINT
zpool0                     245G   111T   192K  /zpool0
zpool0/site271_ds         87.9G   111T  87.9G  /zpool0/site271_ds
zpool0/site271_zvol_00000  157G   111T   157G  -
[root@store11a ~]#
[root@store11a ~]# zfs get all zpool0/site271_ds|grep compress
zpool0/site271_ds  compressratio     1.40x  -
zpool0/site271_ds  compression       zstd   local
zpool0/site271_ds  refcompressratio  1.40x  -
[root@store11a ~]#
[root@store11a ~]# zfs get all zpool0/site271_zvol_00000|grep compress
zpool0/site271_zvol_00000  compressratio     1.17x  -
zpool0/site271_zvol_00000  compression       zstd   local
zpool0/site271_zvol_00000  refcompressratio  1.17x  -
[root@store11a ~]#
[root@store11a ~]# df -h
Filesystem           Size  Used Avail Use% Mounted on
/dev/mapper/rl-root  200G   15G  186G   8% /
/dev/sda2            960M  510M  451M  54% /boot
/dev/sda1            599M  7.1M  592M   2% /boot/efi
zpool0               112T  256K  112T   1% /zpool0
zpool0/site271_ds    112T   88G  112T   1% /zpool0/site271_ds
/dev/drbd1000        130G  105G   26G  81% /fs/site271_zvol
[root@store11a ~]#

No worries, have you tried using a zfsthin LINSTOR storage pool?

linstor sp create zfsthin ...

This way you can take advantage of ZFS sparse (thin) volumes.
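For example, something like this (a sketch assuming your node is store11a and the backing pool is zpool0, both taken from your output; the storage-pool name is a placeholder):

```
# Register zpool0 as a thin ZFS storage pool on node 'store11a'.
linstor storage-pool create zfsthin store11a pool_zfsthin zpool0
# Resources placed in this pool are backed by sparse (zfs create -s) zvols,
# so pool space is only consumed as data is actually written.
```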

I would expect this to be the case when comparing a dataset (in simplified terms, a filesystem) to a zvol (block device). A more appropriate comparison would be something like a ZFS dataset vs EXT4 filesystem used space.

Yes, understood. I realize I wasn’t very clear in my previous message. What I’m trying to highlight is that the zvol is inexplicably large. The DRBD resource was spawned as a 130GB volume, which somehow created a 157GB zvol. If I need to create 400 zvols, and every zvol is going to consume 20% more storage than the requested size of the DRBD disk, we’ll run out of storage quickly. Datasets do not require size specification. They consume the pool storage as needed. Plus, they benefit from compression, which zvols apparently do not. So, not only are they easier to manage, but they are much more efficient in storage utilization. I really want to build the stack the canonical way, with ZFS below DRBD, but these findings are forcing me to try it a different way. If I create one DRBD disk per physical disk, and build my raidz on top of the DRBD disks, then everything gets a lot easier and more efficient.

Any idea why spawning a 130GB DRBD resource creates a 157GB zvol? (Actually, it started out at 235GB because ZFS gave it a default refreservation of roughly 180% of the volume size. I got it down to 157GB by setting refreservation equal to the zvol size.)

Regardless of DRBD, have you tried different volblocksize parameters when creating zvols, such as 8K, 16K, 32K, or 64K?

Not yet. That will be my next test. The following article has a nice breakdown of storage consumption based on array width and volblocksize.

That said, it’s only half the battle. Even if I can get the zvol down to somewhere near the size of the DRBD disk, it still does not benefit much from compression.

It seems like LINSTOR is creating bigger zvols (157G) than if you just create them from the command line (130G).

[root@store11a ~]# zfs create -o volblocksize=4K -o compression=zstd -o refreservation=130G -V 130G zpool0/site271_zvol_4k
[root@store11a ~]# zfs create -o volblocksize=8K -o compression=zstd -o refreservation=130G -V 130G zpool0/site271_zvol_8k
[root@store11a ~]# zfs create -o volblocksize=16K -o compression=zstd -o refreservation=130G -V 130G zpool0/site271_zvol_16k
[root@store11a ~]# zfs create -o volblocksize=32K -o compression=zstd -o refreservation=130G -V 130G zpool0/site271_zvol_32k
[root@store11a ~]# zfs create -o volblocksize=64K -o compression=zstd -o refreservation=130G -V 130G zpool0/site271_zvol_64k

[root@store11a ~]# zfs list
NAME                       USED  AVAIL  REFER  MOUNTPOINT
zpool0                     895G   111T   192K  /zpool0
zpool0/site271_ds         87.9G   111T  87.9G  /zpool0/site271_ds
zpool0/site271_zvol_00000  157G   111T   157G  -
zpool0/site271_zvol_16k    130G   111T  99.5K  -
zpool0/site271_zvol_32k    130G   111T  99.5K  -
zpool0/site271_zvol_4k     130G   111T  99.5K  -
zpool0/site271_zvol_64k    130G   111T  99.5K  -
zpool0/site271_zvol_8k     130G   111T  99.5K  -

[root@store11a ~]# zfs get all zpool0/site271_zvol_00000 | egrep "volblock|refres"
zpool0/site271_zvol_00000  volblocksize          8K    default
zpool0/site271_zvol_00000  refreservation        130G  local
zpool0/site271_zvol_00000  usedbyrefreservation  0B    -

Also, even though the zvol is 157GB in size, df shows the size requested in LINSTOR (130GB).

[root@store11a ~]# df -h | egrep "File|drbd"
Filesystem     Size  Used Avail Use% Mounted on
/dev/drbd1000  130G  105G   26G  81% /fs/site271_zvol

Just curious, what is the volblocksize currently in use for DRBD volumes?

zfs get volblocksize zpool0/site271_zvol_00000

I do expect that to be the case here, that is, assuming the output of zfs get volsize zpool0/site271_zvol_00000 matches the requested volume size in LINSTOR/DRBD.

Can you post the output of zfs get all zpool0/site271_zvol_00000 just so we have all the information?
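If the full output is long, the space-accounting properties are the most telling part. Something like this (standard ZFS properties; the dataset name is taken from your earlier listing):

```
zfs get used,logicalused,referenced,logicalreferenced,volblocksize,refreservation,compressratio zpool0/site271_zvol_00000
# logicalused is the data before compression and RAID-Z padding; if used is
# much larger than logicalused divided by compressratio, padding overhead is
# the usual suspect.
```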

Sorry, I destroyed it and started over fresh. With more testing, I confirmed that the 157GB size of the zvol is not LINSTOR’s fault. I found the command that LINSTOR was issuing in the zfs history, and it is perfectly normal. I was also able to reproduce the condition outside of DRBD using only zfs commands. I don’t know what’s going on yet, but I know it’s not a LINSTOR/DRBD issue.

That said, the zvols seem to be taking up 157GB of storage whether I create them sparse or not.

[root@store11a zpool0]# zfs create -s -o volblocksize=8K -o compression=zstd -V 130G zpool0/zvol_8k
[root@store11a zpool0]# zfs create -o volblocksize=8K -o compression=zstd -V 130G zpool0/zvol_8k_thick

I created XFS filesystems on both and copied 100GB of data to both.

Then…

[root@store11a zpool0]# zfs list
NAME                  USED  AVAIL  REFER  MOUNTPOINT
zpool0                402G   111T   263K  /zpool0
zpool0/ds_32k        87.9G   111T  87.9G  /zpool0/ds_32k
zpool0/zvol_8k        157G   111T   157G  -
zpool0/zvol_8k_thick  157G   111T   157G  -
[root@store11a zpool0]#
[root@store11a zpool0]# df -h
Filesystem           Size  Used Avail Use% Mounted on
/dev/mapper/rl-root  200G   15G  186G   8% /
/dev/sda2            960M  510M  451M  54% /boot
/dev/sda1            599M  7.1M  592M   2% /boot/efi
zpool0               112T  384K  112T   1% /zpool0
zpool0/ds_32k        112T   88G  112T   1% /zpool0/ds_32k
/dev/zd0             130G  105G   26G  81% /zpool0/zvol_8k
/dev/zd16            130G  105G   26G  81% /zpool0/zvol_8k_thick

I don’t understand the storage consumption yet. Still investigating.

At this point I’m in a big detour and outside the scope of this forum. I’ll advise when I circle back around to asking more in-scope questions.

Whoops, I missed this section of output in my last reply:


Sounds good. Before your last replies came in, I was just about to ask what happens when you copy the filesystem contents between zvols with differing volblocksize values.

It appears you’re hitting some underlying issues related to padding with RAID-Z.
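The padding effect can be sketched with some quick arithmetic. This is a generic illustration, not a measurement of your pool: the ashift, width, and parity values below are assumptions. RAID-Z rounds every allocation up to a multiple of parity + 1 sectors, which penalizes small volblocksize values the most.

```shell
# Hypothetical RAID-Z allocation math. Assumes ashift=12 (4K sectors),
# a 9-wide raidz2 vdev, and an 8K volblocksize; adjust for your geometry.
sector=4096
width=9
parity=2
volblock=8192

data_sectors=$(( volblock / sector ))                        # 8K / 4K = 2
data_width=$(( width - parity ))                             # data columns per stripe
stripes=$(( (data_sectors + data_width - 1) / data_width ))  # ceil(2 / 7) = 1
parity_sectors=$(( stripes * parity ))                       # 1 stripe * 2 parity = 2
subtotal=$(( data_sectors + parity_sectors ))                # 4 sectors
# RAID-Z pads each allocation up to a multiple of (parity + 1) sectors.
alloc=$(( (subtotal + parity) / (parity + 1) * (parity + 1) ))  # 4 -> 6
echo "8K logical occupies $(( alloc * sector / 1024 ))K on disk"
# -> 8K logical occupies 24K on disk
```

With the same assumed geometry, a 64K block allocates 96K (1.5x the logical size) instead of 3x, which is why larger volblocksize values usually shrink the reported usage on RAID-Z.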

Here are some other interesting threads that may help:

After considerable soul-searching, I ended up building the stack using my original plan. There are nine NVMe drives in each server, with one DRBD disk per drive. I then built a raidz on top of the nine DRBD disks, and I’m creating individual ZFS datasets for each customer. I realized the main reason I wanted ZFS was for its compression, and that is working great. Using datasets also makes daily administration much easier and more straightforward: no more worrying about the huge difference in storage requirements from customer to customer, no resizing DRBD disks, LVM volumes, or XFS filesystems, and overall storage utilization is fantastically better than it would have been using zvols under DRBD disks. The stack is also simpler because it has fewer layers. I realize I’m giving up some of ZFS’s sexier capabilities, but the tradeoffs are worth it to me.
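For reference, a minimal sketch of that layout (device paths and dataset names are illustrative, not my exact configuration):

```
# One DRBD device per NVMe drive, raidz across the nine DRBD devices.
zpool create -o ashift=12 zpool0 raidz \
    /dev/drbd1000 /dev/drbd1001 /dev/drbd1002 \
    /dev/drbd1003 /dev/drbd1004 /dev/drbd1005 \
    /dev/drbd1006 /dev/drbd1007 /dev/drbd1008
# One dataset per customer, each compressing and growing as needed.
zfs create -o compression=zstd zpool0/customer_a
```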

The next question is: Pacemaker or drbd-reactor? I have both in my environment, and I much prefer using drbd-reactor on my database servers because customer-facing services are atomic: there’s one drbd-reactor resource with one DRBD disk, one filesystem, and one MySQL instance per customer. DRBD disks and their dependent resources can be individually moved between cluster nodes with ease, and drbd-reactor is overall much easier to administer than Pacemaker. That said, the situation now is different. I have nine DRBD disks that are all part of one raidz. Any action that drbd-reactor triggers would have to be coordinated with all the other DRBD disks below the zpool. If any drbd-reactor resource wants to trigger a failover, it might have to gracefully fail over ALL the DRBD disks, which might involve stopping application services, unmounting all ZFS datasets, exporting and importing pools, etc. Any thoughts on whether all this is feasible with drbd-reactor?

DRBD Reactor does not support complex collocation or ordering constraints between services running on different DRBD resources. You cannot tell DRBD Reactor that all ZFS pools must be successfully exported before attempting to demote any of the DRBD devices. For that you will have to use Pacemaker.
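A rough sketch of what those constraints could look like with pcs (resource IDs and the DRBD resource name are placeholders, only one of the nine DRBD resources is shown, and exact pcs syntax varies by version):

```
# Promotable DRBD resource (repeat for each of the nine DRBD devices).
pcs resource create drbd_d0 ocf:linbit:drbd drbd_resource=d0 promotable
# ZFS pool import/export managed by the ocf:heartbeat:ZFS agent.
pcs resource create zpool0 ocf:heartbeat:ZFS pool=zpool0
# Import the pool only after DRBD is promoted; Pacemaker applies the
# reverse order on stop, so the pool is exported before any demote.
pcs constraint order promote drbd_d0-clone then start zpool0
pcs constraint colocation add zpool0 with master drbd_d0-clone
```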

That was the conclusion I reached. While it’s probably feasible to use drbd-reactor in conjunction with scripts to accomplish some type of coordinated failover, it’s more work than it’s worth.