Be Honest, Is My Storage Stack Stupid?

My server has 8 x 15TB NVMe drives. Here’s my plan:

Create a RAIDZ. Total usable space in the array will be approx. 100TB.

On top of that, create 10 x 10T drbd disks.

Create an LVM volume group from all the drbd disks.

Create a few hundred LVs with separate XFS filesystems, one for each customer.
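
Roughly, the layering I have in mind looks like this (device names, volume names, and sizes below are just placeholders):

zpool create zpool0 raidz1 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 \
    /dev/nvme4n1 /dev/nvme5n1 /dev/nvme6n1 /dev/nvme7n1

# one zvol per planned DRBD disk
for i in $(seq 0 9); do zfs create -V 10T zpool0/drbd_backing_$i; done

# (DRBD resources drbd0..drbd9 would then be configured on top of those zvols)

# LVM volume group spanning all the DRBD devices, carved into per-customer LVs
pvcreate /dev/drbd[0-9]
vgcreate vg_customers /dev/drbd[0-9]
lvcreate -L 100G -n customer001 vg_customers
mkfs.xfs /dev/vg_customers/customer001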

The goal is to balance speed, manageability, and fault-tolerance.

Having all the LVs in a single volume group simplifies resource deployment.

Having 10 x DRBD disks speeds up replication between the nodes a lot, and reduces the impact of a drbd disk failure.

Having separate filesystems for each customer allows them to be grown on a per-customer basis and minimizes the impact of a corrupted filesystem.

I’m a little concerned about the complexity of having LVM on top of drbd, on top of zfs. In a failover scenario, all those filesystems would have to be unmounted and the LVs deactivated before the failover could occur.

Also, I’m already confused about storage allocation. I created a zfs pool which does not have encryption or compression:

[root@store11a ~]# zfs list
NAME         USED  AVAIL  REFER  MOUNTPOINT
zpool0      2.11M  93.9T   162K  /zpool0
zpool0/fs    162K  93.9T   162K  /zpool0

As you can see, it shows 93.9T available.

I then created a 10T linstor resource as follows.

[root@store11a ~]# lst rg spawn-resources rgroup0 drbd0 10T

But now, the zfs pool shows 17TB used.

[root@store11a ~]# zfs list
NAME                 USED  AVAIL  REFER  MOUNTPOINT
zpool0              17.2T  76.8T   162K  /zpool0
zpool0/drbd0_00000  17.2T  93.9T  3.69G  -
zpool0/fs            162K  76.8T   162K  /zpool0

What’s going on?

Hilarious question! While your storage stack would indeed “function”, there are some things to consider:

On top of that (ZFS RAIDZ), create 10 x 10T drbd disks.

This layer (DRBD resources backed by ZFS volumes) is a standard approach when using ZFS. There is nothing “stupid” about this :wink:

Create an LVM volume group from all the drbd disks.

This is where it could get problematic. What happens if one of your DRBD resources fails? It could bring down multiple logical volumes and affect the entire LVM volume group.

Also, you can already create ZFS volumes from your RAIDZ pool. Why not simply create “a few hundred” zvols for use with DRBD instead of layering LVM on top of ZFS?

Instead of having “10 x DRBD disks (resources)” underneath LVM, you would have ~200-300 DRBD resources, each backed by its own ZFS volume, with an XFS filesystem sitting on top of each DRBD device. While this might seem like a large resource count, it is fairly typical for a LINSTOR cluster, which can easily manage large numbers of DRBD resources.
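
For example, a per-customer layout could be spawned from a LINSTOR resource group along these lines (pool, group, and resource names below are hypothetical, and the sizes are only for illustration):

linstor storage-pool create zfs store11a pool_nvme zpool0
linstor resource-group create rg_customers --storage-pool pool_nvme --place-count 2
linstor volume-group create rg_customers

# one ZFS-backed DRBD resource per customer
for i in $(seq -w 1 300); do
    linstor resource-group spawn-resources rg_customers customer_$i 100G
done

# then create an XFS filesystem on each new DRBD device from whichever node is primary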

Having 10 x DRBD disks speeds up replication between the nodes a lot, and reduces the impact of a drbd disk failure.

Not necessarily when they all sit underneath a single LVM volume group. Again, see above. It’s probably best to eliminate LVM from your proposed storage stack.

Having separate filesystems for each customer allows them to be grown on a per-customer basis and minimizes the impact of a corrupted filesystem.

Agreed, this can be easily done by using ZFS volumes with DRBD and XFS, no need for LVM as an extra layer.
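
For instance, growing a single customer’s volume later might look roughly like this (the resource name and mount point are hypothetical):

# grow the ZVOL and the DRBD device through LINSTOR
linstor volume-definition set-size customer_042 0 200G

# then grow XFS on whichever node has the resource mounted
xfs_growfs /srv/customers/customer_042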

I’m a little concerned about the complexity of having LVM on top of drbd, on top of zfs. In a failover scenario, all those filesystems would have to be unmounted and the LVs deactivated before the failover could occur.

Yep, this is another downside to your proposed storage stack. Again, by cutting out LVM you can easily fail over DRBD resources, or even run a load-balanced configuration between multiple nodes.
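
For a rough idea, failing over a single customer’s resource by hand (resource name, device minor, and paths are hypothetical, and this ignores whatever cluster manager ends up driving it) could be as simple as:

# on the node giving up the resource
umount /srv/customers/customer_042
drbdadm secondary customer_042

# on the node taking over
drbdadm primary customer_042
mount /dev/drbd1042 /srv/customers/customer_042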

Also, I’m already confused about storage allocation. I created a zfs pool which does not have encryption or compression: […] What’s going on?

RAIDZ implies parity for surviving drive failures in the ZFS pool, so any allocated blocks will require additional storage allocated for parity. On top of that, a non-sparse ZVOL gets a refreservation sized for its worst-case on-disk footprint, and with a small volblocksize on RAIDZ that worst case (parity plus padding) can be far larger than the nominal parity ratio, which is most likely why a 10T ZVOL shows up as 17.2T used while only referencing a few GB.
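
If you want to see exactly where that space is going, something like the following (dataset names taken from your zfs list output) should break it down; the USEDREFRESERV column is reserved space rather than data that has actually been written:

zfs list -r -o space zpool0
zfs get volsize,volblocksize,used,referenced,refreservation zpool0/drbd0_00000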

Hope that helps give you some guidance.

Hi Ryan,

We reached the same conclusion, so we’ve been experimenting with the approach you described using a few hundred zvols and a few hundred DRBD disks to go with them, and a few hundred filesystems on those disks. Here are the two problems we face:

Just wanted to say that this blog post will explain some of the ZFS accounting with RAIDZ and ZVOLs better than I can. You might take a look at the volblocksize settings on your DRBD backing disks (ZVOLs) to see if that’s chewing up some unnecessary storage space. 16K should be the default volblocksize value in ZFS 2.2 and above.
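
For example, something like this would show what an existing backing ZVOL was created with (dataset name taken from your earlier output). Since volblocksize is fixed at creation time, a different value has to be supplied when the ZVOL is created, shown here with a hand-created ZVOL and an arbitrary 64K value:

zfs get volblocksize zpool0/drbd0_00000

# volblocksize can only be set at creation time
zfs create -V 10T -o volblocksize=64K zpool0/test_backing_vol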

Here are the two problems we face:

Just a heads up, looks like a missed copy and paste on your end?

I’m also not in our ticketing system all that much these days, so you may have already had these questions answered elsewhere if they were part of a ticket.

Here was the original email. I said “two problems” but then added a third.

Hi Ryan,

We reached the same conclusion, so we’ve been experimenting with the approach you described using a few hundred zvols and a few hundred DRBD disks to go with them, and a few hundred filesystems on those disks. Here are the two problems we face:

1. The cluster is for an FTP service. Since there is only one FTP service, everything must always stay together. Wherever the FTP service is running, all DRBD disks must be primary on that node. That means, in the case of any failure that triggers a resource failover, all 400-ish DRBD resources, plus the FTP service, would have to fail over together at the same time. That seems kinda clunky, and I am trying to minimize the number of moving parts.

2. I don’t want to go with a single DRBD resource because then (a) all the eggs are in one basket, and (b) speed suffers because DRBD is single-threaded. Having multiple DRBD resources eliminates DRBD as a bottleneck.

3. We still have the overhead issue. When we create a 10T resource, zfs allocates 17TB, and 70% overhead cannot be explained by RAIDZ1 parity. Plus, we can seemingly “eliminate” the problem by setting the zfs “refreservation” value to the same size as the DRBD disk. Then a 10T DRBD disk only uses 10T in ZFS. So far, nobody I’ve asked can tell me if doing that creates a problem I’ll regret later.

-Eric

We addressed #3 by creating the zvols with a 128K volblocksize. Then the overhead went away.
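
Concretely, the two knobs discussed above look roughly like this (the dataset name is taken from the earlier zfs list output; a LINSTOR-created ZVOL will have its own name and would normally need the option applied at creation time by whatever is creating it):

# larger blocks shrink the RAIDZ parity/padding worst case; must be set when the zvol is created
zfs create -V 10T -o volblocksize=128K zpool0/drbd0_00000

# alternatively, cap the reservation at the nominal size after the fact
zfs set refreservation=10T zpool0/drbd0_00000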