When ZFS is above DRBD, Which One Handles a Rebuild?

Scenario: There are 2 x physical disks in a server. There are 2 x DRBD disks, 1 per physical disk. Above that, there is a ZFS mirror of the two DRBD disks.

If one of the physical disks fails and must be replaced, what should happen? Does DRBD handle the whole thing with a resync, or does ZFS do a mirror rebuild?

I guess this same question would apply if an LVM mirror is above DRBD.

NOTE: I am already aware that this is a non-canonical approach. If you want to ask why, that’s fine, I’ll explain, but please also answer the question.

When you say “2 x DRBD disks, 1 per physical disk”, presumably this means the physical disks are partitioned in some way. For example:

  • /dev/sda1 and /dev/sda2
  • /dev/sdb1 and /dev/sdb2

Then /dev/drbd0 runs between /dev/sda1 and /dev/sdb1, and /dev/drbd1 runs between /dev/sda2 and /dev/sdb2.

(Of course, it would be more common to run DRBD between disks in two different servers; otherwise DRBD doesn’t give you any value over mdraid or similar.)

On top of that, you have a ZFS mirrored vdev comprising /dev/drbd0 and /dev/drbd1.
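In which case the pool would have been created with something like this (the pool name “tank” is illustrative):

    # Run on the node where both DRBD devices are up and Primary
    zpool create tank mirror /dev/drbd0 /dev/drbd1
    zpool status tank    # should show a single mirror vdev of drbd0 and drbd1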

Does this describe your scenario accurately?

If one of the physical disks fails and must be replaced, what should happen? Does DRBD handle the whole thing with a resync, or does ZFS do a mirror rebuild?

The former.

When the disk fails, DRBD will redirect all the I/O to the other side of the DRBD mirror. ZFS won’t even notice that something has failed. When you replace the disk, DRBD will copy the data back from the other disk (both partitions), and again ZFS won’t notice.
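Roughly, the replacement would look like this (the resource names r0/r1 are illustrative, and this is a sketch rather than a tested procedure):

    # After fitting the new disk and recreating its partitions:
    drbdadm create-md r0    # write fresh DRBD metadata on the new backing device
    drbdadm attach r0       # re-attach it; DRBD resyncs from the good copy
    drbdadm status r0       # watch the resync (cat /proc/drbd on DRBD 8.x)
    # ...and the same again for r1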

A subsequent ZFS scrub will verify the data on both sides of the ZFS mirror - and because of checksums, it can tell if either side of the ZFS mirror has become corrupted somehow, and fix it using the data from the other side (if that other copy is valid). This is something that is normally scheduled periodically, to fix “bit rot”.
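Running and checking a scrub is straightforward (pool name again illustrative):

    zpool scrub tank     # read every block and verify it against its checksum
    zpool status tank    # shows scrub progress and any errors found/repaired
    # Many distros already schedule a monthly scrub via cron or a systemd timer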

DRBD can’t do this; it writes the same data to both sides, but if one side were to become corrupted it wouldn’t notice. On readback you would get data from one side or the other - usually the closest DRBD replica if they were on different hosts. However, ZFS would notice if the read has a bad checksum, in which case it would read from the other partition (i.e. the other half of the ZFS mirror). Note that it has no way to access the other DRBD replica of the same partition. As far as ZFS is concerned, there are only two accessible copies of the data, not four.

Now, if DRBD were running with a single replica (e.g. because the second disk has already failed), then ZFS mirroring is the only data recovery mechanism which can come into play. But that would only succeed if the physical drive fails in such a way that one partition becomes inaccessible but the other doesn’t. This might be the case for a spinning drive where a few sectors or tracks have gone bad.

Similarly, ZFS mirroring would be the recovery mechanism if you had DRBD replication but for some reason one partition became inaccessible in both drives simultaneously, but the other partition was accessible in at least one drive. That would be an unusual situation though.

Thanks for replying. I may have done a poor job of describing the build.

It is like this:

/dev/nvme0n1 → /dev/drbd0 → replicates to separate server.
/dev/nvme1n1 → /dev/drbd1 → replicates to separate server.

zpool_mirror consists of /dev/drbd0 and /dev/drbd1. The zpool only exists on the primary DRBD node.
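Each DRBD resource is configured roughly like this (hostnames, addresses and resource names simplified):

    resource r0 {
      device    /dev/drbd0;
      disk      /dev/nvme0n1;
      meta-disk internal;
      on server1 {
        address 192.168.1.1:7788;
      }
      on server2 {
        address 192.168.1.2:7788;
      }
    }
    # r1 is the same but with /dev/drbd1, /dev/nvme1n1 and a different port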

You may ask, why am I even considering this? Because it allows me to create ZFS filesystems and run MySQL databases on them with very good performance. In my sysbench-tpcc tests, I’m getting about 3K TPS and 80K QPS.

The alternative (and canonical) approach is to run DRBD on top of ZFS, in which case there is only 1 DRBD disk and it uses a zvol as its backing device. Then I create an XFS or EXT4 filesystem on the DRBD disk. Using this setup, I get maybe 900 TPS and 20K QPS. I would use the canonical approach if it were not so slow.
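For reference, the canonical stack I tested was built roughly like this (pool name and zvol size illustrative):

    # ZFS directly on the NVMe drives, with a zvol as DRBD's backing device
    zpool create tank mirror /dev/nvme0n1 /dev/nvme1n1
    zfs create -V 500G tank/drbdback    # appears as /dev/zvol/tank/drbdback
    # Point the DRBD resource's "disk" option at /dev/zvol/tank/drbdback,
    # then mkfs.xfs /dev/drbd0 on the primary and mount it for MySQL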

OK, that’s clear. What I wrote before applies, except that the second half of each DRBD mirror is an NVMe drive in the second server, rather than /dev/sdbX in the first server. If /dev/nvme0n1 fails in the first server, reads and writes to /dev/drbd0 will be directed over the network to the second server - so they will be affected by network speed and latency. Then when you replace /dev/nvme0n1, DRBD will backfill it. ZFS mirroring does not come into play.

I wonder if the reason it’s so fast is that all your MySQL transactions are being batched up in RAM into a ZFS transaction group. You should check whether suddenly pulling the power actually loses committed database transactions (if that’s important for your application).
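A quick way to check the settings that matter here (the mysql dataset name is my assumption - adjust to yours):

    zfs get sync zpool_mirror/mysql    # sync=standard honours fsync via the ZIL;
                                       # sync=disabled batches commits in RAM
    mysql -e "SHOW VARIABLES LIKE 'innodb_flush_log_at_trx_commit'"
    # 1 = flush InnoDB's log at every commit (durable);
    # 0 or 2 = up to ~1 second of commits can be lost on power failure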

In principle you don’t need ZFS mirroring to get this speed. You could just run ZFS on top of /dev/drbd0 - but you would lose ZFS’s repair of bit rot.
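That simpler layout would just be (pool name illustrative):

    zpool create tank /dev/drbd0    # single-vdev pool: same write path, but
                                    # no second copy for ZFS to self-heal from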

I suppose you could run ZFS mirroring between two nodes, with the remote disk accessed via a network protocol like NBD or iSCSI - which would eliminate DRBD - but I’ve never attempted such a setup. I don’t know how well ZFS would cope if the remote disk became intermittently unavailable. You might want to search and see if anyone else has tried a setup like this.
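For what it’s worth, a very rough and untested sketch of what I mean, using NBD (export name, hostnames and pool name all illustrative):

    # On server2: export the raw disk with nbd-server, e.g. an export section
    # in /etc/nbd-server/config pointing at its /dev/nvme0n1
    # On server1:
    nbd-client -N export0 server2 /dev/nbd0    # remote disk appears as /dev/nbd0
    zpool create tank mirror /dev/nvme0n1 /dev/nbd0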