We are trying to decide whether the default strategy (mask the error and detach from the disk) is the best approach for our use case. We're considering creating a RAID-Z on top of DRBD. If we allow DRBD to pass I/O errors upward, then ZFS can handle them automatically. However, if we use the default strategy, ZFS will remain unaware, and we will have to address the failures manually. Per the User Guide, DRBD detaches on the first occurrence of any I/O error; there are no failure type or threshold parameters. Given that bitwise errors, sector read/write errors, etc., are presumably more common than total disk failures, it seems like allowing ZFS to perform its magic could save work, downtime, and money (possibly fewer unnecessary drive replacements, although I realize there's an argument to be made that once a disk produces its first error, more may be likely to follow. Maybe). Are there downsides to this approach?
This might help me. If the error handling strategy is pass-on, what happens in the case of a total failure of the backing device? What role/connection state/disk state does a resource go into?
Please bear in mind that the forums here are only roughly 100 days old. It is our hope that as more users discover and join, this will grow to be a much more active community. The team at LINBIT tries to assist when time allows, but it's our hope that this will grow into a place where the community can also help each other.
The on-io-error options are detach, pass_on, or you can invoke a handler. While it personally doesn't make much sense to me, perhaps you could implement this "allow 3 errors then detach" functionality with a custom handler script? I am not sure how you would best record/count the errors though.
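A rough sketch of what such a handler could look like, assuming the resource is configured with on-io-error call-local-io-error; plus a handlers { local-io-error "/usr/local/bin/drbd-io-error-handler"; } entry, and assuming DRBD exports the resource name to the handler via the DRBD_RESOURCE environment variable (the script path, counter location, and threshold are all just illustrative, and how call-local-io-error treats the failed request itself would still need testing):

#!/bin/bash
# Hypothetical local-io-error handler: count I/O errors per resource and
# only detach the backing disk after the third one.
COUNT_FILE="/run/drbd-io-errors-${DRBD_RESOURCE}"
count=$(( $(cat "$COUNT_FILE" 2>/dev/null || echo 0) + 1 ))
echo "$count" > "$COUNT_FILE"
if [ "$count" -ge 3 ]; then
    logger "drbd: ${DRBD_RESOURCE} hit ${count} I/O errors, detaching"
    drbdadm detach "${DRBD_RESOURCE}"
fi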
Why not simply use ZFS to back the DRBD devices? This is definitely the more common approach; it would still get you the benefits of both RaidZ and DRBD, and I would expect you would keep RaidZ's error handling.
Sorry, I didn’t consider that the forum is very young. I’ve been using DRBD since 2006 and I talk it up whenever I get the chance, which is often. I only discovered the Slack channel last year, and the Community forum a few months ago. I think the forum is a great idea. Unfortunately, it’s easy to get impatient when you have $50K worth of hardware sitting dormant while you try to figure out your storage stack.
The on-io-error options are detach, pass_on, or you can invoke a handler. While it personally doesn't make much sense to me, perhaps you could implement this "allow 3 errors then detach" functionality with a custom handler script? I am not sure how you would best record/count the errors though.
Yes, that doesn’t make much sense to me, either. I referenced the section from the User Guide so people would know I read it. However, it leaves some questions open.
Why not simply use ZFS to back the DRBD devices? This is definitely the more common approach; it would still get you the benefits of both RaidZ and DRBD, and I would expect you would keep RaidZ's error handling.
- Then you would have ZFS above and below DRBD, or you end up with a stack like ZFS+DRBD+LVM+XFS. Both approaches have their own complexities and performance challenges. Generally, the fewer layers in the stack, the better, so I was thinking that DRBD+ZFS would be simplest and best.
- Also, I was considering implementing the suggestion in the Linstor User Guide, section 2.9.2., Using Storage Pools To Confine Failure Domains to a Single Back-end Device. You can only do that if you have one physical backing device per DRBD disk. I would not use Linstor in that case. I would just configure one DRBD resource per disk manually to accomplish the same failure domain confinement.
Personally I wouldn’t risk putting anything below ZFS other than physical disks. Consider either using ZFS or LVM below DRBD as that’s been a proven and well tested setup.
What risks do you foresee? It seems to be working well so far in testing.
I would comment in the same direction as gianni.milo, although I have no really clear technical argument.
My assumption is that ZFS is designed to work best when it manages both the drives and the filesystem itself.
As a matter of fact, I assumed the pass-on mode does not exist. But maybe it would be a good idea?
I added another scenario, as devin suggested.
Hi,
pass-on is a selectable mode for DRBD.
The design in your drawing is in fact one we had also considered. The only difference between yours and the traditional one I posted is that you have removed the LVM layer between DRBD and the filesystems. Either way, we rejected it because there is only one application, an FTP server. Since the application and the DRBD filesystems must always be colocated, if any DRBD resource needs to fail over, then all 400+ would have to fail over. That's a bit much.
DRBD's disk { on-io-error pass_on; } setting doesn't pass the IO error up the stack (to ZFS in your example case). The pass_on option will mark the DRBD device as Inconsistent and pass the IO to a peer. As long as the peer completes the IO without error, which it should be able to do, the IO will complete successfully and ZFS will not have detected anything wrong.
So, ZFS does not get to perform magic in either the detach or pass_on scenario, as the IO error will be masked by DRBD.
Hi Matt,
That’s the opposite of what it says in the User Guide. The two strategies described there are:
Passing on I/O errors
If DRBD is configured to pass on I/O errors, any such errors occurring on the lower-level device are transparently passed to upper I/O layers. Therefore, it is left to upper layers to deal with such errors (this may result in a file system being remounted read-only, for example). This strategy does not ensure service continuity, and is therefore not recommended for most users.
Masking I/O errors
If DRBD is configured to detach on lower-level I/O error, DRBD will do so, automatically, upon occurrence of the first lower-level I/O error. The I/O error is masked from upper layers while DRBD transparently fetches the affected block from a peer node, over the network. From then onwards, DRBD is said to operate in diskless mode, and carries out all subsequent I/O operations, read and write, on the peer node(s) only. Performance in this mode will be reduced, but the service continues without interruption, and can be moved to the peer node in a deliberate fashion at a convenient time.
The “upper layer” in my case would be ZFS.
You're correct there, and we'll get that updated today because it is incorrect.
From the drbd.conf man page:
on-io-error handler
handler is taken, if the lower level device reports io-errors to the upper layers.
handler may be pass_on, call-local-io-error or detach.
pass_on: The node downgrades the disk status to inconsistent, marks the erroneous
block as inconsistent in the bitmap and retries the IO on the remote node.
call-local-io-error: Call the handler script local-io-error.
detach: The node drops its low level device, and continues in diskless mode.
As long as the IO that was “passed on” to a peer node succeeds there, the local DRBD device will not return an error to the application above it.
I tested this (with simulated bad sectors) in a local environment and could see that DRBD detected the IO error when reading/writing to that area of disk but ZFS never learned about it.
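For anyone who wants to reproduce something similar, one way to simulate a small range of bad sectors is a device-mapper table that maps that range to the error target, roughly like the sketch below (the backing device name, sizes, and offsets are purely illustrative):

# build a dm device that returns I/O errors for a 2048-sector range
BACKING=/dev/sdb
SIZE=$(blockdev --getsz "$BACKING")
dmsetup create faulty-backing <<EOF
0 260000 linear $BACKING 0
260000 2048 error
262048 $((SIZE - 262048)) linear $BACKING 262048
EOF
# then use /dev/mapper/faulty-backing as the DRBD resource's backing disk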
The DRBD device with the bad sectors was marked Inconsistent:
# drbdadm status
r0 role:Primary
  disk:Inconsistent
  linbit-1 role:Secondary
    peer-disk:UpToDate
  linbit-2 role:Secondary
    peer-disk:UpToDate

r1 role:Primary
  disk:UpToDate
  linbit-1 role:Secondary
    peer-disk:UpToDate
  linbit-2 role:Secondary
    peer-disk:UpToDate

r2 role:Primary
  disk:UpToDate
  linbit-1 role:Secondary
    peer-disk:UpToDate
  linbit-2 role:Secondary
    peer-disk:UpToDate

r3 role:Primary
  disk:UpToDate
  linbit-1 role:Secondary
    peer-disk:UpToDate
  linbit-2 role:Secondary
    peer-disk:UpToDate

r4 role:Primary
  disk:UpToDate
  linbit-1 role:Secondary
    peer-disk:UpToDate
  linbit-2 role:Secondary
    peer-disk:UpToDate
I see the messages in the logs from my read/write IOs:
# dmesg | tail -n6
[229370.177380] drbd r0/0 drbd10: Local IO failed in drbd_request_endio.
[229370.177498] drbd r0/0 drbd10: disk( UpToDate -> Inconsistent ) [local-io-error]
[229370.177514] drbd r0/0 drbd10: local WRITE IO error sector 260648+1024 on dm-1
[229407.595827] clocksource: Long readout interval, skipping watchdog check: cs_nsec: 1383675829 wd_nsec: 1383675831
[229721.475219] drbd r0/0 drbd10: Local IO failed in drbd_request_endio.
[229721.475285] drbd r0/0 drbd10: local READ IO error sector 261096+64 on dm-1
But nothing made it up to my RAID-Z:
# zpool status
pool: tank
state: ONLINE
config:
NAME          STATE     READ WRITE CKSUM
tank          ONLINE       0     0     0
  raidz1-0    ONLINE       0     0     0
    drbd10    ONLINE       0     0     0
    drbd11    ONLINE       0     0     0
    drbd12    ONLINE       0     0     0
    drbd13    ONLINE       0     0     0
    drbd14    ONLINE       0     0     0

errors: No known data errors
Indeed, the (wrong) description would be a very useful feature for filesystems with error correction abilities (ZFS, Btrfs) on the upper layer.
Thanks for letting me know about the error in the user guide and for doing the extra testing. That helps me understand better.
That said, ZFS can detect problems that DRBD would not necessarily see, such as invisible bit rot, so it still gets to do some magic, just not related to device i/o failures (unless it is deployed below DRBD).
Another reason I’m trying to use ZFS above DRBD has to do with flexibility. If I use XFS or EXT4 above DRBD, then I am left with bad choices:
- Put all 400 customer folders in one big 100TB XFS or EXT4 filesystem. That sounds like a nightmare waiting to happen.
- Create 400 separate filesystems and try to size each one appropriately to meet the needs of the customer. Some customers need a lot of space (3-4TB), others not much (10-100GB), and there is everything in between. With this approach, I either find myself frequently growing filesystems, or I size them some percentage higher than their current usage, which can be wasteful given that growth patterns are not the same between customers. With ZFS, by contrast, I just create datasets and let ZFS allocate space from the pool as needed.
Also, if I want to eliminate DRBD itself as a potential bottleneck, I must create multiple DRBD disks. To achieve any degree of flexibility, I must add them all to a volume group, so LVM becomes another layer. It doesn’t solve the sizing problem, but it does make it easier.
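As a rough illustration of what that extra layer involves (device names, volume group name, and sizes are made up):

# aggregate several DRBD devices into one volume group, then carve per-customer LVs
pvcreate /dev/drbd10 /dev/drbd11 /dev/drbd12 /dev/drbd13
vgcreate customers /dev/drbd10 /dev/drbd11 /dev/drbd12 /dev/drbd13
lvcreate -L 500G -n customer-x customers
mkfs.xfs /dev/customers/customer-x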
What do you think about having ZFS below and above DRBD?
I thought a bit about the topic and this is the result.
ZFS is a great filesystem. From the point of view of maximum flexibility, it would be desirable to have ZFS as both the lower AND the upper filesystem. From a performance point of view, I would assume that would be a very bad solution (I have read strong recommendations not to use a CoW filesystem on top of another), so you can only have one of these two options.
What are the advantages of ZFS?
- Self-healing: ZFS can repair data errors in redundant virtual devices (mirror, RAIDZx).
- Efficient transparent compression: ZFS can improve I/O throughput to drives by using LZ4 and thus reduce the amount of data written to disk/SSD.
- ARC (Adaptive Replacement Cache, i.e. a read cache): ZFS is good at reading from drives and caching data, reducing the number of read transactions from the device.
- Flexible datasets: ZFS can create flexible datasets on its pools.
- Snapshotting: ZFS can create snapshots efficiently and with ease of use. (A few illustrative commands follow this list.)
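In ZFS commands, these features map roughly to the following (pool and dataset names are just examples):

zpool create tank raidz1 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1   # redundant, self-healing RAIDZ vdev
zfs set compression=lz4 tank                                       # transparent LZ4 compression
zfs create -o quota=500G tank/customer-x                           # flexible per-customer dataset
zfs snapshot tank/customer-x@nightly                               # cheap point-in-time snapshot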
What are the primary advantages of ZFS at the lower level?
- Data corruption can effectively and efficiently be avoided
- Transparent compression can improve storage performance (because less data must be read and written)
What are the primary advantages of ZFS at the upper level?
- ZFS datasets can be resized very easily
- ZFS snapshots on a per customer level are very useful and efficient for backup purposes.
What disadvantages does ZFS have?
- Existing RAIDZ pools cannot be expanded with single drives. Neither single drives nor VDEVs can be removed from a pool once a RAIDZx VDEV has been added. One may add additional VDEVs, but that will most likely result in performance degradation. One may replace drives with bigger drives or add additional pools. One may use only striped mirrors instead of RAIDZx pools to keep full flexibility, if that is important, at the cost of reduced capacity for the same price and with a gain in performance.
Evaluation of these features regarding the use of ZFS as the lower or the upper filesystem
- For me, the most important point seems to be avoiding data corruption, using ZFS's integrity and self-healing features at the lower filesystem level. There seems to be no alternative method that does this in a similarly efficient way. There is dm-integrity for LVM, which can do that there, but performance drops by about 50% when using it.
- Ease of use when growing customer filesystems is also very important, since this will be normal daily business. This is very easy with ZFS, and I think it's still doable, even if a bit more complex, with DRBD/XFS alone. One could write a script that performs the different steps and is called with a single short command like "grow_instance customer_name 100G" (see the sketch after this list). I also think this could be fully automated.
- Snapshots as a backup possibility are also very attractive. The next best alternative to snapshots would be a good and efficient backup system that does deduplication. I'm very fond of the open source solution BackupPC, which I have been using for 1-2 decades now: high performance, very efficient, very reliable. BackupPC performs better when there are not too many files to back up. I have an instance with 26 billion files in its backup space. It's kind of high load, but it's still working. For this FTP server case, I would assume there are not so many tiny files but rather larger ones, compared with my case of backing up lots of operating systems.
- Transparent compression might not be that important in this scenario, since NVMe SSDs already have very high performance.
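A minimal sketch of such a grow script, assuming a DRBD-backed volume group named "customers" with one XFS logical volume per customer mounted under /srv/ftp (all names and paths are made up):

#!/bin/bash
# grow_instance <customer> <increase>, e.g.: grow_instance customer-x 100G
set -euo pipefail
customer="$1"
increase="$2"
lvextend -L "+${increase}" "/dev/customers/${customer}"   # grow the logical volume
xfs_growfs "/srv/ftp/${customer}"                         # grow the mounted XFS filesystem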
Based on my evaluation, I would use ZFS as the lower-level backing filesystem for this scenario.
Since the application and the DRBD filesystems must always be colocated, if any DRBD resource needs to fail over, then all 400+ would have to fail over. That's a bit much.
When there's a failover, 400+ FTP server instances have to fail over anyway. I would assume that this takes a lot more time than the DRBD failover. My intuition agrees with what @kermat said: pretty normal business. Here, too, you could have a little script that switches all instances over to the other node, or promotes them to primary when the formerly active node has crashed.
Thanks for the detailed response. Lots to think about.
When there's a failover, 400+ FTP server instances have to fail over anyway. I would assume that this takes a lot more time than the DRBD failover. My intuition agrees with what @kermat said: pretty normal business. Here, too, you could have a little script that switches all instances over to the other node, or promotes them to primary when the formerly active node has crashed.
There would be only one FTP server instance, with either (a) 400 separate DRBD disks with 400 filesystems, or (b) a handful of DRBD disks (maybe 4 to 8) in a volume group, and 400 LVs with 400 filesystems.
Ok. If there's only one FTP server, there is another solution: Linux quotas. I have not worked much with Linux quotas, because a) I rarely need them and b) ZFS is so much easier.
You can set the primary group for all users of customer-x to customer-x-group, so all uploads will carry that group. Set the per-customer quota on those groups and you have the customer quota exactly as they ordered it. No need for ZFS.
If a customer upgrades or downgrades the product: just change the quota.
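For illustration, on a big filesystem mounted with group quotas enabled, this could look roughly like the following (group names, user names, sizes, and the mount point are examples):

# make the customer group the primary group of that customer's FTP user
usermod -g customer-x ftp-customer-x
# 500 GiB hard block limit for the group (setquota takes 1 KiB blocks)
setquota -g customer-x 0 524288000 0 0 /srv/ftp
# product upgrade: just raise the limit to 1 TiB
setquota -g customer-x 0 1073741824 0 0 /srv/ftp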
The benefit of zfs in my scenario is that each customer has a fully separate filesystem, in which case corruption only impacts that one customer, and checking/rebuilding would be relatively fast. Using quotas, I’m probably back to using one big filesystem, which is a nightmarish option I am working to avoid. Also, quotas don’t provide RAID functionality, so I would have to bolt on mdraid or lvm raid. I’ve used them both before, and they are okay, but I’m trying to keep the number of stack layers to a minimum.
Ok, understood. You have surely had your (bad) experiences with that. I'm not that deep into your kind of scenario. Never mind.
It’s been years since I’ve used ZFS with FreeNAS/TrueNAS in production, but I’d argue ZFS excels below DRBD as it was designed to interface directly with true physical block devices. Yes, the filesystem features are neat, but do you really need them here?
ZFS will handle any errors from the physical storage devices, you can hot-swap devices, use cache, etc, and the volume management is a natural fit for creating backing disks (ZVOLs) for DRBD resources or letting a LINSTOR cluster manage the ZFS ZPOOL as the backing for the storage pool.
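For example (node, pool, and resource names are placeholders), you can either carve a ZVOL out of the pool by hand for a manually managed DRBD resource, or hand the zpool to LINSTOR as a storage pool:

# manual: create a ZVOL as the backing disk for one DRBD resource
zfs create -V 250G tank/drbd-r0
# or let LINSTOR manage the pool
linstor storage-pool create zfs node-a pool_zfs tank
linstor resource-definition create r0
linstor volume-definition create r0 250G
linstor resource create node-a r0 --storage-pool pool_zfs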
Not ideal, no. Didn't you mention a single FTP process though? Would all the separate mount points for 400 filesystems be a similar security concern?
ZFS dataset quotas would be easy to manage, sure. Still, 400 things to manage. Thinly provisioned LVM/ZFS volumes could be an option here to reclaim the “waste” for non-ZFS filesystems.
Either you're adjusting 400 ZFS dataset quotas, or simply resizing 400 resources (and the filesystems above each ZFS/LVM backing volume).
For example, in LINSTOR you can resize an existing resource’s volume (volume 0):
# resize resource's 10G volume to 15G
linstor volume-definition set-size <resource> 0 15G
# resize ext4 filesystem (assuming /dev/drbd1000)
resize2fs /dev/drbd1000
I think your biggest risk with putting DRBD in a ZFS layer sandwich or ZFS/LVM layer sandwich where the upper layer is composed from an aggregate of DRBD devices is that your entire upper layer is dependent on a handful of DRBD resources.
Does it work? Yes. Can that approach increase performance? Yes. Can that approach limit the overall number of DRBD resources? Sure. Would I do it on my own systems for this use case? Probably not. I don't like the extra dependencies on DRBD resources for the top layer.
I would lean towards this type of configuration:
[  FS  ] [  FS  ] [  FS  ] [  FS  ] [  FS  ]
[ DRBD ] [ DRBD ] [ DRBD ] [ DRBD ] [ DRBD ]
[ ZVOL ] [ ZVOL ] [ ZVOL ] [ ZVOL ] [ ZVOL ]
[ -------------- ZFS ZPOOL --------------- ]
[ NVMe ] [ NVMe ] [ NVMe ] [ NVMe ] [ NVMe ]
- Least amount of storage layers?
- Out-of-the-box LINSTOR configuration (ZFS storage pool)?
- Performance increase from multiple DRBD resources?
- Most DRBD resources?
By doing it this way, losing a couple of DRBD resources won't bring down a multitude of client filesystems, just a handful of them.