In this development cycle, we had two bug reports worth mentioning. One
was about a perplexing kernel crash stack trace. Fortunately, the
customer, who runs a big fleet of DRBD nodes, could reproduce it and
provide a kernel crash dump.
Joel and I took it as a challenge to solve the puzzle. Working
together, we examined a “partly overwritten” malloc()'d object,
reconstructed what had happened, and found the bug.
After finding the bug, I looked at the git history to understand how we
introduced that bug.
These bugs increasingly stem from an unintended interaction between two
developers’ changes. The lesson for us: utmost clarity of interfaces,
careful naming of locks and data structures, and clear expression of
intent are key to avoiding such bugs in the future.
The other crash in DRBD was triggered, occasionally, by the Kubernetes
CSI driver. The reason was that it downed resources in an unusual way,
deleting all minors first. Fortunately, fixing that took a lot less
effort than the first one.
It will be welcome news for Ubuntu users that we tested heavily with the
upcoming Ubuntu 24.04 (“Noble”) release. Besides updating the kernel
compat for Linux 6.8, we noticed that the newer kernel delivers netlink
packets out of order more often. We use these netlink packets to deliver
information from the DRBD kernel driver to user space. If you use
drbd-reactor or LINSTOR on Ubuntu Noble, you need drbd-utils 9.28.0 (or
newer) to get a drbdsetup that brings out-of-order netlink packets back
into sequence.
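To illustrate what "bringing packets back into sequence" means, here is a minimal Python sketch of a reorder buffer. It is not how drbdsetup implements it; it only assumes that every message carries a sequence number that increases by one, and the names `Resequencer`, `first_seq` and `push` are hypothetical.

```python
class Resequencer:
    """Release messages in sequence order, holding back any that arrive early.

    Illustrative sketch only: assumes each message has a sequence number
    that increases by exactly one from message to message.
    """

    def __init__(self, first_seq=0):
        self.next_seq = first_seq  # sequence number expected next
        self.pending = {}          # early arrivals, keyed by sequence number

    def push(self, seq, payload):
        """Accept one message; return the payloads that are now deliverable."""
        if seq < self.next_seq or seq in self.pending:
            return []  # already delivered or already buffered: ignore duplicate
        self.pending[seq] = payload
        ready = []
        # Drain the buffer for as long as the next expected message is present.
        while self.next_seq in self.pending:
            ready.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return ready


if __name__ == "__main__":
    r = Resequencer(first_seq=1)
    for seq, event in [(2, "b"), (1, "a"), (4, "d"), (3, "c")]:
        for delivered in r.push(seq, event):
            print(delivered)
```

A consumer that processes only what `push` returns sees the events in their original order, regardless of the order in which they arrived. A real implementation would also need a policy for messages that never arrive, which this sketch leaves out.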
9.2.9 (api:genl2/proto:86-122/transport:19)
- Allow resync operations between secondaries if the sync source is
  not connected with the primary node
- Changes merged from 9.1.20
- Fix a kernel crash that is sometimes triggered when downing drbd
  resources in a specific, unusual order (was triggered by the
  Kubernetes CSI driver)
- Fix a rarely triggering kernel crash upon adding paths to a
  connection by overhauling the path lists’ locking
- Fix the continuation of an interrupted initial resync
- Fix the state engine so that an incapable primary does not outdate
  indirectly reachable secondary nodes
- Fix a logic bug that caused drbd to pretend that a peer’s disk is
  outdated when doing a manual disconnect on a down connection; this
  also cures the resulting impact on fencing and quorum
- Fix forceful demotion of suspended devices
- Overhaul the build system to apply compatibility patches out of
  place, which allows one to build for different target kernels from a
  single drbd source tree
- Updated compatibility code for Linux 6.8