Environment

- DRBD version: 9.3.0
- Cluster size: 3 diskful nodes
- Resource: r1
- Nodes: drbd1, drbd2, drbd3
- Protocol: C
- Single primary (no dual-primary)
Problem Description

In a 3-node DRBD cluster we occasionally encounter a situation where resynchronization stalls and all write I/O to the mounted filesystem blocks indefinitely.

When this happens:

- resync progress stops
- writes to the mounted filesystem hang
- the filesystem cannot be unmounted (`umount` blocks)
- DRBD status shows `peer`/`dependency` resync-suspended states between nodes

Restarting DRBD on one node (`drbdadm down` / `drbdadm up`) changes the resync topology and the system recovers.
Reproduction Scenario

The issue can be reproduced with the following sequence.

Nodes involved: drbd1, drbd2, drbd3

Initial state: all nodes are UpToDate.
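Before starting, we confirm the initial state on each node (a quick check, assuming the resource is already up and connected everywhere):

```shell
# On each of drbd1, drbd2, drbd3: the local disk and both peer-disks
# should report UpToDate, and both connections should be Connected.
drbdadm status r1
```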
Step 1

Promote drbd3 to Primary and mount the filesystem.

Step 2

Disconnect the network between drbd3 and drbd2.

Step 3

Write data on drbd3.

Result: drbd2 becomes Outdated.

Step 4

Restore the network between drbd3 and drbd2.

Result: drbd3 → drbd2 resync starts.

Step 5

Promote drbd2 to Primary and mount the filesystem, then disconnect the network between drbd2 and drbd3.

Because drbd2 is now Primary and receives writes, drbd3 becomes Outdated.

Step 6

Restore the network between drbd2 and drbd3.

The cluster state is now:

- drbd1 : UpToDate
- drbd2 : Inconsistent (Primary)
- drbd3 : Outdated

DRBD chooses the following resync direction: drbd3 → drbd2.

Step 7

Write data on drbd2.

At this point the problem occurs.
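The sequence above can be sketched as a command list. The device path (`/dev/drbd1`), mount point (`/mnt/r1`), and replication port (7789) are assumptions; substitute your own. The network cut is shown with iptables, but any method of severing the replication link reproduces the same states. Since the cluster is single-primary, drbd3 is demoted before drbd2 is promoted in step 5.

```shell
## Step 1 -- on drbd3: promote and mount
drbdadm primary r1
mount /dev/drbd1 /mnt/r1

## Step 2 -- on drbd3: cut the replication link to drbd2
iptables -A INPUT  -s drbd2 -p tcp --dport 7789 -j DROP
iptables -A OUTPUT -d drbd2 -p tcp --dport 7789 -j DROP

## Step 3 -- on drbd3: write data (drbd2 becomes Outdated)
dd if=/dev/urandom of=/mnt/r1/test1 bs=1M count=100 conv=fsync

## Step 4 -- on drbd3: restore the link (drbd3 -> drbd2 resync starts)
iptables -D INPUT  -s drbd2 -p tcp --dport 7789 -j DROP
iptables -D OUTPUT -d drbd2 -p tcp --dport 7789 -j DROP

## Step 5 -- demote drbd3, promote drbd2, then cut drbd2 <-> drbd3
umount /mnt/r1 && drbdadm secondary r1            # on drbd3
drbdadm primary r1 && mount /dev/drbd1 /mnt/r1    # on drbd2
iptables -A INPUT  -s drbd3 -p tcp --dport 7789 -j DROP   # on drbd2
iptables -A OUTPUT -d drbd3 -p tcp --dport 7789 -j DROP   # on drbd2
dd if=/dev/urandom of=/mnt/r1/test2 bs=1M count=100 conv=fsync  # drbd3 becomes Outdated

## Step 6 -- on drbd2: restore the link (DRBD picks drbd3 -> drbd2 resync)
iptables -D INPUT  -s drbd3 -p tcp --dport 7789 -j DROP
iptables -D OUTPUT -d drbd3 -p tcp --dport 7789 -j DROP

## Step 7 -- on drbd2: write again; the resync stalls and I/O hangs
dd if=/dev/urandom of=/mnt/r1/test3 bs=1M count=100 conv=fsync
```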
Observed Behavior

- The resync from drbd3 → drbd2 stalls (no further progress).
- The filesystem cannot be unmounted.

DRBD status shows the following relationship:

    drbd1 → drbd2
    drbd1: resync-suspended: peer
    drbd2: resync-suspended: dependency

This suggests that:

- the drbd1 → drbd2 resync is waiting for another resync to complete
- the drbd2 ← drbd3 resync is expected to complete first

However, the drbd3 → drbd2 resync shows "suspended: no" but makes no progress.

As a result, the whole system appears to be stuck.
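For diagnosis we inspect the stalled state with the standard DRBD 9 tooling (resource name `r1` from above); taken together these views show which peer each resync is waiting on and why it is marked suspended:

```shell
# Run on any node.
drbdadm status r1                            # summary: role, disk, per-peer replication state
drbdsetup status r1 --verbose --statistics   # detail: resync progress and resync-suspended reason
drbdsetup events2 r1 --now                   # one-shot machine-readable dump of the current state
```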
Recovery

If we restart DRBD on drbd3:

    drbdadm down r1
    drbdadm up r1

the resync topology changes to:

    drbd1 → drbd2
    drbd1 → drbd3

Both nodes resync from drbd1, and the cluster returns to normal operation.
Question
Is this behavior expected in DRBD 9.3.0 when a node that is still Inconsistent is promoted to Primary and receives writes?
Or could this indicate a resync scheduling issue where DRBD chooses a suboptimal sync source (Outdated node) and the resync pipeline becomes stalled?
Any guidance on how to avoid or diagnose this situation would be appreciated.

