DRBD 9.3.0: resync may stall in a 3-node cluster and block I/O after an Inconsistent node is promoted

Environment

  • DRBD version: 9.3.0

  • Cluster size: 3 diskful nodes

  • Resource: r1

  • Nodes: drbd1, drbd2, drbd3

  • Protocol: C

  • Single primary (no dual-primary)


Problem Description

In a 3-node DRBD cluster we occasionally encounter a situation where resynchronization stalls and all write I/O to the mounted filesystem blocks indefinitely.

When this happens:

  • resync progress stops

  • writes to the mounted filesystem hang

  • the filesystem cannot be unmounted (umount blocks)

  • drbdadm status shows "resync-suspended: peer" and "resync-suspended: dependency" between nodes

Restarting DRBD on one node (drbdadm down, then drbdadm up) changes the resync topology, after which the system recovers.


Reproduction Scenario

The issue can be reproduced with the following sequence.

Nodes involved:

drbd1
drbd2
drbd3

Initial state: all nodes are UpToDate.


Step 1

Promote drbd3 to Primary and mount the filesystem.


Step 2

Disconnect the network between drbd3 and drbd2.


Step 3

Write data on drbd3.

Result:

drbd2 becomes Outdated

Step 4

Restore the network between drbd3 and drbd2.

Result:

drbd3 → drbd2 resync starts
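
The steps do not depend on how the link is cut. One way to do it is a pair of packet-filter rules on one of the two nodes. The sketch below assumes iptables is available. PEER_IP is a hypothetical placeholder for the other node's replication address, and the helper names run, cut_link and restore_link are made up for illustration. By default the script only prints the commands.

```shell
#!/bin/sh
# Sketch: cut and restore traffic to one peer with iptables.
# PEER_IP is a hypothetical placeholder; set it to the peer's replication address.
PEER_IP="${PEER_IP:-192.0.2.12}"

# Print commands unless DRY_RUN=0, so the script is safe to try.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Drop all packets from and to the peer (the "disconnect" steps).
cut_link() {
    run iptables -I INPUT -s "$PEER_IP" -j DROP
    run iptables -I OUTPUT -d "$PEER_IP" -j DROP
}

# Remove the same two rules again (the "restore" steps).
restore_link() {
    run iptables -D INPUT -s "$PEER_IP" -j DROP
    run iptables -D OUTPUT -d "$PEER_IP" -j DROP
}
```

Calling cut_link with DRY_RUN=0 as root applies the rules; restore_link removes exactly the rules that cut_link added.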

Step 5

Promote drbd2 to Primary and mount the filesystem.

Then disconnect the network between drbd2 and drbd3.

Because drbd2 is now Primary and receives writes:

drbd3 becomes Outdated

Step 6

Restore the network between drbd2 and drbd3.

The disk states are now:

drbd1 : UpToDate
drbd2 : Inconsistent (Primary)
drbd3 : Outdated

DRBD chooses the following resync direction:

drbd3 → drbd2

Step 7

Write data on drbd2.

At this point the problem occurs.
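
To make the sequence easier to compare across runs, the state after each step can be recorded on every node. The following is a minimal sketch: it only uses drbdadm status, the log path is a hypothetical choice, and snapshot is a helper name made up for illustration.

```shell
#!/bin/sh
# Sketch: append a timestamped "drbdadm status" snapshot to a log after each step.
RES="${RES:-r1}"                 # resource name from this report
LOG="${LOG:-./drbd-repro.log}"   # hypothetical log file location

# Usage: snapshot "after step 3"
snapshot() {
    {
        # Header line: UTC timestamp, node name, free-form label.
        printf '=== %s %s: %s ===\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$(uname -n)" "$1"
        # Capture stderr as well, so a failing status call is visible in the log.
        drbdadm status "$RES" 2>&1
    } >> "$LOG"
}
```

Running snapshot with a step label on drbd1, drbd2 and drbd3 after each step gives three logs that can be lined up by timestamp.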


Observed Behavior

  1. The resync from drbd3 → drbd2 stalls (no further progress).

  2. The filesystem cannot be unmounted.

DRBD status shows the following relationship:

drbd1 → drbd2
    drbd1: resync-suspended: peer
    drbd2: resync-suspended: dependency

This suggests that:

  • drbd1 → drbd2 resync is waiting for another resync to complete

  • the drbd3 → drbd2 resync is expected to complete first

However, the drbd3 → drbd2 resync shows "suspended: no" but makes no progress.

As a result, the whole system appears to be stuck.
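
"No progress" can be checked mechanically by sampling the out-of-sync counters twice. The sketch below assumes that drbdsetup status with --verbose --statistics prints an out-of-sync:<n> field per peer volume, as it does on our nodes. The sampling interval is arbitrary and the function names are made up for illustration.

```shell
#!/bin/sh
# Sketch: flag a resync whose out-of-sync counters do not move between two samples.
RES="${RES:-r1}"             # resource name from this report
INTERVAL="${INTERVAL:-10}"   # seconds between samples (arbitrary choice)

# Print every "out-of-sync:<n>" token found on stdin, one per line.
extract_oos() {
    grep -o 'out-of-sync:[0-9]*'
}

# Succeed when both snapshots are identical and at least one counter is non-zero.
looks_stalled() {
    [ "$1" = "$2" ] && printf '%s\n' "$1" | grep -q 'out-of-sync:[1-9]'
}

# Take two samples from the live resource and compare them.
check_resync() {
    first=$(drbdsetup status "$RES" --verbose --statistics | extract_oos)
    sleep "$INTERVAL"
    second=$(drbdsetup status "$RES" --verbose --statistics | extract_oos)
    if looks_stalled "$first" "$second"; then
        echo "possible stall: out-of-sync counters unchanged for ${INTERVAL}s"
    else
        echo "counters changed or nothing left to sync"
    fi
}

# Run only when explicitly requested, e.g.: RUN=1 sh check-resync.sh
if [ "${RUN:-0}" = 1 ]; then check_resync; fi
```

A resync that is legitimately suspended (for example "resync-suspended: dependency") also leaves its counter unchanged, so a positive result only means "not moving" and has to be read together with the suspended flags.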


Recovery

If we restart DRBD on drbd3:

drbdadm down r1
drbdadm up r1

The resync topology changes to:

drbd1 → drbd2
drbd1 → drbd3

Both nodes resync from drbd1, and the cluster returns to normal operation.


Question

Is this behavior expected in DRBD 9.3.0 when a node that is still Inconsistent is promoted to Primary and receives writes?

Or could this indicate a resync scheduling issue, where DRBD chooses a suboptimal sync source (the Outdated node drbd3 rather than the UpToDate node drbd1) and the resync then stalls?

Any guidance on how to avoid or diagnose this situation would be appreciated.

The drbdadm status results follow: