DRBD 9.0.23 – Changing resync-rate during active synchronization has no effect

Hello,

We are using DRBD version 9.0.23 in a production environment with a basic Primary/Secondary setup.

After a disconnection of the secondary node, a full resync was automatically triggered. The process is currently around 17% complete, but progressing very slowly (about 250 KB/s), which seems to match the default resync-rate.

We attempted to dynamically change the resync speed from the primary node using the following command:

drbdsetup peer-device-options drbd0 1 0 --resync-rate=4M

We also tried disabling the planning logic with:

drbdsetup peer-device-options drbd0 1 0 --resync-rate=4M --c-plan-ahead=0

Both commands are accepted without error, but the resync speed does not change and stays at the original rate (~250 KB/s).

We’d like to confirm the following:

  • Is it expected behavior in version 9.0.23 that --resync-rate changes do not take effect during an ongoing resync?
  • Is it necessary to disconnect/reconnect or restart the sync process for changes to apply?
  • Is there a safe way to restart the resync without losing current progress (currently 17%)?

Any confirmation or experience from others using this version would be very helpful.

Thanks in advance.

Firstly, you should never lose current resync progress. Yes, the timer will reset, but that doesn’t mean the resync starts over from scratch. For example: I have 300G to sync and have completed 100G, so the sync counter reads 33%. Then there is a network interruption and the resync restarts. Now there are only 200G left that need to be resynced, but my progress restarts at 0% because I’m 0% of the way through the now-remaining 200G.
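Spelled out with those numbers (drbd0 taken from your own commands; drbdsetup status is one way to watch the counters):

# before the interruption: 300G to sync, 100G done  -> counter shows 33%
# after the reconnect: only the remaining 200G counts -> counter restarts at 0% of 200G
# the 100G already synced is not transferred again; only the percentage/timer resets
drbdsetup status drbd0 --verbose --statistics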

Setting just --resync-rate is likely to do very little while the dynamic sync controller is running; it is really only used as a starting value when the sync rate controller kicks in.

With the sync rate controller disabled (i.e. --c-plan-ahead=0), I would expect this to make a difference. I suspect you may just need to increase the max-buffers value. Here is a link to a whole article I wrote on tuning the sync speeds: Tuning DRBD's Resync Controller
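If you want to double-check that the values are actually landing on the running resource (rather than just being accepted on the command line), dump the live configuration after setting them; for example, with the same IDs as in your own commands:

# pin a fixed rate with the controller disabled, as you already tried
drbdsetup peer-device-options drbd0 1 0 --c-plan-ahead=0 --resync-rate=4M
# then inspect what the kernel is really using (resync-rate, c-plan-ahead, max-buffers, ...)
drbdsetup show drbd0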

Hi Devin,

Thanks a lot for your previous reply; it was very helpful.

We’ve now successfully increased the resync speed from 250 KB/s to around 3 MB/s. This was achieved by tweaking the following parameters (applied at runtime with drbdsetup, roughly as shown after the list):

  • --c-plan-ahead=20
  • --resync-rate=3M
  • --c-min-rate=3M
  • --c-max-rate=6M
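Roughly, the call we used looked like this (same resource, peer node ID and volume as in our earlier commands):

drbdsetup peer-device-options drbd0 1 0 --c-plan-ahead=20 --resync-rate=3M --c-min-rate=3M --c-max-rate=6M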

From what we’ve observed, both --resync-rate and --c-min-rate seem to act as the guaranteed minimum sync speed. However, even with --c-max-rate set to 6M, the actual resync speed stays pinned at around 3 MB/s rather than scaling up toward the allowed maximum.

At this point, the issue is technically resolved and our production environment is stable, with many critical services running on this DRBD setup. However, your suggestion regarding --max-buffers caught our attention.

Would increasing --max-buffers potentially allow the sync controller to adjust speed more dynamically between the configured min and max rates (e.g. somewhere between 3M and 6M, instead of sticking to the minimum)?

Currently, our configuration uses --max-buffers=2048, and we’re considering increasing it to 8000.

Are there any other parameters we should consider tuning in conjunction with --max-buffers to further improve resync performance?

And most importantly: could increasing --max-buffers introduce any risk of service disruption, disk I/O issues, or instability on the primary node?

Here’s a snapshot of our current configuration for reference:

resource drbd0 {
    options {
        cpu-mask                ""; # default
        on-no-data-accessible   io-error; # default
        auto-promote            yes; # default
        peer-ack-window         4096s; # bytes, default
        peer-ack-delay          100; # milliseconds, default
        twopc-timeout           300; # 1/10 seconds, default
        twopc-retry-timeout     1; # 1/10 seconds, default
        auto-promote-timeout    20; # 1/10 seconds, default
        max-io-depth            8000; # default
        quorum                  off; # default
        on-no-quorum            suspend-io; # default
        quorum-minimum-redundancy       off; # default
    }
    _this_host {
        node-id                 1;
        volume 0 {
            device                      minor 0;
            disk                        "/dev/sdb1";
            meta-disk                   internal;
            disk {
                size                    0s; # bytes, default
                on-io-error             detach; # default
                disk-barrier            no; # default
                disk-flushes            yes; # default
                disk-drain              yes; # default
                md-flushes              yes; # default
                resync-after            -1; # default
                al-extents              1237; # default
                al-updates              yes; # default
                discard-zeroes-if-aligned       yes; # default
                disable-write-same      no; # default
                disk-timeout            0; # 1/10 seconds, default
                read-balancing          prefer-local; # default
                rs-discard-granularity  0; # bytes, default
            }
        }
    }
    connection {
        _peer_node_id 0;
        path {
            _this_host ipv4 192.168.1.9:7788;
            _remote_host ipv4 192.168.1.10:7788;
        }
        net {
            transport           ""; # default
            protocol            C; # default
            timeout             60; # 1/10 seconds, default
            max-epoch-size      2048; # default
            connect-int         10; # seconds, default
            ping-int            10; # seconds, default
            sndbuf-size         0; # bytes, default
            rcvbuf-size         0; # bytes, default
            ko-count            7; # default
            allow-two-primaries no; # default
            cram-hmac-alg       ""; # default
            shared-secret       ""; # default
            after-sb-0pri       disconnect; # default
            after-sb-1pri       disconnect; # default
            after-sb-2pri       disconnect; # default
            always-asbp         no; # default
            rr-conflict         disconnect; # default
            ping-timeout        5; # 1/10 seconds, default
            data-integrity-alg  ""; # default
            tcp-cork            yes; # default
            on-congestion       block; # default
            congestion-fill     0s; # bytes, default
            congestion-extents  1237; # default
            csums-alg           ""; # default
            csums-after-crash-only      no; # default
            verify-alg          ""; # default
            use-rle             yes; # default
            socket-check-timeout        0; # default
            fencing             dont-care; # default
            max-buffers         2048; # default
            allow-remote-read   yes; # default
            _name               "nodo2";
        }
        volume 0 {
            disk {
                resync-rate             3072k; # bytes/second
                c-plan-ahead            20; # 1/10 seconds, default
                c-delay-target          10; # 1/10 seconds, default
                c-fill-target           100s; # bytes, default
                c-max-rate              6144k; # bytes/second
                c-min-rate              3072k; # bytes/second
                bitmap                  yes; # default
            }
        }
    }
}

Thanks again for your help!

It’s likely not an issue with the tuning at this point, but rather a bottleneck caused by the low max-buffers value. I would advise raising max-buffers and then experimenting with the tuning further. Let me quote from the man page:

max-buffers number

Limits the memory usage per DRBD minor device on the receiving side, or for internal buffers during resync or online-verify. Unit is PAGE_SIZE, which is 4 KiB on most systems. The minimum possible setting is hard coded to 32 (=128 KiB). These buffers are used to hold data blocks while they are written to/read from disk. To avoid possible distributed deadlocks on congestion, this setting is used as a throttle threshold rather than a hard limit. Once more than max-buffers pages are in use, further allocation from this pool is throttled. You want to increase max-buffers if you cannot saturate the IO backend on the receiving side.

Try cranking this up to 40k. I’ve even used 80k in production without any negative side-effects.
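At runtime that is just the usual net-options call; something along these lines (resource name and peer node ID taken from the config you pasted, repeated on the other node with its peer ID), and then set the same value in the net {} section of the resource file so it survives a drbdadm adjust:

drbdsetup net-options drbd0 0 --max-buffers=40960   # "40k" pages of 4 KiB, roughly 160 MiB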

We applied the max-buffers change as suggested, increasing it from 2048 to 8192 on both nodes (secondary first, then primary) using:

drbdsetup net-options drbd0 0 --max-buffers=8192
drbdsetup net-options drbd0 1 --max-buffers=8192

At first, we observed a bit more fluctuation in the resync speed (as expected), but it has now settled back to much the same behaviour as before, averaging around 3.4 MB/s, even though we’ve allowed a c-min-rate of 4 MB/s and a c-max-rate of 8 MB/s.

We’re reviewing the other tuning parameters from the article and man page you referenced. We’ve already adjusted max-buffers, but we’d like your opinion on the rest: would it make sense to tweak any of the following further, and is there a specific combination you’ve seen work well together in your experience?

  • rs-discard-granularity
  • c-fill-target
  • sndbuf-size
  • rcvbuf-size

We were also considering adjusting sndbuf-size and rcvbuf-size as suggested in the man page. Would you recommend this?

Thanks again for all the help so far!

The specific combination of values I suggest is pretty much the one explained in the KB article I linked earlier:

  • resync-rate: 1/3 of c-max-rate
  • c-max-rate: 100% of, or slightly higher than, what your hardware can support
  • c-min-rate: 1/3 of c-max-rate
  • c-fill-target: 1M
  • max-buffers: 40k
  • sndbuf-size: 10M
  • rcvbuf-size: 10M
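Put into resource-file form it looks roughly like the snippet below. The 100M c-max-rate is only a placeholder for whatever your storage and replication link can actually sustain; the other rates follow from it as described above:

resource drbd0 {
    disk {
        c-plan-ahead   20;      # keep the dynamic controller enabled
        c-max-rate     100M;    # placeholder: what your hardware can sustain, or slightly more
        c-min-rate     33M;     # roughly 1/3 of c-max-rate
        resync-rate    33M;     # roughly 1/3 of c-max-rate (controller starting value)
        c-fill-target  1M;
    }
    net {
        max-buffers    40960;   # "40k" pages of 4 KiB, roughly 160 MiB
        sndbuf-size    10M;
        rcvbuf-size    10M;
    }
}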