High CPU Usage on drbd_s_writecac during benchmarking

When benchmarking to determine the viability of using DRBD for a solution, I've found that I can't hit my raw disk performance because the drbd_s_writecac process (which appears to buffer the changes for outbound replication) is hitting the limits of a single core.
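
For what it's worth, this is roughly how I'm watching the per-task CPU while fio runs (the grep pattern is just illustrative and assumes the usual drbd_s_/drbd_r_/drbd_w_ thread naming):

# per-task CPU every second; the sender thread shows up as drbd_s_<resource>
pidstat -p ALL 1 | grep drbd_
# or interactively, sorted by CPU:
top -H -d 1

If drbd_s_writecac sits pinned at ~100% of one core while the fio workers still have headroom, the sender thread is the bottleneck.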

Benchmarking the local disk set directly I see ~600k IOPS, but when I place it behind DRBD (with and without a remote replica) and run the same benchmark I get ~190k IOPS.

Raw Disks:
read: IOPS=631k, BW=4929MiB/s (5169MB/s)(32.0GiB/6645msec)
write: IOPS=631k, BW=4933MiB/s (5173MB/s)(32.0GiB/6645msec); 0 zone resets

DRBD:
read: IOPS=180k, BW=1408MiB/s (1476MB/s)(8438MiB/5994msec)
write: IOPS=180k, BW=1406MiB/s (1475MB/s)(8429MiB/5994msec); 0 zone resets

I've tried every combination of settings I could imagine would have any impact, including removing everything that wasn't absolutely necessary. I'm at a loss, as it seems like this may be a design issue with the cache/buffer process.

DRBD version: 9.2.10 (api:2/proto:86-122)

fio --name=fiotest --filename=/mnt/test/test --size=4Gb --rw=randrw --bs=8K --direct=1 --numjobs=32 --ioengine=libaio --iodepth=32 --group_reporting --runtime=60

resource XXYY {

    volume 0 {
           device minor 1;
           disk /dev/md/XXYY;
           meta-disk internal;
    }

    on XXYY-a.local {
            node-id 0;
    }
    on XXYY-b.local {
            node-id 1;
    }

    disk {
            c-plan-ahead 0;
            al-updates no;
    }

    connection {
            path {
                    host XXYY-a.local address 123.123.123.1:7789;
                    host XXYY-b.local address 123.123.123.2:7789;
            }
            path {
                    host XXYY-a.local address 123.123.123.3:7789;
                    host XXYY-b.local address 123.123.123.4:7789;
            }
            net {
                    transport rdma;
                    protocol A;
                    max-buffers 128k;
                    sndbuf-size 2048k;
                    rcvbuf-size 2048k;
            }
    }

}
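
For completeness, the effective runtime settings can be cross-checked against the file above with drbdsetup (drbd-utils 9.x):

# dump the configuration DRBD is actually running with
drbdsetup show XXYY
# replication state plus send/receive statistics per peer
drbdsetup status XXYY --verbose --statistics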

Any advice is welcome, thank you!

I believe you essentially have it figured out. You’re hitting the limitation of a single core. I tend to see IOPS max out around 350k when testing myself, but I test with 4k block size. You’re getting a little over half that with 8k blocks.

While it might not be the most graceful solution, I have seen people divide the storage into multiple DRBD resources, so that each resource gets its own set of threads, and then aggregate them back together using something like LVM.
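
Roughly, that layout looks something like the sketch below. Everything here is a placeholder (four hypothetical resources r0-r3 on minors 1-4, the VG/LV names, the 64k stripe size), not something taken from your config:

# one DRBD resource per slice of the backing storage, each with its own
# drbd_s_/drbd_r_ thread pair
drbdadm up r0 r1 r2 r3
drbdadm primary r0 r1 r2 r3

# stripe the resulting DRBD devices back together with LVM
pvcreate /dev/drbd1 /dev/drbd2 /dev/drbd3 /dev/drbd4
vgcreate vg_drbd /dev/drbd1 /dev/drbd2 /dev/drbd3 /dev/drbd4
lvcreate --type striped --stripes 4 --stripesize 64k --extents 100%FREE --name lv_fast vg_drbd

Your fio test would then run against a filesystem on /dev/vg_drbd/lv_fast, with each sender thread handling roughly a quarter of the write stream. The trade-off is managing four resources (and keeping the LVM filter from scanning the backing devices) instead of one.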


I appreciate your commentary; I was making a last-ditch effort here on the forums to see if there was something obvious I was missing. I had considered the solution you outlined (although, not being intimate with the innards, I wasn't sure whether it would run into the same issue, so it's good to know it will likely give the result I want), but I was avoiding it because it adds complexity and overhead. Fixing this kind of issue usually requires something heavy-handed, so I doubt it will be resolved anytime soon, but it's good to know others have encountered a similar headwind.