Hey everyone!
I have the following situation: a Linux cluster (Debian 12, DRBD 9.22) is giving me the error message below, and the Linux host becomes unresponsive afterward. This affects different hardware setups, potentially with different network layouts as well.
[Mon Feb 5 09:11:21 2024] INFO: task drbd_r_omd:1957 blocked for more than 120 seconds.
[Mon Feb 5 09:11:21 2024] Tainted: G OE 4.19.0-24-amd64 #1 Debian 4.19.282-1
[Mon Feb 5 09:11:21 2024] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Mon Feb 5 09:11:21 2024] drbd_r_omd D 0 1957 2 0x80000004
[Mon Feb 5 09:11:21 2024] Call Trace:
[Mon Feb 5 09:11:21 2024] __schedule+0x29f/0x840
[Mon Feb 5 09:11:21 2024] schedule+0x28/0x80
[Mon Feb 5 09:11:21 2024] io_schedule+0x12/0x40
[Mon Feb 5 09:11:21 2024] wbt_wait+0x19b/0x300
[Mon Feb 5 09:11:21 2024] ? trace_event_raw_event_wbt_step+0x120/0x120
[Mon Feb 5 09:11:21 2024] rq_qos_throttle+0x31/0x40
[Mon Feb 5 09:11:21 2024] blk_mq_make_request+0x111/0x530
[Mon Feb 5 09:11:21 2024] generic_make_request+0x1a4/0x400
[Mon Feb 5 09:11:21 2024] ? md_handle_request+0x119/0x190 [md_mod]
[Mon Feb 5 09:11:21 2024] submit_bio+0x45/0x130
[Mon Feb 5 09:11:21 2024] ? md_super_write.part.63+0x90/0x120 [md_mod]
[Mon Feb 5 09:11:21 2024] write_page+0x203/0x330 [md_mod]
[Mon Feb 5 09:11:21 2024] ? md_bitmap_wait_writes+0x93/0xa0 [md_mod]
[Mon Feb 5 09:11:21 2024] ? aa_sk_perm+0x31/0x120
[Mon Feb 5 09:11:21 2024] md_bitmap_unplug+0xa1/0x120 [md_mod]
[Mon Feb 5 09:11:21 2024] flush_bio_list+0x1c/0xd0 [raid1]
[Mon Feb 5 09:11:21 2024] raid1_unplug+0xb9/0xd0 [raid1]
[Mon Feb 5 09:11:21 2024] blk_flush_plug_list+0xcf/0x240
[Mon Feb 5 09:11:21 2024] blk_finish_plug+0x21/0x30
[Mon Feb 5 09:11:21 2024] drbd_recv_header_maybe_unplug+0x101/0x120 [drbd]
[Mon Feb 5 09:11:21 2024] drbd_receiver+0x146/0x2e5 [drbd]
[Mon Feb 5 09:11:21 2024] ? drbd_destroy_connection+0xb0/0xb0 [drbd]
[Mon Feb 5 09:11:21 2024] drbd_thread_setup+0x71/0x130 [drbd]
[Mon Feb 5 09:11:21 2024] kthread+0x112/0x130
[Mon Feb 5 09:11:21 2024] ? kthread_bind+0x30/0x30
[Mon Feb 5 09:11:21 2024] ret_from_fork+0x1f/0x40
I did some research and Red Hat says:
It’s necessary to engage Linbit (the DRBD vendor) to analyze the IO bottleneck in the block device, that is reaching the saturation limit.
I continued my research, but the bottom line seems to be: You have an IO bottleneck and need to fix it.
Now the problem is that this runs on different hardware configurations, so I cannot easily upgrade the hardware, and I do not know how much throughput the network provides. So I need a way to fine-tune my DRBD configuration. There is an old entry in the Thomas Krenn Wiki (German) that provides some ideas, but changing the IO scheduler seems to be outdated advice, and I am not yet sure I want to touch sysctl options affecting the memory stack.
So right now I am leaning towards tweaking my DRBD configuration. This is what I currently have:
resource omd {
    disk {
        resync-rate 33M;
        c-fill-target 1M;
        c-min-rate 35M;
        c-max-rate 110M;
    }
    net {
        max-buffers 8000;
        max-epoch-size 8000;
        sndbuf-size 1M;
        rcvbuf-size 1M;
        verify-alg crc32c;
    }
}
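To make it concrete what kind of tweaking I have in mind, here is a rough sketch of the direction I am considering. The values are placeholders I have not tested, and I may well be misreading the man page, so please correct me:

resource omd {
    disk {
        # assumption: let the dynamic resync controller do the pacing;
        # the c-plan-ahead value here is a placeholder, not a recommendation
        c-plan-ahead 20;
        c-fill-target 1M;
        c-min-rate 35M;
        c-max-rate 110M;
    }
    net {
        # assumption: 0 should let the kernel auto-tune the TCP buffers
        # instead of pinning them to 1M
        sndbuf-size 0;
        rcvbuf-size 0;
        max-buffers 8000;
        max-epoch-size 8000;
        verify-alg crc32c;
    }
}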
Maybe someone here can point me to where I might be able to tweak things. I am looking at ko-count, but could only find a KB article for DRBD Proxy (see the first comment, as I am a new user and can only include two links in this post), and I am unsure if/how that applies to DRBD itself. Upon re-reading this and looking at my logs, I am becoming unsure whether this value will actually help.
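If I understand the drbd.conf man page correctly, ko-count is a net-section option, so I assume it would go somewhere like this (the value is just an example I made up, not a recommendation):

resource omd {
    net {
        # as I understand it: how many timeouts the peer may miss on a
        # pending write before the connection is dropped (example value)
        ko-count 7;
    }
}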
I appreciate any input in this matter, thanks in advance!