INFO: task drbd_r_omd:1957 blocked for more than 120 seconds

Hey everyone!

I have this situation: a Linux cluster (Debian 12, DRBD 9.22) running DRBD is giving me the following error message, and the Linux host becomes unresponsive afterward. This affects several different hardware setups, potentially with different network layouts as well.

[Mon Feb  5 09:11:21 2024] INFO: task drbd_r_omd:1957 blocked for more than 120 seconds.
[Mon Feb  5 09:11:21 2024]       Tainted: G           OE     4.19.0-24-amd64 #1 Debian 4.19.282-1
[Mon Feb  5 09:11:21 2024] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Mon Feb  5 09:11:21 2024] drbd_r_omd      D    0  1957      2 0x80000004
[Mon Feb  5 09:11:21 2024] Call Trace:
[Mon Feb  5 09:11:21 2024]  __schedule+0x29f/0x840
[Mon Feb  5 09:11:21 2024]  schedule+0x28/0x80
[Mon Feb  5 09:11:21 2024]  io_schedule+0x12/0x40
[Mon Feb  5 09:11:21 2024]  wbt_wait+0x19b/0x300
[Mon Feb  5 09:11:21 2024]  ? trace_event_raw_event_wbt_step+0x120/0x120
[Mon Feb  5 09:11:21 2024]  rq_qos_throttle+0x31/0x40
[Mon Feb  5 09:11:21 2024]  blk_mq_make_request+0x111/0x530
[Mon Feb  5 09:11:21 2024]  generic_make_request+0x1a4/0x400
[Mon Feb  5 09:11:21 2024]  ? md_handle_request+0x119/0x190 [md_mod]
[Mon Feb  5 09:11:21 2024]  submit_bio+0x45/0x130
[Mon Feb  5 09:11:21 2024]  ? md_super_write.part.63+0x90/0x120 [md_mod]
[Mon Feb  5 09:11:21 2024]  write_page+0x203/0x330 [md_mod]
[Mon Feb  5 09:11:21 2024]  ? md_bitmap_wait_writes+0x93/0xa0 [md_mod]
[Mon Feb  5 09:11:21 2024]  ? aa_sk_perm+0x31/0x120
[Mon Feb  5 09:11:21 2024]  md_bitmap_unplug+0xa1/0x120 [md_mod]
[Mon Feb  5 09:11:21 2024]  flush_bio_list+0x1c/0xd0 [raid1]
[Mon Feb  5 09:11:21 2024]  raid1_unplug+0xb9/0xd0 [raid1]
[Mon Feb  5 09:11:21 2024]  blk_flush_plug_list+0xcf/0x240
[Mon Feb  5 09:11:21 2024]  blk_finish_plug+0x21/0x30
[Mon Feb  5 09:11:21 2024]  drbd_recv_header_maybe_unplug+0x101/0x120 [drbd]
[Mon Feb  5 09:11:21 2024]  drbd_receiver+0x146/0x2e5 [drbd]
[Mon Feb  5 09:11:21 2024]  ? drbd_destroy_connection+0xb0/0xb0 [drbd]
[Mon Feb  5 09:11:21 2024]  drbd_thread_setup+0x71/0x130 [drbd]
[Mon Feb  5 09:11:21 2024]  kthread+0x112/0x130
[Mon Feb  5 09:11:21 2024]  ? kthread_bind+0x30/0x30
[Mon Feb  5 09:11:21 2024]  ret_from_fork+0x1f/0x40

I did some research and Red Hat says:

It’s necessary to engage Linbit (the DRBD vendor) to analyze the IO bottleneck in the block device, which is reaching the saturation limit.

I continued my research, but the bottom line seems to be: you have an IO bottleneck and need to fix it.
The problem is that this runs on several different hardware configurations, so I cannot easily upgrade the hardware, and I do not know how much throughput the network provides. So I need a way to fine-tune my DRBD configuration. There is an old entry in the Thomas Krenn Wiki (German) that provides some ideas, but its advice on changing the IO scheduler seems outdated, and I am not yet sure whether I want to touch sysctl options affecting the memory stack.
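
Since I do not know what the replication network actually delivers, my plan is to at least measure it first. A minimal check between the two nodes, assuming iperf3 is installed (the peer address below is a placeholder):

# on the peer node: start an iperf3 server
iperf3 -s

# on this node: measure throughput over the replication link for 30 seconds
# (10.0.0.2 stands in for the peer's replication IP)
iperf3 -c 10.0.0.2 -t 30

That should at least tell me whether the network or the disks are the first suspect.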

So right now I am leaning towards tweaking my DRBD configuration. This is my current configuration:

resource omd {
  disk {
    resync-rate   33M;
    c-fill-target 1M;
    c-min-rate    35M;
    c-max-rate    110M;
  }
  net {
    max-buffers    8000;
    max-epoch-size 8000;
    sndbuf-size    1M;
    rcvbuf-size    1M;

    verify-alg crc32c;
  }
}

Maybe someone here can point me to where I might be able to tweak things. I am looking at ko-count, but could only find a KB article for DRBD Proxy (see the first comment, as I am a new user and can only include two links in this post :upside_down_face:), and I am unsure if/how that applies to DRBD itself. Upon re-reading it and looking at my logs, I am becoming unsure whether this value will actually help.
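
For what it is worth, the drbd.conf man page lists ko-count as a regular net option as well, so I assume it would go into the net section like this (the value is purely illustrative, not a recommendation):

net {
  # if the peer fails to complete a single write request for
  # ko-count times the timeout, it is expelled from the cluster
  ko-count 7;
}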

I appreciate any input in this matter, thanks in advance!

KB article for DRBD Proxy

Hello,

There are definitely some tunings that can be done for DRBD, but I'll echo what is stated in a knowledge base article I recommend for your situation (Troubleshooting DRBD Performance Issues): DRBD can only run as fast as the slowest system component in terms of bandwidth and latency. So figuring out where the bottleneck is in your system will be the most productive first step towards better performance. I would recommend you go through the steps in that KB article to get a feel for where you stand on that front.
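
As a rough illustration of the kind of baseline that article is after: run something like the following fio job against the raw backing device, then repeat it against the DRBD device to see how much overhead replication adds. The device path is a placeholder, and this write test is destructive, so only point it at scratch space:

# sequential write baseline of the raw backing device (DESTRUCTIVE)
fio --name=baseline --filename=/dev/sdX --rw=write --bs=4k \
    --ioengine=libaio --iodepth=16 --direct=1 \
    --runtime=30 --time_based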

I'd also recommend upgrading to the latest version of DRBD if you haven't already (I saw in a different thread that you were actually using the 8.4 version of DRBD with the 9.22 utils); a lot of development has gone into performance optimization over the past few years, as well as into features such as TCP load balancing, which can help if you are limited by your network speed, since you'll be able to balance the load across multiple TCP sockets: Load Balanced Replication with DRBD - LINBIT
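
As a sketch of what that load balancing configuration looks like (this needs a recent DRBD 9.2 release; the host names and addresses are placeholders, and the option name is taken from the linked post, so verify it against your version's documentation):

resource omd {
  net {
    load-balance-paths yes;  # stripe replication traffic across all paths
  }
  connection {
    path {
      host alpha address 192.168.1.1:7789;
      host bravo address 192.168.1.2:7789;
    }
    path {
      host alpha address 192.168.2.1:7789;
      host bravo address 192.168.2.2:7789;
    }
  }
}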

The information you've found so far does indeed seem to be quite old (it references DRBD 0.7x) and may not be applicable to modern systems. I'll suggest this KB article, which covers tunings you can apply to DRBD's resync controller, with explanations of what each setting does and why: Tuning DRBD's Resync Controller
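
To give a flavor of the knobs that article covers (the values below are illustrative, not recommendations; note that with a non-zero c-plan-ahead the dynamic resync controller is active and the static resync-rate is ignored):

disk {
  c-plan-ahead  20;    # non-zero enables the dynamic resync controller
  c-fill-target 1M;    # resync data to keep in flight on the wire
  c-min-rate    35M;   # resync is never throttled below this
  c-max-rate    110M;  # resync never exceeds this
}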

Knowledge of (and potentially benchmarking) your hardware will be important for understanding how to tweak these settings, as you otherwise run the risk of resync traffic competing with your application I/O. That's another reason why I'd suggest going through the first article I linked before going down this path; once you have, you will be armed with the information needed to make the best decisions for your DRBD configuration.


Thanks for the extensive and elaborate answer! :pray:

I was afraid there were no silver bullets, but thanks for confirming that my Plan B is actually Plan A. :wink:

We use the kernel modules (8.4.11) delivered with Debian 12 and kernel 6.1.90.
Is there a recommendation on whether and how one should upgrade the kernel module?
I would like to stick to the distribution defaults, but if there is a well-known, widely used way, I might consider it.

LINBIT provides binary packages of the latest versions to commercial support customers, and maintains package repositories that let clients upgrade via their preferred package managers. You can find more information about that here: https://linbit.com/drbd-user-guide/drbd-guide-9_0-en/#s-linbit-packages

If you'd like to do this yourself instead, building from source is always an option, either by checking out the sources from the public Git repository or by using the source tar files from https://pkg.linbit.com/. Information to help you get started on that is here: https://linbit.com/drbd-user-guide/drbd-guide-9_0-en/#s-from-source
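
A minimal sketch of the source route for the kernel module, assuming build tools and headers for the running kernel are installed (drbd-utils lives in a separate repository):

git clone --recursive https://github.com/LINBIT/drbd.git
cd drbd
make                 # builds against the running kernel's headers
sudo make install    # installs the drbd kernel module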

Which version of the prebuilt DRBD binaries a distribution provides varies, and it may lag behind DRBD release cycles. But if other distributions are an option for you, Ubuntu LTS offers a PPA repository for DRBD 9: LINBIT DRBD9 Stack : “LINBIT” team
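
On Ubuntu that boils down to something like the following (package names as found in the PPA; the DKMS package rebuilds the module for your kernel):

sudo add-apt-repository ppa:linbit/linbit-drbd9-stack
sudo apt update
sudo apt install drbd-dkms drbd-utils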


Thanks again for the extensive information! :pray:
I think that concludes this thread for me for now. :white_check_mark:

I would mark the first reply as the solution to this thread, but I can’t. Not sure if that is a permission thing, or if you are just not using this feature.

Hello Robin,

what EXACTLY was the solution?

Can you tell me/us?

Jörg

Hi Jörg,

Frankly, we never found a solution in that sense. Currently it looks like the issue was actually only present on Debian 10, and as of now it has not resurfaced on Debian 12. We still have some systems pending, but it looks like the updated DRBD stack has at least helped.

I know that is very unsatisfying, but for now it is all we have. I will update here if any new insights come about. :v: