DRBD processes blocked on RHEL 8.10 with kmod DRBD 9.1.23

(I have also reported this on GitHub here.)

Hi, we are using the following DRBD kmod:

Filename: /lib/modules/4.18.0-553.27.1.el8_lustre.x86_64/extra/drbd90/drbd.ko
alias: block-major-147-*
license: GPL
version: 9.1.23
description: drbd - Distributed Replicated Block Device v9.1.23
author: Philipp Reisner phil@linbit.com, Lars Ellenberg lars@linbit.com
rhelversion: 8.10
srcversion: 71B0FFDA4762D87BA60C54C
depends: libcrc32c
name: drbd
vermagic: 4.18.0-553.27.1.el8_lustre.x86_64 SMP mod_unload modversions
parm: enable_faults:int
parm: fault_rate:int
parm: fault_count:int
parm: fault_devs:int
parm: disable_sendpage:bool
parm: allow_oos:DONT USE! (bool)
parm: minor_count:Approximate number of drbd devices (1U-255U) (uint)
parm: usermode_helper:string
parm: protocol_version_min:
Reject DRBD dialects older than this.
Supported: DRBD 8 [86-101]; DRBD 9 [118-121].
Default: 86 (drbd_protocol_version)

to replicate Lustre v2.15.6 metadata partitions to a secondary MDS server for metadata redundancy.

sdb 8:16 0 17.5T 0 disk
├─sdb1 8:17 0 1.8T 0 part
│ └─drbd0 147:0 0 1.8T 0 disk /lustre/mgs
├─sdb2 8:18 0 3.7T 0 part
│ └─drbd1 147:1 0 3.7T 0 disk /lustre/mdt0
└─sdb3 8:19 0 3.7T 0 part
└─drbd2 147:2 0 3.7T 0 disk /lustre/mdt1

We have compiled these versions from source against the patched Lustre kernel 4.18.0-553.27.1.el8_lustre.x86_64 for RHEL 8, with no other modifications or exotic patches in the kernel. drbd-utils is at 9.31.0-1.el8, also compiled from source. We compiled from source because the problem I am about to describe also exists in the latest ELRepo release, which exhibits exactly the same problematic behaviour (kmod-drbd90.x86_64 9.1.23-1.el8_10.elrepo, elrepo). Compiling from source, however, did not resolve the problem.

WHAT IS THE PROBLEM:
At intervals ranging from 7 to 20 days of continuous DRBD-backed MGS/MDT operation, DRBD-related processes suddenly become blocked. It has happened many times now, so it is fairly reproducible, although the frequency of occurrence is not constant. The exact error messages from the system journal are the following:

Jul 17 07:24:19 lustrea-mds-nx10077836-an.int.met.no kernel: INFO: task ldlm_cn19_000:17761 blocked for more than 120 seconds.
Jul 17 07:22:16 lustrea-mds-nx10077836-an.int.met.no kernel: INFO: task mdt21_011:19591 blocked for more than 120 seconds.
Jul 17 07:22:16 lustrea-mds-nx10077836-an.int.met.no kernel: INFO: task mdt21_010:19584 blocked for more than 120 seconds.
Jul 17 07:22:16 lustrea-mds-nx10077836-an.int.met.no kernel: INFO: task mdt21_009:19574 blocked for more than 120 seconds.
Jul 17 07:22:16 lustrea-mds-nx10077836-an.int.met.no kernel: INFO: task mdt21_008:19564 blocked for more than 120 seconds.
Jul 17 07:22:16 lustrea-mds-nx10077836-an.int.met.no kernel: INFO: task jbd2/drbd1-8:17843 blocked for more than 120 seconds.
Jul 17 07:22:16 lustrea-mds-nx10077836-an.int.met.no kernel: INFO: task ldlm_cn19_002:17763 blocked for more than 120 seconds.
Jul 17 07:22:16 lustrea-mds-nx10077836-an.int.met.no kernel: INFO: task ldlm_cn19_000:17761 blocked for more than 120 seconds.

At that point, the DRBD processes remain blocked until we power-cycle the server. It is not possible to cleanly unmount the metadata target on the blocked drbd1 device (mdt0). We can unmount the mgs (drbd0) and mdt1 (drbd2); hence the need to power-cycle the server.
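Before power-cycling, we now also try to snapshot the DRBD side of the picture. A sketch using only standard drbd-utils commands, guarded so it degrades gracefully on a machine without drbd-utils installed:

```shell
#!/bin/sh
# Sketch: snapshot DRBD state before a forced power-cycle.
# Plain drbd-utils commands; the file name is just a convention.

OUT=/tmp/drbd-hang-$(date +%Y%m%d-%H%M%S).txt

if command -v drbdadm >/dev/null 2>&1; then
    {
        echo "== drbdadm status =="
        drbdadm status
        echo "== drbdsetup status --verbose --statistics =="
        drbdsetup status --verbose --statistics
    } > "$OUT" 2>&1
else
    echo "drbd-utils not installed" > "$OUT"
fi
echo "state saved to $OUT"
```

The `--statistics` output in particular shows per-peer congestion and unacked/pending counters, which should reveal whether replication to the peer stalled before the local tasks blocked.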

Our config files are given below:

cat /etc/drbd.d/global_common.conf
global {
        usage-count yes;
        # minor-count dialog-refresh disable-ip-verification
        # cmd-timeout-short 5; cmd-timeout-medium 121; cmd-timeout-long 600;
}

common {
        handlers {
                # These are EXAMPLE handlers only.
                # They may have severe implications,
                # like hard resetting the node under certain circumstances.
                # Be careful when chosing your poison.

                # pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
                pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
                local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
                # fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
                 split-brain "/usr/lib/drbd/notify-split-brain.sh root";
                 out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";
                # before-resync-target "/usr/lib/drbd/snapshot-resync-target-lvm.sh -p 15 -- -c 16k";
                # after-resync-target /usr/lib/drbd/unsnapshot-resync-target-lvm.sh;
        }

        startup {
                wfc-timeout 60; 
                degr-wfc-timeout 30;
                outdated-wfc-timeout 30; 
                # wait-after-sb-
        }

        options {
                # cpu-mask on-no-data-accessible
        }


        disk {
                # size on-io-error fencing disk-barrier disk-#flushes
                # disk-drain md-flushes
                # resync-rate 3830M;
                # resync-after al-#extents
                # c-plan-ahead c-delay-target c-fill-target 
                c-max-rate 3950M;
                c-min-rate 4M; 
                c-plan-ahead 22;
		# disk-timeout
        }




#net {
#        # max-epoch-size          20000;
#        max-buffers             36k;
#        sndbuf-size            1024k ;
#        rcvbuf-size            2048k;
#}


        net {
                # protocol timeout max-epoch-size max-buffers unplug-watermark
                # connect-int ping-int sndbuf-size rcvbuf-size ko-count
                # allow-two-primaries 
                cram-hmac-alg "sha1"; 
                shared-secret "ZZZZZZZ";
                after-sb-0pri discard-least-changes;
                after-sb-1pri discard-secondary;
                after-sb-2pri call-pri-lost-after-sb;
                #always-asbp 
                rr-conflict call-pri-lost;
                # ping-timeout data-integrity-alg tcp-cork on-congestion
                max-epoch-size          20000;
                max-buffers      	131072;
                # congestion-fill congestion-extents csums-alg 
                verify-alg md5;
                # use-rle
        }
}



####################
cat /etc/drbd.d/clusterdb.res 
resource mdt0 {
 on lustrea-mds-nx10077836-an.int.met.no {
 device /dev/drbd1;
 disk /dev/sdb2;
# address 10.8.0.2:7777;
 address 192.168.160.1:7777;
 meta-disk internal;
 }


 on lustrea-mds-nx10077839-am.int.met.no {
 device /dev/drbd1;
 disk /dev/sdb2;
# address 10.8.0.1:7777;
 address 192.168.160.2:7777;
 meta-disk internal;
 }

}

resource mdt1 {
 on lustrea-mds-nx10077836-an.int.met.no {
 device /dev/drbd2;
 disk /dev/sdb3;
# address 10.8.0.2:7788;
 address 192.168.160.1:7788;
 meta-disk internal;
 }


 on lustrea-mds-nx10077839-am.int.met.no {
 device /dev/drbd2;
 disk /dev/sdb3;
# address 10.8.0.1:7788;
 address 192.168.160.2:7788;
 meta-disk internal;
 }

}

resource mgs {
 on lustrea-mds-nx10077836-an.int.met.no {
 device /dev/drbd0;
 disk /dev/sdb1;
# address 10.8.0.2:7789;
 address 192.168.160.1:7789;
 meta-disk internal;
 }


 on lustrea-mds-nx10077839-am.int.met.no {
 device /dev/drbd0;
 disk /dev/sdb1;
# address 10.8.0.1:7789;
 address 192.168.160.2:7789;
 meta-disk internal;
 }

}
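For completeness, the effective runtime options can be compared against these files; a sketch using plain drbd-utils commands, with the resource name mdt0 taken from the config above and a guard for machines without drbd-utils:

```shell
#!/bin/sh
# Sketch: compare the on-disk configuration with what the kernel
# module is actually running with, for one resource.

if command -v drbdadm >/dev/null 2>&1; then
    # Parsed view of /etc/drbd.d/*.res plus global_common.conf:
    drbdadm dump mdt0
    # Options the kernel module is actually using right now,
    # including defaults not set explicitly in the config files:
    drbdsetup show mdt0 --show-defaults
else
    echo "drbd-utils not installed; run this on a DRBD node"
fi
```

This rules out the case where an edited config file was never applied with `drbdadm adjust`, so the module is still running older settings than the files suggest.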

We have ruled out hardware and network issues: the behaviour occurs on different server hardware and network equipment, all of which has been checked thoroughly with diagnostics and is running the latest firmware. We would therefore like help pinpointing and fixing the issue on RHEL 8.10. We cannot upgrade to RHEL 9 yet to try kmod DRBD v9.2, so we have to stay on 9.1.x.

Thanks for any help and pointers to fix this.

Best regards,
GM