Drdb kickstart troubleshooting

Hi folks,

we run a software which uses drdb (9.2.0) for HA in a productive, professional environment and our backup machine (SUSE Linux Enterprise Server 15 SP4) suddenly crashed one morning last week.

Since then the server only stays online, when drdb is not active.
When we active drdb it syncs up to 61% and then crashes again.

The support for the main application is not so helpful at the moment and it seems we have to dig down deep to solve.
I dont except anyone to come here, spend time and solve our problem (feel free tho if you can see the matrix and this is a piece of pie to you :D).
I’m more looking for directions and resources on how to solve this in a proper way and dont run down too many rabbit holes (time is of matter).

Here are some chronological log entries from machine-a which could be interesting from the last kernel crash.

“PingAck did not arrive in time” comes up 3-5 times.

drbd opt machine-b: Handshake to peer 0 successful: Agreed network protocol version 121

drbd opt machine-b: Feature flags enabled on protocol level: 0x1f TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES RESYNC_DAGTAG
drbd opt/0 drbd1 machine-b: self 6EDE8CE3E93A46CC:0000000000000000:5F9FA27419C8245C:7C6DF08A9A54E22E bits:6253779 flags:24
drbd opt/0 drbd1 machine-b: peer D089FC1EDA57430D:6EDE8CE3E93A46CD:9870D4109E88E9A8:EE0D5D92B6553980 bits:223296721 flags:1120


drbd opt/0 drbd1 machine-b: helper command: /sbin/drbdadm before-resync-target
drbd opt/0 drbd1 machine-b: helper command: /sbin/drbdadm before-resync-target exit code 0
drbd opt/0 drbd1 machine-b: repl( WFBitMapT -> SyncTarget )
drbd opt/0 drbd1 machine-b: Began resync as SyncTarget (will sync 893738976 KB [223434744 bits set]).
drbd opt machine-b: PingAck did not arrive in time.
drbd opt machine-b: conn( Connected -> NetworkFailure ) peer( Primary -> Unknown )
drbd opt/0 drbd1 machine-b: pdsk( UpToDate -> DUnknown ) repl( SyncTarget -> Off )
drbd opt machine-b: error receiving P_RS_DATA_REPLY, e: -5 l: 262144!
drbd opt machine-b: Terminating sender thread
drbd opt machine-b: Starting sender thread (from drbd_r_opt [645])
drbd opt machine-b: Connection closed
drbd opt machine-b: helper command: /sbin/drbdadm disconnected
drbd opt machine-b: helper command: /sbin/drbdadm disconnected exit code 0
drbd opt machine-b: conn( NetworkFailure -> Unconnected )
drbd opt machine-b: Restarting receiver thread
drbd opt machine-b: conn( Unconnected -> Connecting )
drbd opt machine-b: Handshake to peer 0 successful: Agreed network protocol version 121


drbd opt machine-b: Committing remote state change 417992716 (primary_nodes=1)
drbd opt machine-b: conn( Connecting -> Connected ) peer( Unknown -> Primary )
drbd opt/0 drbd1 machine-b: pdsk( DUnknown -> UpToDate ) repl( Off -> WFBitMapT )
drbd opt/0 drbd1 machine-b: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 76861(19), total 76861; compression: 100.0%
drbd opt/0 drbd1 machine-b: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 76861(19), total 76861; compression: 100.0%
drbd opt/0 drbd1 machine-b: helper command: /sbin/drbdadm before-resync-target
drbd opt/0 drbd1 machine-b: helper command: /sbin/drbdadm before-resync-target exit code 0
drbd opt/0 drbd1 machine-b: repl( WFBitMapT -> SyncTarget )
drbd opt/0 drbd1 machine-b: Began resync as SyncTarget (will sync 345303048 KB [86325762 bits set]).
drbd opt machine-b: ASSERTION FAILED: pp_in_use: -53 < 0
drbd opt machine-b: ASSERTION FAILED: pp_in_use: -117 < 0
drbd opt machine-b: ASSERTION FAILED: pp_in_use: -181 < 0
drbd opt machine-b: ASSERTION FAILED: pp_in_use: -181 < 0
drbd opt machine-b: ASSERTION FAILED: pp_in_use: -117 < 0
drbd opt machine-b: ASSERTION FAILED: pp_in_use: -17 < 0
drbd opt machine-b: ASSERTION FAILED: pp_in_use: -81 < 0
drbd opt machine-b: ASSERTION FAILED: pp_in_use: -145 < 0
drbd opt machine-b: ASSERTION FAILED: pp_in_use: -181 < 0
drbd opt machine-b: ASSERTION FAILED: pp_in_use: -181 < 0
BUG: unable to handle page fault for address: 0000100000000008
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0 
PREEMPT SMP PTI

Have a great day :wave:

1 Like

I am far from being an expert and I did not go through the whole log, but PingAck did not arrive in time sounds an awful lot, like the network link between your nodes is not healthy. Maybe that is a starting point for your investigation. :slight_smile:

How are you?!. I got a few questions:
1-What is the type of connection between both servers?
2-Do you have any sketch to understand the layout?
3-Connection speed between these servers?
What happens if you run “ethtool xxx” (name of the network interface) on both?
4-Possibility of sharing the DRBD.conf of these servers?
5-Possibility of sharing the Cosorync.conf of these servers?

Have a great one!.
Best regards from Argentina

What you are seeing here may be related to a recent regression that has now been fixed in the most recent release candidates for 9.1.21 and 9.2.10.

  • Fix an out-of-bounds access when scanning the bitmap. It leads to a
    crash when the bitmap ends on a page boundary, and this is also a
    regression in 9.1.20.

The expected result of the bug that has now been fixed in the release candidate would be a page fault, and we can see that in your logs:

BUG: unable to handle page fault for address: 0000100000000008

The final releases with the fix are expected next week, but if this is the issue you may benefit from updating to the release candidate now.

Hi robin-checkmk, thank you for your reply :slight_smile:
We looked at this one but at the moment it doesnt seem to be a problem there.
I found some more information on this and it can lead to bigger problems but it is not our priority yet.

1 Like

Hi wlucero, thank you for your reply :slight_smile:
I’m good so far except for this annoying situation :wink: And you?

  1. LAN in different sites
  2. You mean the network layout? I could dig around and find it but I cant post it here :wink:
  3. 10GB

drbd global_common.conf

global {
  # don't send message back to linbit
  usage-count no;
}
common {
  net {
    protocol C;
  }
}

drbd opt.conf

resource opt {
  device    /dev/drbd1;
  disk      /dev/mapper/systemvg-foolv;
  meta-disk /dev/systemvg/foometalv;
  net {
    max-buffers 128K;
    max-epoch-size 16K;
    sndbuf-size 0;
    csums-alg sha1;
    verify-alg sha1;
    after-sb-0pri discard-least-changes;
    fencing resource-only;
  }
  disk {
    c-plan-ahead 0;
    c-fill-target 512K;
    resync-rate 1071M;

  }
  handlers {
    fence-peer "/usr/lib/drbd/crm-fence-peer.9.sh";
    unfence-peer "/usr/lib/drbd/crm-unfence-peer.9.sh";
  }
  on machine-a {
    address ipv4 192.168.0.110:7789;
  }
  on machine-b {
    address ipv4 192.168.0.111:7789;
  }
}

corosync.conf

totem {
 version: 2
 secauth: off
 cluster_name: cluster
 transport: knet
 rrp_mode: passive
 token: 5000
 ip_version: ipv4

}
nodelist {
 node {
 name: machine-a
 ring0_addr: 192.168.0.10
 ring1_addr: 192.168.0.110
 nodeid: 1
 }
 node {
 name: machine-b
 ring0_addr: 192.168.0.11
 ring1_addr: 192.168.0.111
 nodeid: 2
 }
}
quorum {
 provider: corosync_votequorum
 two_node: 1
}
logging {
 to_logfile: yes
 logfile: /var/log/cluster/corosync.log
 to_syslog: yes
}
1 Like

Hi liniac, thank you for your reply :slight_smile:

This is very interesting information! We will discuss it on Monday.

How could it be that the machines in the cluster run for years without this error and then we cant get over it?
As far as I understand it this error should happen, when the bitmap structure is synchronised and not the data itself?

1 Like

The regression was recent, so assuming you are performing regular updates to your systems including DRBD, this would have only become an issue now as features/fixes have been written that inadvertently impacted that logic.

This is when the bitmap is scanned, during a resync operation like what is shown in the logs this would occur so that is why it seems likely you are impacted by this bug. But thankfully, as mentioned prior, this has a fix already out in the latest release candidate, so if that’s what we are seeing, it would no longer be a problem after the update.

Hi there!. Apologies for my delay in response, right now I’m in Argentina GMT-3 time zone :slight_smile:
THANKS for sharing!. Here is my ideas-suggestions:

  1. Can you try adding the following to the net section of the affected resource and try resync?, what happend?:
    net {
    timeout 90;
    ping-timeout 20;
    ping-int 15;
    connect-int 10;
    }
    IN THE CASE the issue continue:
  2. What is the MTU configure in the network interfaces used by DRBD?
  3. Can you share the output of ‘drbdadm status’ pls?
  4. Can you share the “corosync.log” pls?.
  5. Can you execute 'crm configure show > cib-export" and share the cib-export file?
  6. Can you share what do you see executing ‘crm_mon’
  7. Can you please confirm that both servers are time-synchronized?
  8. If you can, pls execute the supportconfig and send us the file for do a deeper check.

Have a great Friday!.
Best regards.,

Hi all,
sorry for not responding the last weeks.
We are still working on the problem and it still is quite a journey.
I’m on vacation soon and when I get back and the problem is hopefully solved I’ll give you an update.

Thanks everyone for their support! :slight_smile: