We run software that uses DRBD (9.2.0) for HA in a production environment, and our backup machine (SUSE Linux Enterprise Server 15 SP4) suddenly crashed one morning last week.
Since then the server only stays online when DRBD is not active.
When we activate DRBD it syncs up to 61% and then crashes again.
The support for the main application is not very helpful at the moment, and it seems we have to dig deep to solve this ourselves.
I don't expect anyone to come here, spend time and solve our problem (feel free though if you can see the matrix and this is a piece of cake to you :D).
I'm more looking for directions and resources on how to solve this properly without running down too many rabbit holes (time is of the essence).
Here are some chronological log entries from machine-a, from the last kernel crash, which could be interesting.
"PingAck did not arrive in time" comes up 3-5 times.
I am far from being an expert and I did not go through the whole log, but "PingAck did not arrive in time" sounds an awful lot like the network link between your nodes is not healthy. Maybe that is a starting point for your investigation.
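To get a quick sense of how often the link is dropping, you could count the PingAck timeouts in the kernel log. A minimal sketch; the sample log lines below are made up for illustration, on the real system you would point grep at the actual kernel log instead:

```shell
# Fabricated sample log for illustration; on the real node grep the actual
# kernel log (e.g. /var/log/messages, or `journalctl -k`)
cat > /tmp/drbd-sample.log <<'EOF'
kernel: drbd res0 machine-b: PingAck did not arrive in time.
kernel: drbd res0 machine-b: conn( Connected -> NetworkFailure )
kernel: drbd res0 machine-b: PingAck did not arrive in time.
EOF

# Count the timeouts; a steadily growing count over time points at the
# network link rather than at DRBD itself
grep -c 'PingAck did not arrive in time' /tmp/drbd-sample.log
# prints: 2
```

Correlating the timestamps of these entries with switch or NIC error counters would tell you whether the link was actually degraded at those moments.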
How are you?! I have a few questions:
1- What is the type of connection between the two servers?
2- Do you have a sketch to understand the layout?
3- What is the connection speed between these servers?
What happens if you run "ethtool xxx" (name of the network interface) on both?
4- Could you share the drbd.conf of these servers?
5- Could you share the corosync.conf of these servers?
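For question 3, the raw error and drop counters on the replication interface are often more telling than the negotiated speed alone. A hedged sketch of what to look at; "lo" is only a stand-in here so the commands run anywhere, substitute the interface that actually carries the DRBD traffic:

```shell
# "lo" is a placeholder; replace it with the DRBD replication interface
# (e.g. eth1, bond0) on each node
IFACE=lo

# Error and drop counters straight from sysfs; nonzero, growing values on the
# replication link would match the PingAck timeouts in the kernel log
for c in rx_errors tx_errors rx_dropped tx_dropped; do
    printf '%s: %s\n' "$c" "$(cat /sys/class/net/$IFACE/statistics/$c)"
done
```

Running this a few times during a resync and watching whether the counters grow is a cheap first check before digging into switch logs.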
What you are seeing here may be related to a recent regression that has now been fixed in the most recent release candidates for 9.1.21 and 9.2.10.
Fix an out-of-bounds access when scanning the bitmap. It leads to a
crash when the bitmap ends on a page boundary, and this is also a
regression in 9.1.20.
The expected result of the bug that has now been fixed in the release candidate would be a page fault, and we can see that in your logs:
BUG: unable to handle page fault for address: 0000100000000008
The final releases with the fix are expected next week, but if this is the issue you may benefit from updating to the release candidate now.
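If you want to confirm which build you are running before deciding, you can compare the loaded module version against the first fixed release. A small sketch with the running version hard-coded as an assumption; on the real host you would take it from `modinfo drbd` or `cat /proc/drbd`:

```shell
# Assumed value for illustration; on the real node use something like:
#   modinfo drbd | awk '/^version:/ {print $2}'
running="9.2.9"
fixed="9.2.10"

# sort -V compares version strings component-wise; if the running version
# sorts before the fixed one, the regression fix is not yet installed
if [ "$running" != "$fixed" ] && \
   [ "$(printf '%s\n' "$running" "$fixed" | sort -V | head -n1)" = "$running" ]; then
    echo "update needed"
else
    echo "fix already present"
fi
```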
Hi robin-checkmk, thank you for your reply
We looked at this one, but at the moment it doesn't seem to be a problem there.
I found some more information on this; it can lead to bigger problems, but it is not our priority yet.
This is very interesting information! We will discuss it on Monday.
How can it be that the machines in the cluster ran for years without this error and now we can't get past it?
As far as I understand it, this error should happen when the bitmap structure is synchronised, not the data itself?
The regression was recent, so assuming you are performing regular updates to your systems including DRBD, this would have only become an issue now as features/fixes have been written that inadvertently impacted that logic.
This happens when the bitmap is scanned. That scan occurs during a resync operation like the one shown in your logs, which is why it seems likely you are impacted by this bug. But thankfully, as mentioned before, a fix is already out in the latest release candidate, so if that is what we are seeing, it would no longer be a problem after the update.
Hi there! Apologies for my delay in response; right now I'm in Argentina, GMT-3 time zone.
THANKS for sharing! Here are my ideas/suggestions:
Can you try adding the following to the net section of the affected resource and retry the resync? What happens? (Note that timeout and ping-timeout are given in tenths of a second, while ping-int and connect-int are in seconds.)
net {
    timeout 90;
    ping-timeout 20;
    ping-int 15;
    connect-int 10;
}
IN CASE the issue continues:
What is the MTU configured on the network interfaces used by DRBD?
Can you share the output of "drbdadm status", please?
Can you share the corosync.log, please?
Can you execute "crm configure show > cib-export" and share the cib-export file?
Can you share what you see when executing "crm_mon"?
Can you please confirm that both servers are time-synchronized?
If you can, please run supportconfig and send us the file so we can do a deeper check.
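The collection steps above could be scripted roughly like this. The command names (drbdadm, crm, crm_mon, timedatectl) are the standard DRBD/Pacemaker tools on SLES, but the output paths are placeholders, and each step is allowed to fail so the archive is produced even if a tool is missing on the node:

```shell
# Rough collection sketch; every command tolerates failure so the script
# still produces an archive on a partially broken node
out=/tmp/drbd-report
mkdir -p "$out"

drbdadm status      > "$out/drbd-status.txt"  2>&1 || true  # DRBD resource state
crm configure show  > "$out/cib-export"       2>&1 || true  # cluster configuration
crm_mon -1          > "$out/crm_mon.txt"      2>&1 || true  # one-shot cluster status
timedatectl         > "$out/timedatectl.txt"  2>&1 || true  # time sync state

tar czf /tmp/drbd-report.tgz -C /tmp drbd-report
echo "wrote /tmp/drbd-report.tgz"
```

Run it on both nodes and attach the two archives; supportconfig gathers far more, but this covers the specific items asked for above.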
Hi all,
sorry for not responding these last weeks.
We are still working on the problem, and it is still quite a journey.
I'm on vacation soon; when I get back and the problem is hopefully solved, I'll give you an update.
Hi, friend! I have a similar problem. Could you tell me at what point the stack started working normally? Also, please let me know the versions of the packages you are currently running the cluster on.
Thank you!!!