Hi,
During heavy disk I/O, one of our RAID sets experiences a write timeout (reported by the kernel as DID_TIME_OUT). This happens about once a week on our secondary node.
The next message in our logs is from the kernel's blk_update_request, which reports this as an I/O error. DRBD then logs “we had at least one MD IO ERROR during bitmap IO”, fails the disk, and goes into Diskless mode.
If there were a real I/O error, this is exactly what we would want to happen. But this is a timeout, which should be handled differently.
Why does DRBD handle this as an IO error and not a disk timeout? How can we change this behaviour?
DRBD version 9.15
To recover from this we simply run: drbdadm attach
DRBD then resyncs for a bit, and everything works again for a week or so.
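
For illustration, the manual recovery above could be wrapped in a small check, roughly like the sketch below. This is only a sketch under assumptions: the resource name "r0" is a placeholder, the resource is assumed to have a single volume, and `drbdadm dstate` is assumed to print the local disk state first (optionally followed by "/" and the peer state, as older versions do). The DRBDADM variable is a hypothetical override for testing without a real DRBD setup.

```shell
#!/bin/sh
# Sketch only: reattach a DRBD resource whose local disk was detached.
# Assumes a single-volume resource; "r0" in the usage note is a placeholder.

# Succeeds (returns 0) if the disk-state string says the local disk is Diskless.
# Everything after the first "/" (a peer state, if printed) is ignored.
is_diskless() {
    case "${1%%/*}" in
        Diskless) return 0 ;;
        *)        return 1 ;;
    esac
}

# Query the local disk state and run "drbdadm attach" only when it is Diskless.
# DRBDADM is a hypothetical override so the command can be stubbed out.
reattach_if_diskless() {
    res="$1"
    state="$("${DRBDADM:-drbdadm}" dstate "$res")" || return 1
    if is_diskless "$state"; then
        "${DRBDADM:-drbdadm}" attach "$res"
    fi
}
```

Usage would be something like `reattach_if_diskless r0`. This only automates the workaround; it does not change how DRBD classifies the timeout.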
Best regards,
Rami Lehti