DRBD + Keepalived notify_* scripts

We like to keep things simple, and Keepalived offers us a straightforward way to create a highly available floating IP address. We use it a lot for things like load balancers and SMTP servers.

Now we would like to add a DRBD resource that should be mounted on the active node. We got it to work in a failover situation with a simple:

  drbdadm primary xvdb
  mount /dev/drbd1 /drbd

in the notify_master script on the secondary node.
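For completeness, those two commands could grow into something like the sketch below. It assumes Keepalived's generic `notify` hook, which is invoked as `<script> <"GROUP"|"INSTANCE"> <name> <state>`; the demote branch is our own assumption about what the reverse transition should do, not something from the original setup.

```shell
#!/bin/sh
# Sketch of a keepalived notify script (hypothetical; the resource,
# device and mount point are the ones from the commands above).
RES="xvdb"
DEV="/dev/drbd1"
MNT="/drbd"

# Promote this node and mount the DRBD device.
promote_and_mount() {
  drbdadm primary "$RES" || return 1
  mountpoint -q "$MNT" || mount "$DEV" "$MNT"
}

# Reverse transition: unmount and demote (assumed, not from the post).
demote_and_unmount() {
  mountpoint -q "$MNT" && umount "$MNT"
  drbdadm secondary "$RES"
}

# Keepalived passes the new state as the third argument.
case "${3:-}" in
  MASTER)       promote_and_mount ;;
  BACKUP|FAULT) demote_and_unmount ;;
esac
```

Wired up in `keepalived.conf` via `notify /path/to/this/script.sh` on both nodes.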

What we haven’t yet figured out is how to fail back gracefully when the primary comes back online. It seems like we’ll need a bit of coordination between the nodes to get the sequence right:

  1. Wait for DRBD to resync
  2. Unmount DRBD on backup node
  3. Fail over IP and mount DRBD on primary node
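Step 1 could be scripted by polling `drbdadm status`. A minimal sketch, assuming DRBD 9’s status output, where the local disk state appears as `disk:UpToDate` (as opposed to the peer’s `peer-disk:` field):

```shell
#!/bin/sh
# Sketch: wait for a DRBD resource to finish resyncing by polling
# `drbdadm status`. Assumes DRBD 9 output; not a tested production script.

# is_synced STATUS_TEXT -> 0 if the LOCAL disk reports UpToDate.
# The leading space in the pattern avoids matching "peer-disk:UpToDate".
is_synced() {
  case "$1" in
    *" disk:UpToDate"*) return 0 ;;
    *) return 1 ;;
  esac
}

# wait_sync RESOURCE -> block until the resource's local disk is UpToDate.
wait_sync() {
  while ! is_synced "$(drbdadm status "$1")"; do
    sleep 5
  done
}
```

Depending on your drbd-utils version, `drbdadm` also offers wait commands (e.g. `wait-connect`) that may replace a hand-rolled polling loop; check the man page for what your version supports.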

It seems to me that if we could configure a dependency in the keepalived systemd unit so that it doesn’t start until the resync is complete, that should do the trick.
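That dependency might look something like the drop-in below. The drop-in path, the DRBD unit name, and the helper script (which would poll `drbdadm status` until the resource is in sync) are all assumptions to adjust for your distribution, not a known-good recipe:

```
# /etc/systemd/system/keepalived.service.d/wait-drbd.conf (hypothetical)
[Unit]
# Start keepalived only after the DRBD service is up.
After=drbd.service
Requires=drbd.service

[Service]
# Delay startup until the resource has finished resyncing.
# Script path and contents are assumptions, not a shipped helper.
ExecStartPre=/usr/local/bin/wait-drbd-sync.sh xvdb
```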

Has anyone tackled this already, so we don’t have to reinvent the wheel? Or is there really a strong case for using DRBD Reactor and/or LINSTOR instead of, or in addition to, Keepalived?


I don’t think automatic failback is wise. If, for example, a memory issue or some other hardware fault causes a node to reboot repeatedly, automatic failback will cause more problems and downtime than it prevents.

I’d recommend giving Reactor a try; LINBIT (we) designed it specifically to reduce complexity in 3-node cluster setups. There is a tech guide here.

Is there a reason you’re interested in failing back automatically?


Thanks for your reply!

It seems we were trying to solve a problem that didn’t need to be solved. Your suggestion is a good one: when failover happens, don’t fail back. Once the failed server is back online and has resynced, it becomes the new backup.

I downloaded the guide you referenced to see how you recommend setting it up. It wasn’t entirely obvious on first read that this is how it works with drbd-reactor and the OCF resource agent, but I assume so.

I think we can achieve the same result as what you describe by using nopreempt and setting state to BACKUP on both nodes of our Keepalived cluster.
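For reference, such a non-preempting setup might look like the fragment below. Note that keepalived requires state BACKUP for nopreempt to take effect; the interface, VRID, IP address, and script path are placeholders, not values from this thread:

```
# Hypothetical keepalived.conf fragment, identical on both nodes.
vrrp_instance VI_1 {
    state BACKUP          # required for nopreempt; neither node starts as MASTER
    nopreempt             # a recovered node does not reclaim MASTER
    interface eth0        # placeholder NIC
    virtual_router_id 51  # placeholder VRID, must match on both nodes
    priority 100          # may differ per node; nopreempt still applies
    virtual_ipaddress {
        192.0.2.10/24     # placeholder floating IP
    }
    notify /etc/keepalived/drbd-notify.sh   # placeholder notify script path
}
```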