Failover to wrong node

I set up a three-node Proxmox cluster (prox1, prox2, prox3). On this cluster, I created a LINSTOR resource group with mirrored replication (place-count=2).
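(For reference, such a resource group can be created along these lines; the resource-group and storage-pool names below are placeholders, not my exact configuration:)

```
# create a resource group with two replicas, plus its volume group
linstor resource-group create rg-mirror --storage-pool pool1 --place-count 2
linstor volume-group create rg-mirror
# the linstor-proxmox storage entry in /etc/pve/storage.cfg then points
# at this group via its "resourcegroup" option
```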

I spun up a new VM on prox2 and indeed, the VM disk was replicated to prox3. Perfect.

But when I simulated a crash of the VM's node, I was very surprised to see that the VM failed over not to prox3, the node holding the replicated disk, but to prox1.

The good thing, of course, is that I again have a mirrored disk, but the failover took much longer because the disk first had to be created. It would have been much more efficient if it had failed over to prox3 first and only then started creating a new replica on prox1.

How can I set this up correctly?

The short version: the plugin has zero influence on where a VM is started or failed over to.

The longer version:
At VM creation time the plugin tells LINSTOR to create a replicated volume of size X with a replica count of C, but where to place the replicas is up to LINSTOR. Actually, there is some code in place that tries to create a DRBD disk on local storage on the node the VM is created on, to make sure (if possible) that the VM has local storage.
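Done by hand, that step is roughly equivalent to spawning a resource from the resource group (the resource name and size below are just examples):

```
# LINSTOR picks the nodes according to the resource group's place-count
linstor resource-group spawn-resources rg-mirror vm-100-disk-1 32G
```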

At VM start the plugin checks whether the node has access to the DRBD disk, and if not, it creates a diskless assignment. When the VM is stopped and it was a diskless assignment, that assignment gets deleted again.
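In terms of the LINSTOR client, this corresponds roughly to the following (node and resource names are only examples):

```
# attach the existing resource without a local backing disk (diskless client)
linstor resource create prox1 vm-100-disk-1 --drbd-diskless
# on VM shutdown, such a diskless assignment is removed again
linstor resource delete prox1 vm-100-disk-1
```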

So the VM was started by Proxmox on whatever node it decided on, prox1 in this case, and the plugin created a diskless assignment there. That happens instantly, and the disk is usable immediately; there is no sync. Side note: even a diskful assignment would be accessible immediately, even if a resync is in progress.
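You can check where the replicas live and which assignments are diskless with something like this (the resource name is a placeholder):

```
linstor resource list
linstor volume list
drbdadm status vm-100-disk-1
```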

In conclusion: there should not be much overhead, because only a diskless assignment is created. Yes, the diskless assignment has to read data over the network, whereas if the VM had started on prox3 it would have had local data, but there is no way that we as a plugin could influence that.


Thanks, @rck, that’s very helpful!

Just to make sure I got it right, in the case I described, the following things happened:

  1. prox2 failed
  2. The VM was failed over by Proxmox to prox1
  3. On prox1, there is no DRBD disk so a diskless assignment is created (by the Linstor Proxmox plugin) and the VM is started

Which leaves me with the following questions:

  • At what moment does Linstor start the replication to prox1 (which is the only node left on which a mirror can be restored)? In other words, is this decision influenced by Proxmox via the Linstor plugin, or does Linstor make this decision independently?
  • Assuming I had more nodes available, if I understood you correctly, Linstor would still prefer to do the replication on the node where the VM is running now, correct?
  • Will Proxmox, with help of the Linstor plugin, switch from diskless to diskful automatically as soon as the resync is done so that data can be accessed locally?

Yes, that is basically what IMO happened. I'd even assume that prox1 already had a diskless resource in place: in your scenario LINSTOR would already have created a diskless tie-breaker resource on prox1 for quorum purposes to begin with. This existing diskless resource was then reused.

The plugin has no influence on that at all, but there are various LINSTOR settings that might. In the short run LINSTOR does not do anything at all: the failed node might come back, reconnect, sync the delta, and be happy again. AFAIK, after some configurable time LINSTOR considers the node dead, and then after some further time a task kicks in that tries to re-establish the replica count. Assuming enough storage nodes, this could be a different node than the one where the resource is currently diskless Primary.
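If memory serves, the relevant knobs are LINSTOR's auto-evict properties, roughly along these lines (check the user guide for exact names and defaults, the values here are only examples):

```
# minutes a node may stay offline before LINSTOR considers evicting it
linstor controller set-property DrbdOptions/AutoEvictAfterTime 60
# opt a single node out of auto-eviction entirely
linstor node set-property prox2 DrbdOptions/AutoEvictAllowEviction false
```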

Then there is also a setting, which IIRC is off by default, that automatically converts diskless Primaries to diskful copies after some configurable time, if possible. That might be helpful if one really prefers local storage. It can also remove unnecessary extra copies if needed.
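A sketch of what that could look like, set on the resource group from the example above (again, double-check the property names against the user guide):

```
# convert a diskless Primary to a diskful replica after 10 minutes
linstor resource-group set-property rg-mirror DrbdOptions/auto-diskful 10
# allow LINSTOR to clean up surplus replicas created that way
linstor resource-group set-property rg-mirror DrbdOptions/auto-diskful-allow-cleanup true
```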

No, not necessarily; see the answer to the previous question. If LINSTOR is free to choose a node, it usually picks the one with the most free space.

Once more, neither Proxmox nor the plugin triggers anything in that regard; LINSTOR itself might (see what I wrote about auto-diskful). The rest is done by DRBD itself: when LINSTOR triggers that, the DRBD resource is updated from "disk none" to a resource with a backing device, and then DRBD does what DRBD does and resyncs the data to the local backing device. While that is running the disk is perfectly usable, and data it does not yet have locally is read over the network from a peer that has it.
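While such a resync is running, you can watch its progress with the usual DRBD tooling, e.g. (resource name again a placeholder):

```
drbdadm status vm-100-disk-1
# or, with more detail:
drbdsetup status vm-100-disk-1 --verbose --statistics
```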

I understand this now, thanks for the explanation.

When you write

In the short run LINSTOR does not do anything at all: the failed node might come back, reconnect, sync the delta, and be happy again.

I assume you mean that this is handled on a lower level, i.e., by DRBD directly.

AFAIK, after some configurable time LINSTOR considers the node dead, and then after some further time a task kicks in that tries to re-establish the replica count.

It would be interesting to understand better which configuration this is. Are you maybe referring to ping-timeout, which would lead to a connection state of Unconnected (as per DRBD - Connection States)? Or are you referring to Linstor's Auto-Evict feature?

Unless I’m missing something, this does sound like a configuration that should generally be preferred. Yesterday I watched the presentation @phil_reisner gave to the folks at CloudStack, and local access was one of the main selling points of DRBD that he mentioned.

I assume this can be set up using Auto-Diskful and Related Options.

I’m just wondering how such a setting will behave in a cluster that has both diskful and intentionally diskless (i.e., compute-only) nodes. Will it break things, or will the setting simply be ignored on a diskless node?