Trouble during cluster cold start: linstor_db Diskless + Outdated

We recently experienced an issue that I’d like to get some feedback on.

Our setup is a small Proxmox cluster with linstor-controller configured for HA using drbd-reactor. We had shut down the whole cluster for maintenance on our networking infrastructure. When we powered the nodes back up, the linstor_db volume failed to come up, which kept the controller, and with it the whole cluster, offline.

Specifically, drbdadm showed that the primary node was Diskless, while all other nodes were Outdated. This logical conflict prevented the resource from becoming available. The backing storage volume was fine, and I spent some time trying to get the primary to become diskful, but with no luck.

Finally, I force-promoted one of the Outdated nodes to primary. This made the linstor_db resource available, so the controller could start on that node and the satellites came back online. At this point the original primary node was diskful again. The force-promoted node was in StandAlone mode, but this was expected. Making the original primary node the primary again, then recreating the replica on the force-promoted node by running toggle-disk twice, solved all issues.
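For reference, the recovery sequence described above corresponds roughly to the commands below. This is a sketch reconstructed from the description, not a transcript of what was run: the node name and storage pool name are placeholders, and the run wrapper with its DRY_RUN switch is only there so the commands can be reviewed before anything is executed. Note that a forced promotion overrides the Outdated state, so it is only safe if that node's data really is the most recent.

```shell
# Sketch of the recovery sequence; node and pool names are placeholders.
# DRY_RUN defaults to 1, so commands are only printed unless DRY_RUN=0.
RES="${RES:-linstor_db}"

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# On the Outdated node chosen to take over: promote despite Outdated data.
force_promote() {
    run drbdadm primary --force "$RES"
}

# Once the controller is reachable again: recreate the replica on a node
# by toggling it to diskless and then back to diskful.
recreate_replica() {
    node="$1"
    pool="$2"
    run linstor resource toggle-disk "$node" "$RES" --diskless
    run linstor resource toggle-disk "$node" "$RES" --storage-pool "$pool"
}
```

With the default DRY_RUN=1, calling recreate_replica with a node name and a pool name prints the two toggle-disk commands without touching the cluster.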

My questions:

  • How could this logical deadlock occur?
  • Is there a better way to solve this situation? Could I somehow have convinced the Diskless primary to become diskful?

Without logs or further information, I can only guess as to what happened.

My guess at this point would be that something prevented the disk from attaching at startup. I would be curious whether an attempt to reattach the disk might have resolved things. If you run into this issue again, try a quick drbdadm adjust <res> to make the resource try to attach its backing disk again.
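As a rough sketch, the adjust attempt could be wrapped with a status check before and after, so you can see whether the disk state actually changes. The resource name linstor_db is taken from this thread; the run wrapper and its DRY_RUN switch are illustrative helpers, not part of DRBD.

```shell
# Sketch: retry attaching the backing disk of a DRBD resource.
# DRY_RUN defaults to 1, so commands are only printed unless DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

retry_attach() {
    res="${1:-linstor_db}"
    run drbdadm status "$res"   # disk and connection state before
    run drbdadm adjust "$res"   # re-apply the configuration, including the disk
    run drbdadm status "$res"   # check whether the disk state changed
}
```

Run it on the node that shows Diskless, for example DRY_RUN=0 retry_attach linstor_db. If the disk still does not attach, the kernel log on that node should contain the reason for the failed attach.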