We recently experienced an issue that I’d like to get some feedback on.
Our setup is a small Proxmox cluster with linstor-controller configured for HA using drbd-reactor. We had shut down the whole cluster for maintenance on our networking infrastructure. When we powered the nodes back up, the linstor_db volume failed to come up, which in turn kept the controller, and with it the whole cluster, from starting.
Specifically, drbdadm showed that the primary node was in Diskless mode, while all other nodes were Outdated. In other words, no node held an UpToDate copy of the data, and this logical conflict prevented the resource from becoming available. The backing storage volume was fine, and I spent some time trying to get the primary to become diskful, but with no luck.
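To make that state concrete, here is a minimal Python sketch that scans `drbdadm status` output for any UpToDate disk. The sample text is an illustrative reconstruction of what is described above, not captured output; the node names pve2 and pve3, the roles shown, and the helper function names are all placeholders.

```python
import re

def disk_states(status_text):
    """Return every local and peer disk state found in `drbdadm status` text."""
    # Matches both "disk:Diskless" (local) and "peer-disk:Outdated" (peers).
    return re.findall(r"\bdisk:(\w+)", status_text)

def has_up_to_date_replica(status_text):
    """True if at least one node reports an UpToDate disk."""
    return "UpToDate" in disk_states(status_text)

# Illustrative sample only: hypothetical node names, roles assumed.
SAMPLE = """\
linstor_db role:Secondary
  disk:Diskless
  pve2 role:Secondary
    peer-disk:Outdated
  pve3 role:Secondary
    peer-disk:Outdated
"""

print(disk_states(SAMPLE))
print(has_up_to_date_replica(SAMPLE))
```

With no UpToDate replica visible from any node, a normal (non-forced) promotion has nothing to promote, which matches what we saw.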
Finally, I force-promoted one of the Outdated nodes to primary, which allowed the linstor_db resource to become available, so the controller could start on that node and the satellites finally came online again. At this point the original primary node was diskful again. The force-promoted node was also in StandAlone mode, but this was expected. I then made the original primary node the primary again and recreated the replica on the force-promoted node by running toggle-disk twice (once to make it diskless, once to make it diskful again), which resolved all remaining issues.
My questions:
- How could this logical deadlock occur?
- Is there a better way to resolve this situation? Could I somehow have convinced the Diskless primary to become diskful?