Solved: Could not connect to any LINSTOR controller (after HA)

Despite the open thread Differences in DRBD resources, I followed Creating a Highly Available LINSTOR Cluster and was quite happy that everything went smoothly; checking the status in between and at the end, everything seemed right: all satellite nodes online, both controllers listed comma-separated in linstor-client.conf, and linstor controller which showing the currently active machine, correlating with the state of drbd-reactor. The controller and drbd-reactor services were now installed and running on my PVE nodes, PVE-1 and PVE-2, with PVE-1 being active at the moment.
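
For reference, the controller list in /etc/linstor/linstor-client.conf looks roughly like this (a sketch with my IPs; check the LINSTOR user guide for the full option list):

[global]
controllers=linstor://192.168.113.21,linstor://192.168.113.22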

As there were Proxmox updates pending, I installed them on one PVE node (PVE-2, IP: 192.168.113.22) and rebooted the node afterwards.

The DRBD resources did not come online after the reboot. None of the VMs and LXC containers can be started or migrated.

Proxmox shows error:

TASK ERROR: could not connect to any LINSTOR controller at /usr/share/perl5/PVE/Storage/Custom/LINSTORPlugin.pm line 241

Really weird, because linstor commands on the failing node look fine:

root@pve-2:~# linstor controller which
linstor://192.168.113.21 (which is the other node, PVE-1 -> IMHO correct)

root@pve-2:~# linstor node list -p
+------------------------------------------------------------+
| Node    | NodeType  | Addresses                   | State  |
|============================================================|
| pve-1   | SATELLITE | 192.168.113.21:3366 (PLAIN) | Online |
| pve-2   | SATELLITE | 192.168.113.22:3366 (PLAIN) | Online |
| raspi-1 | SATELLITE | 192.168.111.20:3366 (PLAIN) | Online |
+------------------------------------------------------------+

Other linstor commands like storage-pool list, resource-group list, or resource list also show, IMHO, no failure. The only oddity: all DRBD resources (disks) that were formerly in use on the rebooted node now show "Usage: Unused" and "State: UpToDate".

root@pve-2:~# drbdsetup status pm-575f24e5
pm-575f24e5 role:Secondary
  disk:UpToDate open:no
  pve-1 role:Secondary
    peer-disk:UpToDate
  raspi-1 role:Secondary
    peer-disk:Diskless

# can even set this to primary on the rebooted node (PVE-2)

root@pve-2:~# drbdadm primary pm-575f24e5
root@pve-2:~# drbdsetup status pm-575f24e5
pm-575f24e5 role:Primary
  disk:UpToDate open:no
  pve-1 role:Secondary
    peer-disk:UpToDate
  raspi-1 role:Secondary
    peer-disk:Diskless

But the error remains. Half of my systems are down now and can’t be started.
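
Side note for anyone reproducing this: demote the resource again afterwards, so nothing fights with the plugin over the Primary role:

root@pve-2:~# drbdadm secondary pm-575f24e5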

Going through journalctl made it obvious that some PVE processes were not happy with the new configuration. As said before, the linstor tools themselves seemed OK.
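
In case it helps others, this is roughly how I looked at the relevant PVE daemons (the stock PVE service names):

root@pve-2:~# journalctl -u pvedaemon -u pvestatd -u pveproxy --since "1 hour ago"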

Before setting up HA, the controller ran in an LXC container. Now the controller service is installed on each PVE node, directly in the hypervisor OS. This led to different IP addresses than that of the former LXC container. I did change the old address to the two new IP addresses in linstor-client.conf, but forgot to change /etc/pve/storage.cfg as well. So PVE (and the LINSTOR PVE plugin) could not reach the controller.
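
For reference, the relevant entry in /etc/pve/storage.cfg now looks roughly like this (storage name and resource group are from my setup; yours will differ):

drbd: drbdstorage
        content images,rootdir
        controller 192.168.113.21,192.168.113.22
        resourcegroup defaultpool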

After adding the new controller IP addresses to /etc/pve/storage.cfg and restarting the PVE services, I could start the LXC containers and VMs on DRBD storage again.
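
Restarting meant something like this (the stock PVE services; the exact set may vary):

root@pve-2:~# systemctl restart pvedaemon pvestatd pveproxy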

Reference: Can’t make proxmox-linbit plugin to use multiple controller