Linstor satellite upgrade hangs in postinst on proxmox 8.2.4

I’m not sure if this info will be helpful, but if anyone runs into the same issue I did…

I have a three-node test Proxmox 8.2.4 cluster that I’ve set up to play with, and I had previously set up Linstor on it with an HA controller. All working fine. I can reboot a controller and another node takes over, live migrate a test VM, etc.

I logged in to the Proxmox console today and saw that there were package updates for linstor common, controller, and satellite from 1.28.0-1 to 1.29.0-1.

Per the instructions in the manual, I started the upgrade on the active controller node first. When I ran apt dist-upgrade, it proceeded for a while and then hung on:

Setting up linstor-satellite (1.29.0-1) ...

Looking at the process table shows it is stuck in the following command:

/bin/sh /var/lib/dpkg/info/linstor-satellite.postinst configure 1.28.0-1

Ctrl-C won’t get me out of it. I rebooted the server and ran:

dpkg --configure -a

It ran with no issues and didn’t hang.

I repeated this same sequence (hang, reboot, dpkg --configure -a) on the second node. However, on the third node (which had become the active controller after I rebooted the previous one) it didn’t hang. It just went right through. I rebooted it anyway.

Now everything is working again, but this was a little concerning because I have a production cluster set up the same way. I only have one app on it so far, so no big deal. Any ideas how I could debug this issue when upgrading the prod cluster, so I could provide the info here?

Thanks

This is likely happening because of the following systemd override we suggest for proxmox:

$ cat /etc/systemd/system/linstor-satellite.service.d/override.conf
[Service]
Type=notify
TimeoutStartSec=infinity

What it means is that on start (or restart), the satellite service will wait until it is connected to a controller. This is generally useful for Proxmox, as Proxmox should only start after the storage system is fully operational.

However, this also means that during an apt upgrade, which will restart the satellite, the dpkg process will wait for the satellite to be connected to the controller again. But this requires the right version of the controller to be running. This might not always be the case.
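
If you want to confirm that this is what is happening while an upgrade appears stuck, here is a minimal sketch to run from a second terminal. It assumes that a Type=notify unit which has not yet reported readiness shows ActiveState=activating; the helper name unit_is_starting is made up for illustration and is not part of LINSTOR.

```shell
# Sketch only: run from a second terminal while apt/dpkg appears stuck.
# Assumption: a Type=notify unit that has not yet reported readiness
# shows ActiveState=activating in systemd.

# unit_is_starting is a hypothetical helper name, not part of LINSTOR.
# Returns 0 if the unit's start job has not finished yet.
unit_is_starting() {
    unit="${1:-linstor-satellite}"
    state="$(systemctl show --property=ActiveState --value "$unit")"
    [ "$state" = "activating" ]
}

# Example use (commented out so that sourcing this file has no side effects):
# if unit_is_starting linstor-satellite; then
#     systemctl list-jobs                                  # shows the pending start job
#     journalctl -u linstor-satellite -b --no-pager | tail -n 50
#     linstor controller version                           # is a controller answering, and which version?
# fi
```

If the satellite is still "activating" and the controller is not yet on the new version, the postinst is most likely just waiting on the restart described above.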

So you have a few options to work around this issue:

  • Before the upgrade, manually disable the satellite service, and only start it again after everything is upgraded, including the controller:
    systemctl disable --now linstor-satellite # on all nodes
    apt dist-upgrade
    linstor controller version # Verify that the controller is running the latest version
    systemctl enable --now linstor-satellite # on all nodes
    
  • Upgrade the linstor-controller first. This will only work if the controller is running on a separate node, as otherwise the linstor-satellite will automatically be upgraded too, and you might run into the same issue.
  • Temporarily remove the override of Type=notify, so the satellite will be considered started before the controller connects.
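
For the last option, a rough sketch of what "temporarily remove" could look like. It relies on systemd only reading drop-in files that end in .conf, so renaming the file is enough to deactivate it. The function names and the .disabled suffix are made up for illustration; the path is the one shown above.

```shell
# Sketch only: temporarily take the drop-in out of play, then put it back.
# Assumption: the override lives at the path shown earlier in this thread.
OVERRIDE="${OVERRIDE:-/etc/systemd/system/linstor-satellite.service.d/override.conf}"

# Hypothetical helper: rename the drop-in so systemd ignores it.
disable_override() {
    mv "$OVERRIDE" "$OVERRIDE.disabled" && systemctl daemon-reload
}

# Hypothetical helper: restore the drop-in under its original name.
restore_override() {
    mv "$OVERRIDE.disabled" "$OVERRIDE" && systemctl daemon-reload
}

# Example sequence on one node (commented out so nothing runs by accident):
# disable_override
# apt dist-upgrade
# linstor controller version   # verify that the controller is running the latest version
# restore_override
```

The restored Type=notify and TimeoutStartSec=infinity settings apply again the next time the satellite is started or restarted.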

Just to add: personally, I’ve decoupled the HA controller from the satellites. It runs on 3 dedicated VMs just for that purpose.

The sequence I follow before upgrading the controller is:

  • Disable its drbd-reactor managed resource on all nodes to prevent it from restarting the controller on a different node than the one that’s currently active.
  • Perform the upgrade on non-active nodes first.
  • Perform the upgrade on the last, active node.
  • Re-enable drbd-reactor managed resource.
  • Perform the upgrade on the satellite nodes.

Okay, thanks both for the feedback. I’ll consider alternate approaches to avoid the hang in the future.