Proxmox: TASK ERROR: API Return-Code: 500. Message: Could not rollback cluster wide snapshot

Hi Community

I built a Proxmox cluster with LINSTOR in my homelab and I’m still in the experimental phase, so it’s not a big deal if something breaks. At the moment it’s a 2-node cluster with quorum, as one node was DOA.

Installed software:

root@pve01 ~$ dpkg -l | grep linstor
linstor-client                       1.24.0-1
linstor-common                       1.30.4-1
linstor-controller                   1.30.4-1
linstor-proxmox                      8.1.0-1
linstor-satellite                    1.30.4-1
python-linstor                       1.24.0-1

root@pve01 ~$ dpkg -l | grep drbd
drbd-dkms                            9.2.12-2
drbd-utils                           9.30.0-1

root@pve01 ~$ pveversion -v
proxmox-ve: 8.3.0 (running kernel: 6.8.12-8-pve)
pve-manager: 8.3.5 (running version: 8.3.5/dac3aa88bac3f300)
proxmox-kernel-helper: 8.1.1
proxmox-kernel-6.8: 6.8.12-8
proxmox-kernel-6.8.12-8-pve-signed: 6.8.12-8
proxmox-kernel-6.8.12-4-pve-signed: 6.8.12-4
ceph-fuse: 17.2.7-pve3
corosync: 3.1.7-pve3
criu: 3.17.1-2+deb12u1
frr-pythontools: 8.5.2-1+pve1
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx11
intel-microcode: 3.20250211.1~deb12u1
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-5
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.6.0
libproxmox-backup-qemu0: 1.5.1
libproxmox-rs-perl: 0.3.5
libpve-access-control: 8.2.0
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.10
libpve-cluster-perl: 8.0.10
libpve-common-perl: 8.2.9
libpve-guest-common-perl: 5.1.6
libpve-http-server-perl: 5.2.0
libpve-network-perl: 0.10.1
libpve-rs-perl: 0.9.2
libpve-storage-perl: 8.3.3
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.5.0-1
openvswitch-switch: 3.1.0-2+deb12u1
proxmox-backup-client: 3.3.3-1
proxmox-backup-file-restore: 3.3.3-1
proxmox-firewall: 0.6.0
proxmox-kernel-helper: 8.1.1
proxmox-mail-forward: 0.3.1
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.7
proxmox-widget-toolkit: 4.3.6
pve-cluster: 8.0.10
pve-container: 5.2.4
pve-docs: 8.3.1
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.2
pve-firewall: 5.1.0
pve-firmware: 3.14-3
pve-ha-manager: 4.0.6
pve-i18n: 3.4.0
pve-qemu-kvm: 9.2.0-2
pve-xtermjs: 5.3.0-3
qemu-server: 8.3.8
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.7-pve1

Today, I received this error:

TASK ERROR: API Return-Code: 500. Message: Could not rollback cluster wide snapshot snap_pm-a203816c_Netzwerk of pm-a203816c, because...

I put the output in collapsible tags and hope that makes it more readable.

In PVE console (shortened):

TASK ERROR: API Return-Code: 500. Message: Could not rollback cluster wide snapshot snap_pm-a203816c_Netzwerk of pm-a203816c, because: [{"ret_code":34340867,"message":"Snapshot 'snap_pm-a203816c_Netzwerk' of resource 'pm-a203816c' marked down for rollback.","obj_refs":{"RscDfn":"pm-a203816c","Snapshot":"snap_pm-a203816c_Netzwerk"},"created_at":"2025-03-24T08:01:41.881991595+01:00"},{"ret_code":36962307,"message":"(pve02) Resource 'pm-a203816c' [DRBD] adjusted.","obj_refs":{"RscDfn":"pm-a203816c","Snapshot":"snap_pm-a203816c_Netzwerk"},"created_at":"2025-03-24T08:01:41.986547094+01:00"},{"ret_code":34340867,"message":"Deactivated resource 'pm-a203816c' on 'pve02' for rollback","obj_refs":{"RscDfn":"pm-a203816c","Snapshot":"snap_pm-a203816c_Netzwerk"},"created_at":"2025-03-24T08:01:41.98665986+01:00"},{"ret_code":36962307,"message":"(pve01) Resource 'pm-a203816c' [DRBD] adjusted.","obj_refs":{"RscDfn":"pm-a203816c","Snapshot":"snap_pm-a203816c_Netzwerk"},"created_at":"2025-03-24T08:01:42.006239917+01:00"},{"ret_code":34340867,"message":"Deactivated resource 'pm-a203816c' on 'pve01' for rollback","obj_refs":{"RscDfn":"pm-a203816c","Snapshot":"snap_pm-a203816c_Netzwerk"},"created_at":"2025-03-24T08:01:42.006305288+01:00"},{"ret_code":4611686018461738777,"message":"All satellites failed the snapshot rollback. Aborting. Data remains unchanged.","obj_refs":{"RscDfn":"pm-a203816c","Snapshot":"snap_pm-a203816c_Netzwerk"},"created_at":"2025-03-24T08:01:42.240003653+01:00"},{"ret_code":-4611686018393046042,"message":"(Node: 'pve02') Failed to rollback to snapshot linstor_LinstorStorage/pm-a203816c_00000_snap_pm-a203816c_Netzwerk","details":"Command 'lvconvert --config 'devices { filter=['\"'\"'a|/dev/nvme0n1|'\"'\"','\"'\"'r|.*|'\"'\"'] }' --merge linstor_LinstorStorage/pm-a203816c_00000_snap_pm-a203816c_Netzwerk' returned with exitcode 5. \n\nStandard out: \n\n\nError message: \n  linstor_LinstorStorage/pm-a203816c_00000_snap_pm-a203816c_Netzwerk is not a mergeable logical volume.\n\n","error_report_ids":["67E1012A-CF113-000000"],"obj_refs":{"RscDfn":"pm-a203816c","Snapshot":"snap_pm-a203816c_Netzwerk"},"created_at":"2025-03-24T08:01:42.298084003+01:00"},{"ret_code":-4611686018393046042,"message":"(Node: 'pve01') Failed to rollback to snapshot linstor_LinstorStorage/pm-a203816c_00000_snap_pm-a203816c_Netzwerk","details":"Command 'lvconvert --config 'devices { filter=['\"'\"'a|/dev/nvme0n1|'\"'\"','\"'\"'r|.*|'\"'\"'] }' --merge linstor_LinstorStorage/pm-a203816c_00000_snap_pm-a203816c_Netzwerk' returned with exitcode 5. \n\nStandard out: \n\n\nError message: \n  linstor_LinstorStorage/pm-a203816c_00000_snap_pm-a203816c_Netzwerk is not a mergeable logical volume.\n\n","error_report_ids":["67E1013B-6F0E1-000000"]
...

And here is the error report from the LINSTOR GUI:

ERROR REPORT 67E1012A-CF113-000000

============================================================

Application:                        LINBIT® LINSTOR
Module:                             Satellite
Version:                            1.30.4
Build ID:                           bef74a44609cb592c5efad2e707b50e696623c61
Build time:                         2025-02-03T15:48:28+00:00
Error time:                         2025-03-24 08:01:46
Node:                               pve02
Thread:                             DeviceManager

============================================================

Reported error:
===============

Category:                           LinStorException
Class name:                         StorageException
Class canonical name:               com.linbit.linstor.storage.StorageException
Generated at:                       Method 'checkExitCode', Source file 'ExtCmdUtils.java', Line #69

Error message:                      Failed to rollback to snapshot linstor_LinstorStorage/pm-a203816c_00000_snap_pm-a203816c_Netzwerk

Error context:
        An error occurred while processing resource 'Node: 'pve02', Rsc: 'pm-a203816c''
ErrorContext:
  Details:     Command 'lvconvert --config 'devices { filter=['"'"'a|/dev/nvme0n1|'"'"','"'"'r|.*|'"'"'] }' --merge linstor_LinstorStorage/pm-a203816c_00000_snap_pm-a203816c_Netzwerk' returned with exitcode 5. 

Standard out: 


Error message: 
  linstor_LinstorStorage/pm-a203816c_00000_snap_pm-a203816c_Netzwerk is not a mergeable logical volume.




Call backtrace:

    Method                                   Native Class:Line number
    checkExitCode                            N      com.linbit.extproc.ExtCmdUtils:69
    genericExecutor                          N      com.linbit.linstor.storage.utils.Commands:103
    genericExecutor                          N      com.linbit.linstor.storage.utils.Commands:63
    genericExecutor                          N      com.linbit.linstor.storage.utils.Commands:51
    rollbackToSnapshot                       N      com.linbit.linstor.layer.storage.lvm.utils.LvmCommands:433
    lambda$rollbackImpl$10                   N      com.linbit.linstor.layer.storage.lvm.LvmThinProvider:371
    execWithRetry                            N      com.linbit.linstor.layer.storage.lvm.utils.LvmUtils:728
    rollbackImpl                             N      com.linbit.linstor.layer.storage.lvm.LvmThinProvider:368
    rollbackImpl                             N      com.linbit.linstor.layer.storage.lvm.LvmThinProvider:58
    handleRollbacks                          N      com.linbit.linstor.layer.storage.AbsStorageProvider:1325
    processVolumes                           N      com.linbit.linstor.layer.storage.AbsStorageProvider:390
    processResource                          N      com.linbit.linstor.layer.storage.StorageLayer:285
    lambda$processResource$4                 N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:1368
    processGeneric                           N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:1411
    processResource                          N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:1364
    processChild                             N      com.linbit.linstor.layer.drbd.DrbdLayer:353
    processResource                          N      com.linbit.linstor.layer.drbd.DrbdLayer:228
    lambda$processResource$4                 N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:1368
    processGeneric                           N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:1411
    processResource                          N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:1364
    processResources                         N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:386
    dispatchResources                        N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:228
    dispatchResources                        N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:333
    phaseDispatchDeviceHandlers              N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:1148
    devMgrLoop                               N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:778
    run                                      N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:674
    run                                      N      java.lang.Thread:840


END OF ERROR REPORT.

While searching the forum, I came across the post by c.duchenoy from last October and wonder whether it’s the same problem.

Does anyone have any tips for me, or can someone point me in the right direction? I’m pretty sure that rollbacks have worked in the past. It seems there is some dependency for this VM on node pve02, even though the VM itself is running on pve01.

Thank you in advance.

Best,
Hiu

It seems that rolling back to a snapshot older than the most recent one always fails, while rolling back to the most recent snapshot works.
Deleting the most recent snapshot and then rolling back (to the now second-most-recent one) also doesn’t work.

I don’t know if this finding helps to identify the issue.

Hi,

Is your backend storage lvm-thin or ZFS?

If it is lvm-thin, then yes, it can be the same problem, because the parent/child relationship of the snapshots is lost.
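You can see that relationship (or its absence) with plain LVM tools. A quick check, as a sketch, assuming the VG name linstor_LinstorStorage from the error report:

root@pve01 ~$ lvs -o lv_name,origin,pool_lv,lv_attr linstor_LinstorStorage
# lvconvert --merge only works on a snapshot LV that still has an origin;
# if the Origin column is empty for the *_snap_* volume, the merge fails with
# "is not a mergeable logical volume", exactly like in the error report.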

We’ve written a workaround

It works for us :slight_smile:

/Cyril

Hi Cyril

Yes, it’s lvm-thin too.

I have already seen your script and am currently looking at it. Fantastic work, thank you for that! I will try it out.

Best,
Hiu

Thank you, can you tell us if it did the job for you?

However, be careful: it’s only compatible with QEMU VMs and hasn’t been validated for LXC containers.

Unfortunately, it seems not to have worked. My VMID is 100 and the snapshot name is “Netzwerk”. I get the following output:

root@pve01 ~$ ./geco-linstor-pve-snap.sh rollback --vmid 100 --snap Netzwerk
Rollback DRBD volume snapshot: "Netzwerk" - Proxmox VMID: "100"

There is no further output on the CLI, and the VM was not restored. I will try this again with another test VM.

Does not work for me. Am I doing something wrong?

What I did:

  • created a snapshot (“Geco”)
  • changed hostname in VM (for later verification)
  • created a snapshot (“Test”)
  • tried to roll back to snapshot “Geco” = same error
  • ran the script as follows:
root@pve01 ~$ ./geco-linstor-pve-snap.sh rollback --vmid 101 --snap Geco
Rollback DRBD volume snapshot: "Geco" - Proxmox VMID: "101"

Hi Gabor

Since you mentioned that you are working on a solution and are already in the testing phase:

Is there any news on that? Is there a chance of a short-term solution? My installation is a homelab, so if you need another victim for testing, I’d be happy to help.

Best,
Hiu

Answering my own question. :grin:

Found this drbd-9.2.13-rc.1 announcement addressing an issue with lvm-thin:

LINSTOR sets the rs-discard-granularity when the backing devices are thinly provisioned (lvm-thin or zfs-thin).

I don’t quite understand what it’s about and there’s nothing more to read about it in the release notes of drbd-9.2.13, but maybe I should give it a try.
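If you want to check whether that option is actually set on a resource, you can inspect the DRBD configuration that LINSTOR generates (a sketch, using the resource name from this thread; nothing is printed if the option is not set):

root@pve01 ~$ drbdsetup show pm-a203816c | grep -i discard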

Unfortunately, this does not work with the following setup (drbd 9.2.13):

root@pve01 ~$ dpkg -l | grep linstor
ii  linstor-client                       1.25.0-1
ii  linstor-common                       1.30.4-1
ii  linstor-controller                   1.30.4-1
ii  linstor-proxmox                      8.1.0-1
ii  linstor-satellite                    1.30.4-1
ii  python-linstor                       1.25.0-1

root@pve01 ~$ dpkg -l | grep drbd
ii  drbd-dkms                            9.2.13-1
ii  drbd-utils                           9.30.0-1

root@pve01 ~$ pveversion -v
proxmox-ve: 8.3.0 (running kernel: 6.8.12-9-pve)
pve-manager: 8.3.5 (running version: 8.3.5/dac3aa88bac3f300)
proxmox-kernel-helper: 8.1.1
proxmox-kernel-6.8: 6.8.12-9
proxmox-kernel-6.8.12-9-pve-signed: 6.8.12-9
proxmox-kernel-6.8.12-8-pve-signed: 6.8.12-8
proxmox-kernel-6.8.12-4-pve-signed: 6.8.12-4
ceph-fuse: 17.2.7-pve3
corosync: 3.1.7-pve3
criu: 3.17.1-2+deb12u1
frr-pythontools: 8.5.2-1+pve1
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx11
intel-microcode: 3.20250211.1~deb12u1
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-5
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.6.0
libproxmox-backup-qemu0: 1.5.1
libproxmox-rs-perl: 0.3.5
libpve-access-control: 8.2.0
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.10
libpve-cluster-perl: 8.0.10
libpve-common-perl: 8.2.9
libpve-guest-common-perl: 5.1.6
libpve-http-server-perl: 5.2.0
libpve-network-perl: 0.10.1
libpve-rs-perl: 0.9.2
libpve-storage-perl: 8.3.3
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.5.0-1
openvswitch-switch: 3.1.0-2+deb12u1
proxmox-backup-client: 3.3.4-1
proxmox-backup-file-restore: 3.3.4-1
proxmox-firewall: 0.6.0
proxmox-kernel-helper: 8.1.1
proxmox-mail-forward: 0.3.1
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.7
proxmox-widget-toolkit: 4.3.7
pve-cluster: 8.0.10
pve-container: 5.2.4
pve-docs: 8.3.1
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.2
pve-firewall: 5.1.0
pve-firmware: 3.14-3
pve-ha-manager: 4.0.6
pve-i18n: 3.4.1
pve-qemu-kvm: 9.2.0-2
pve-xtermjs: 5.3.0-3
qemu-server: 8.3.8
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.7-pve2

Hi,

In production, we use older versions of the Linstor stack, but I don’t think that’s the problem.

However, I believe that the snapshots on the DRBD side should be lowercase. See: Snapshots only lowercase

You can enable “DEBUG=true” mode in the script for more information.
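For example (assuming the script picks the variable up from the environment; otherwise set DEBUG=true at the top of the script itself):

root@pve01 ~$ DEBUG=true ./geco-linstor-pve-snap.sh rollback --vmid 100 --snap Netzwerk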

Hi Cyril

Neither of these works for me. If I delete an old snapshot and then try to restore the last one, I get the same errors. I also have no luck with DEBUG: the script only prints one line and then stops without doing anything.

I also noticed that DRBD still keeps snapshots of VMs that have already been deleted, and of snapshots that were themselves deleted. I’m not sure whether that’s how it’s supposed to work. Maybe I just have to learn how DRBD works; this is all new to me.
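A way to check what is still lying around on the LINSTOR and LVM side (just a sketch; linstor_LinstorStorage is the VG from the error report above):

root@pve01 ~$ linstor snapshot list
root@pve01 ~$ linstor resource list
root@pve01 ~$ lvs linstor_LinstorStorage    # shows the backing thin LVs and any leftover *_snap_* volumes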

I might also just switch to backups via PBS instead of snapshots. That works much better for me, and I can restore any snapshot: I tested it and can restore snap1, then roll back to snap2, and back again to snap1. This is actually what I want.

Best,
Hiu

The script has “set -e” mode enabled, so it exits immediately at the slightest error.
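A generic way to see exactly where it stops (not specific to the script) is to run it with bash tracing:

root@pve01 ~$ bash -x ./geco-linstor-pve-snap.sh rollback --vmid 101 --snap Geco 2>&1 | tail -n 30
# the last traced command before the exit is the one that returned a non-zero status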

I think there’s something subtle in your environment.

DRBD/Linstor is a reliable and powerful technology that has some integration issues…

But I’d like to thank the Linbit team very much for making this great product available to us! (A long-time user since drbd8 :))

Yes, there’s a learning curve, but it’s essential for tech clusters!

Cyril


Hello,

Sorry for the late response!

Since 1.31.0-rc1, LINSTOR should deal better with these situations. So if you are still willing to help, I’d appreciate feedback, although we are already working on the final 1.31.0.

In this version we changed the implementation of a LINSTOR snapshot rollback from lvconvert --merge ... (see the ErrorReport in your first message) to basically linstor resource delete ...; linstor snapshot restore .... Cyril correctly pointed out that the issue here is the parent-child dependency. With LINSTOR’s new approach we are no longer limited by that dependency, since “rollback” is now implemented as “delete + restore”, which does not care about such parent-child relations.
The same also works for ZFS, but it is more complicated, since ZFS does not allow deleting a ZVOL while it still has snapshots, which LINSTOR needs to do for the new “delete + restore” approach. Therefore we had to implement some renaming and reference counting for the ZFS use case to make it work.
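Just to illustrate the difference (a rough sketch, not the literal command sequence LINSTOR runs internally; pm-a203816c-restore is a hypothetical target name):

# old approach: merge the thin snapshot back into its origin; this is what fails
# when the parent/child relation is gone
lvconvert --merge linstor_LinstorStorage/pm-a203816c_00000_snap_pm-a203816c_Netzwerk

# restore-style approach: no merge involved, the snapshot data is simply
# materialized into a (new) resource
linstor resource-definition create pm-a203816c-restore
linstor snapshot volume-definition restore --from-resource pm-a203816c --from-snapshot snap_pm-a203816c_Netzwerk pm-a203816c-restore
linstor snapshot resource restore --from-resource pm-a203816c --from-snapshot snap_pm-a203816c_Netzwerk pm-a203816c-restore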

Let us know if you run into problems with the new LINSTOR version regarding this issue!

Best regards,
Gabor

Hi Gabor

Thank you for your response. I would be happy to test the upcoming release. My cluster is still in the trial phase. Is there somewhere I can read up on how to get and install this release?

Best,
Hiu

Hello Hiu,

The final 1.31.0 has in the meantime already been released, so simply upgrade as usual.

Best regards,
Gabor
