Snapshot rollback on Proxmox with LVM-thin as storage pool seems unstable

Hi,

We have several Proxmox installations with LINSTOR as SDS. We found out that if you take a snapshot and roll back with ZFS as the storage pool, everything works flawlessly.
However, when you try the same thing with an LVM-thin storage pool, it fails almost every time. Sometimes it works, but when it does not, you are left with stuck resources.

We tried:

  • the VM powered on and powered off
  • the oldest snapshot
  • the most recent one
  • a snapshot in the middle of the “tree”
  • DrbdOptions/Net/allow-two-primaries set to yes or no: same problem

We did not find a stable working method of rolling back snapshots via the Proxmox web UI.

I will try the same with the linstor CLI and report back.

EDIT: Yes, same problem with the linstor snapshot rollback command.
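For reference, the invocation is essentially the following (resource and snapshot names below are placeholders, not the exact names from our setup):

linstor snapshot rollback my-resource my-snapshot

It then fails with: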

Description:
    (Node: 'vc-swarm3') Failed to rollback to snapshot pve/vm-700-disk-1_00000_snap_vm-700-disk-1_snap1
Details:
    Command 'lvconvert --config devices { filter=['a|/dev/sdg3|','r|.*|'] } --merge pve/vm-700-disk-1_00000_snap_vm-700-disk-1_snap1' returned with exitcode 5. 
    
    Standard out: 
    
    
    Error message: 
      pve/vm-700-disk-1_00000_snap_vm-700-disk-1_snap1 is not a mergeable logical volume.

And we can see that we end up with a snapshot that is no longer “linked” to its parent LV:

 vm-700-disk-1_00000                                          pve Vwi---tz--  10,00g thin-hdd                                                            
  vm-700-disk-1_00000_snap_vm-700-disk-1_snap1                 pve Vwi-a-tz-k  10,00g thin-hdd                     100,00        
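
A quick way to check this on the node is to list the origin field explicitly (the VG name pve is taken from the output above; a thin snapshot with an empty Origin column can no longer be merged back with lvconvert --merge):

lvs -o lv_name,origin,lv_attr,pool_lv pve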

We have three clusters in that situation, one with these package versions:

root@vc-swarm1:~# dpkg -l | grep linstor
ii  linstor-client                       1.18.0-1                                all          Linstor client command line tool
ii  linstor-common                       1.22.0-1                                all          DRBD distributed resource management utility
ii  linstor-controller                   1.22.0-1                                all          DRBD distributed resource management utility
ii  linstor-proxmox                      7.0.0-1                                 all          DRBD distributed resource management utility
hi  linstor-satellite                    1.22.0-1                                all          DRBD distributed resource management utility
ii  python-linstor                       1.18.0-1                                all          Linstor python api library

And the two others with:

└─$ dpkg -l | grep linstor
ii  linstor-client                       1.23.0-1                            all          Linstor client command line tool
hi  linstor-common                       1.29.0-1                            all          DRBD distributed resource management utility
ii  linstor-controller                   1.29.0-1                            all          DRBD distributed resource management utility
hi  linstor-proxmox                      8.0.4-1                             all          DRBD distributed resource management utility
ii  linstor-satellite                    1.29.0-1                            all          DRBD distributed resource management utility
ii  python-linstor                       1.23.0-1                            all          Linstor python api library

Hello,

Thank you for the report. We are aware of this issue. We have a few ideas we are currently testing that could address it.

Some technical background: if you have an LVM volume and create, let’s say, two snapshots of it, both snapshots will have the original volume as their “origin”. If you now run a linstor snapshot rollback (which internally runs lvconvert --merge $vg/$snapshot, as the error message states), it merges the snapshot into its origin. So far everything is as expected. After this command, two things have changed. First, the LVM snapshot is now gone (since it got merged), but LINSTOR simply creates a new snapshot to compensate for that. The second point is more problematic: the second snapshot we created in the beginning, which was completely untouched by our linstor snapshot rollback and lvconvert --merge commands, has also “lost” its origin. The data is still there and intact, but this second snapshot can no longer be “merged” into the already rolled-back volume.
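
To illustrate this with plain LVM-thin commands (a minimal sketch; the VG, pool, and LV names are placeholders and are not what LINSTOR creates):

# Hypothetical VG "vg0" with a thin pool and one thin volume.
lvcreate -L 10G -T vg0/tpool
lvcreate -V 5G -T vg0/tpool -n vol0
# Two thin snapshots; both list vol0 as their Origin.
lvcreate -s -n snap1 vg0/vol0
lvcreate -s -n snap2 vg0/vol0
lvs -o lv_name,origin vg0

# "Roll back" to snap1 by merging it into its origin, which is what
# LINSTOR does internally. The origin must be inactive, otherwise the
# merge is deferred until its next activation.
lvchange -an vg0/vol0
lvconvert --merge vg0/snap1

# snap1 is gone (it got merged); snap2 now shows an empty Origin column,
# so a later "lvconvert --merge vg0/snap2" fails with
# "not a mergeable logical volume".
lvs -o lv_name,origin vg0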

As a workaround, instead of using linstor snapshot rollback for LVM-thin setups, you can manually delete the resources (not the resource-definition) and use linstor snapshot resource restore --fr $rsc --tr $rsc --fs $snapshot to restore the snapshot into the same resource-definition.

This is actually one of the approaches we are testing right now. The idea sounds fine for LVM-thin, but unfortunately it does not work that easily on ZFS setups, since there you cannot simply delete a volume while it still has snapshots (there is a very strict parent-child dependency between ZVOLs and their snapshots).
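
A minimal sketch of that workaround with the linstor client (node, resource, and snapshot names are placeholders; stop the VM first and adjust the names to your setup):

# Delete the resource on every node, but keep the resource-definition
# (and therefore its snapshots).
linstor resource delete node-a my-resource
linstor resource delete node-b my-resource
linstor resource delete node-c my-resource

# Restore the snapshot into the same, now empty, resource-definition.
linstor snapshot resource restore --fr my-resource --tr my-resource --fs my-snapshot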

Hello @ghernadi, and thanks for your answer. We will do some tests with the suggested method and report back in the next few days!

That sounds… scary.

Isn’t LVM-THIN like THE storage pool technology for Linstor? How is it possible then that this issue was not noticed before?

Hi @ghernadi

Thank you for your input :slight_smile: resource restore works as expected!

For Proxmox users, we wrote a little script that restores a selected snapshot.

We can't upload it here, so see https://gist.github.com/cduchenoy/36d57d7d089f3553e15ba37fb7ed8258

Script output:

$ geco-linstor-pve-snap rollback --vmid 100 --snap snap1

Rollback DRBD volume snapshot: "snap1" - Proxmox VMID: "100"
Found snap snap1 on drbd resource pm-56c30604
Stop VMID 100
UPID:vcg12:000CA929:061858C5:674838B5:qmstop:100:root@pam:

Remove drbd volume: pm-56c30604 on vcg11
SUCCESS:
Description:
    Node: vcg11, Resource: pm-56c30604 preparing for deletion.
...

Remove drbd volume: pm-56c30604 on vcg12
SUCCESS:
Description:
    Node: vcg12, Resource: pm-56c30604 preparing for deletion.
...

Remove drbd volume: pm-56c30604 on vcg13
SUCCESS:
Description:
    Node: vcg13, Resource: pm-56c30604 preparing for deletion.

Restore snap snap1 on pm-56c30604
...
+------------------------------------------------------------------+
| ResourceName | Port | ResourceGroup       | State | Proxmox VMID |
|==================================================================|
| pm-56c30604  | 7004 | vmdisks-ssd-volumes | ok    | 100          |
+------------------------------------------------------------------+
+------------------------------------------------------------------------------------------------------------------+
| Node  | Resource    | StoragePool              | VolNr | MinorNr | DeviceName    | Allocated | InUse  |    State |
|==================================================================================================================|
| vcg12 | pm-56c30604 | pool-ssd-vmdisks-volumes |     0 |    1005 | /dev/drbd1005 |   630 KiB | Unused | UpToDate |
| vcg13 | pm-56c30604 | pool-ssd-vmdisks-volumes |     0 |    1005 | /dev/drbd1005 |   630 KiB | Unused | UpToDate |
+------------------------------------------------------------------------------------------------------------------+

+--------------------------------------------------------------------------------------------------------------------------------------------------+
| Resource    | Issue                                                                  | Possible fix                                              |
|==================================================================================================================================================|
| pm-56c30604 | Resource has 2 replicas but no tie-breaker, could lead to split brain. | linstor rd ap --drbd-diskless --place-count 1 pm-56c30604 |
+--------------------------------------------------------------------------------------------------------------------------------------------------+

Rollback Proxmox VM Config snapshot: "snap1" - Proxmox VMID: "100"
- Prepare new current block config
- Prepare block config for snap "snap1"
- Prepare block config for snap "snap2"
- Prepare block config for snap "snap3"

Enable new VM config...


I also noticed that snapshot rollback doesn’t work on an LVM-thin storage pool. It appears to do the disk rollback correctly, but it won’t roll back memory and just stops the VM.

The error message is:

Error: start failed: QEMU exited with code 1
…
blockdev: cannot open /dev/drbd/by-res/pm-e08830a6/0: No data available
kvm: -drive file=/dev/drbd/by-res/pm-e08830a6/0,if=none,id=drive-scsi0,format=raw,cache=none,aio=io_uring,detect-zeroes=on: Could not open '/dev/drbd/by-res/pm-e08830a6/0': No data available

Is this the same issue as above and can we expect this to be fixed?

I was very enthusiastic about the Linstor/Proxmox integration when I started the evaluation, but the nasty lowercase snapshot name issue, the inability to move storage (without temporarily tampering with storage.cfg), and the inability to roll back snapshots are disappointing.