Snap restore on Proxmox with LVM-thin as storage pool seems unstable

Hi,

We have several Proxmox installs with LINSTOR as the SDS, and we found that if you take a snapshot and roll back with ZFS as the storage pool, everything works flawlessly.
However, when you try the same thing with an LVM-thin storage pool, it fails almost every time. Sometimes it works, but when it does not you are left with stuck resources.

We tried:

  • powered-on and powered-off VMs
  • the oldest snapshot
  • the most recent one
  • a snapshot in the middle of the “tree”
  • DrbdOptions/Net/allow-two-primaries set to yes or no: same problem

We did not find a reliably working method of rolling back snapshots via the Proxmox web UI.

I will try the same with the LINSTOR CLI and report back.

EDIT: Yes, same problem with the linstor snapshot rollback command.
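For reference, the rollback via the LINSTOR CLI looks roughly like this (the resource and snapshot names below are approximated from the error message, not copied verbatim):

linstor snapshot list
linstor snapshot rollback vm-700-disk-1 snap_vm-700-disk-1_snap1

which then fails with: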

Description:
    (Node: 'vc-swarm3') Failed to rollback to snapshot pve/vm-700-disk-1_00000_snap_vm-700-disk-1_snap1
Details:
    Command 'lvconvert --config devices { filter=['a|/dev/sdg3|','r|.*|'] } --merge pve/vm-700-disk-1_00000_snap_vm-700-disk-1_snap1' returned with exitcode 5. 
    
    Standard out: 
    
    
    Error message: 
      pve/vm-700-disk-1_00000_snap_vm-700-disk-1_snap1 is not a mergeable logical volume.

And we can see that we end up with a snapshot that is no longer “linked” to the parent LVM volume:

 vm-700-disk-1_00000                                          pve Vwi---tz--  10,00g thin-hdd                                                            
  vm-700-disk-1_00000_snap_vm-700-disk-1_snap1                 pve Vwi-a-tz-k  10,00g thin-hdd                     100,00        

We have three clusters in this situation. One of them looks like this:

root@vc-swarm1:~# dpkg -l | grep linstor
ii  linstor-client                       1.18.0-1                                all          Linstor client command line tool
ii  linstor-common                       1.22.0-1                                all          DRBD distributed resource management utility
ii  linstor-controller                   1.22.0-1                                all          DRBD distributed resource management utility
ii  linstor-proxmox                      7.0.0-1                                 all          DRBD distributed resource management utility
hi  linstor-satellite                    1.22.0-1                                all          DRBD distributed resource management utility
ii  python-linstor                       1.18.0-1                                all          Linstor python api library

And the other two:

└─$ dpkg -l | grep linstor
ii  linstor-client                       1.23.0-1                            all          Linstor client command line tool
hi  linstor-common                       1.29.0-1                            all          DRBD distributed resource management utility
ii  linstor-controller                   1.29.0-1                            all          DRBD distributed resource management utility
hi  linstor-proxmox                      8.0.4-1                             all          DRBD distributed resource management utility
ii  linstor-satellite                    1.29.0-1                            all          DRBD distributed resource management utility
ii  python-linstor                       1.23.0-1                            all          Linstor python api library

Hello,

Thank you for the report. We are aware of this issue and have a few ideas we are currently testing that could address it.

Some technical background: if you have an LVM volume and create, let’s say, 2 snapshots of it, both snapshots will have the original volume as their “origin”. If you now run a linstor snapshot rollback (which internally runs a lvconvert --merge $vg/$snapshot, as the error message states), it merges the snapshot into its origin. So far everything is as expected.

After this command two things have changed. First, the LVM snapshot is now gone (since it got merged), but LINSTOR simply creates a new snapshot to “fix” this point. The second point is more problematic: the second snapshot we created in the beginning, which was completely untouched by our linstor snapshot rollback and lvconvert --merge commands, also “lost” its origin. The data is still there and fine, but this second snapshot can no longer be “merged” into the already rolled-back volume.
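To illustrate with plain LVM-thin commands (the volume group, thin pool and LV names here are made up, and the comments describe the behavior explained above rather than verbatim output):

# create a thin volume and two thin snapshots of it
lvcreate -V 10G --thinpool thin-hdd -n vol0 pve
lvcreate -s -n snap1 pve/vol0
lvcreate -s -n snap2 pve/vol0
lvs -o lv_name,origin pve        # both snap1 and snap2 list vol0 as their origin

# this is what "linstor snapshot rollback" runs internally to roll back to snap1
# (if vol0 is active and in use, the merge is deferred until its next activation)
lvconvert --merge pve/snap1

# snap1 is gone (it was merged), and snap2 no longer lists vol0 as its origin,
# so trying to roll back to snap2 now fails with "not a mergeable logical volume"
lvs -o lv_name,origin pve
lvconvert --merge pve/snap2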

As a workaround, instead of using linstor snapshot rollback for LVM-THIN setups, you can manually delete the resources (not the -definition) and then restore the snapshot into the same resource-definition with linstor snapshot resource restore --fr $rsc --tr $rsc --fs $snapshot.
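Roughly, with node, resource and snapshot names as placeholders, that would look like this:

# delete the resource on every node that has it, but keep the resource-definition
linstor resource delete node1 vm-700-disk-1
linstor resource delete node2 vm-700-disk-1
linstor resource delete node3 vm-700-disk-1

# restore the snapshot into the same, now empty, resource-definition
linstor snapshot resource restore --fr vm-700-disk-1 --fs snap1 --tr vm-700-disk-1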

This is actually one of the ideas we are testing right now. It sounds fine on LVM-THIN, but unfortunately does not work that easily on ZFS setups, since there you cannot just delete the volume while it has snapshots (there is a very strict parent-child dependency between ZVOLs and their snapshots).

Hello @ghernadi, and thanks for your answer. We will do some tests with the suggested method and report back in the next few days!

That sounds… scary.

Isn’t LVM-THIN like THE storage pool technology for Linstor? How is it possible then that this issue was not noticed before?