Resources Stuck in DELETING/UNKNOWN/CloningFailed States - Cannot Delete

Environment:

  • LINSTOR Version: 1.32.1 (Satellite), 1.26.1 (Client)

  • Storage Backend: ZFS (ZFS_THIN pools)

  • Setup: Multiple nodes (haserver62, haserver66, haserver64, etc.) with nvme0-nvme3 ZFS pools

  • DRBD Version: 9.x

  • OS: Ubuntu 24.04

Problem Summary:

We have 176+ resources stuck in various problematic states (DELETING, CloningFailed, Unknown), some for several months. These resources cannot be deleted through normal LINSTOR commands; deletion attempts end in client socket timeouts. The problem started after cloning operations and appears to be related to ZFS snapshot management and LINSTOR’s rename-based deletion strategy.

Symptoms:

  1. Resources stuck in DELETING state since September 2025

  2. Resources showing CloningFailed on one node, Unknown on peer node

  3. linstor resource-definition delete commands time out after 300+ seconds

  4. Resources show “Ok” status but remain in DELETING state

  5. Physical ZFS volumes exist but LINSTOR cannot delete them

  6. Inconsistent states between replicas on different nodes
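
The report numbers and satellite log excerpts below were collected with roughly the following commands (standard journalctl and LINSTOR client calls; the report ID is one example from our environment):

# satellite-side log excerpts
journalctl -u linstor-satellite | grep -E 'Failed to rename|Clone'

# full error reports referenced by the satellite
linstor error-reports list
linstor error-reports show 69363C80-33AF4-000303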

Example of Stuck Resources:

┊ Disk_1654911  ┊ haserver62  ┊ DRBD,STORAGE ┊  ┊ Ok  ┊ CloningFailed ┊ 2025-12-03 10:20:55 ┊
┊ Disk_1654911  ┊ haserver66  ┊ DRBD,STORAGE ┊  ┊     ┊ Unknown       ┊ 2025-12-03 10:15:33 ┊
┊ Disk_1654927  ┊ haserver62  ┊ DRBD,STORAGE ┊  ┊ Ok  ┊ DELETING      ┊ 2025-12-03 12:57:03 ┊
┊ Disk_1654927  ┊ haserver66  ┊ DRBD,STORAGE ┊  ┊ Ok  ┊ DELETING      ┊ 2025-12-03 12:52:03 ┊

Note: Resources show different states on different nodes (CloningFailed vs Unknown), making recovery difficult.

Error Patterns from Satellite Logs:

1. ZFS Rename Failures:

Dec 08 03:05:43 HASERVER62 Satellite[1244981]: 2025-12-08 03:05:43.563 [DeviceManager] ERROR LINSTOR/Satellite/f0a2ba SYSTEM - Failed to rename zfs volume from 'nvme3/Disk_1654901_00000' to 'nvme3/_deleted_Disk_1654901_00000_2025-12-03T08-27-41-285' [Report number 69363C80-33AF4-000303]

2. ZFS Volume Already Deleted (but LINSTOR unaware):

Dec 07 11:22:54 haserver62 Satellite[2741089]: 2025-12-07 11:22:54.394 [DeviceManager] INFO LINSTOR/Satellite/09ee5a SYSTEM - Volume number 0 of resource 'Disk_1654933' [ZFS-Thin] deleted

3. Clone Operations Creating Safety Snapshots:

Dec 08 02:47:23 haserver66 Satellite[3770293]: 2025-12-08 02:47:23.382 [DeviceManager] INFO LINSTOR/Satellite/4a9b3b SYSTEM - Clone base snapshot Disk_1654733_00000@CF_2b68d0a already found, reusing.
Dec 08 02:47:23 haserver66 Satellite[3770293]: 2025-12-08 02:47:23.382 [DeviceManager] INFO LINSTOR/Satellite/4a9b3b SYSTEM - Lv snapshot created Disk_1654733_00000/Disk_1654732
Dec 08 02:47:23 haserver66 Satellite[3770293]: 2025-12-08 02:47:23.510 [DeviceManager] ERROR LINSTOR/Satellite/4a9b3b SYSTEM - Failed to rename zfs volume from 'nvme2/Disk_1654733_00000' to 'nvme2/_deleted_Disk_1654733_00000_2025-12-03T04-25-04-339' [Report number 69363C16-5D9E6-000003]
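
For reference, this is how we inspect the snapshot/clone dependencies on a dataset named in a failed rename (dataset name taken from the log above; the clones property is only populated on snapshots):

# does the volume still have snapshots, and do any of them have clones?
zfs list -r -t snapshot -o name,clones nvme2/Disk_1654733_00000

# is the volume itself a clone of some other snapshot?
zfs get origin nvme2/Disk_1654733_00000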

Root Cause Analysis:

From error reports, we identified that LINSTOR is:

  1. Creating safety snapshots during deletion: Disk_1654733_00000@CF_2b68d0a

  2. Creating clones from snapshots: Disk_1654733_00000/Disk_1654732

  3. Attempting to RENAME volumes instead of destroying them: nvme2/Disk_1654733_00000 -> nvme2/_deleted_Disk_1654733_00000_<timestamp>

  4. Rename operations failing due to dependent snapshots/clones

  5. Leaving orphaned clones and snapshots that block future operations
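
As far as we can reconstruct from the logs, the ZFS-level equivalent of what the satellite attempts is roughly the following. This is our reading of the log excerpts, not taken from LINSTOR source; the clone target name in step 2 is an assumption based on the usual _00000 suffix convention:

# 1. safety snapshot on the clone source
zfs snapshot nvme2/Disk_1654733_00000@CF_2b68d0a

# 2. clone for the new resource (target name assumed)
zfs clone nvme2/Disk_1654733_00000@CF_2b68d0a nvme2/Disk_1654732_00000

# 3. rename instead of destroy on deletion (this is the step that fails for us)
zfs rename nvme2/Disk_1654733_00000 nvme2/_deleted_Disk_1654733_00000_<timestamp>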

Clone Operation Errors:

Missing Template Snapshot:

ErrorContext:
  Description: Clone command failed
  Cause: cannot open 'nvme3/ubuntu-24_04-x86_64_img_00000@CF_2d2bfcd': dataset does not exist
         cannot receive: failed to read from stream

Error message: None 0 exit from: [timeout, 0, bash, -c, set -o pipefail; zfs send --embed --large-block nvme3/ubuntu-24_04-x86_64_img_00000@CF_2d2bfcd | zfs receive -F nvme3/Disk_1654933_00000]

Dataset Already Exists (partial clone):

Details: Command 'zfs create -s -V 3670840KB nvme3/Disk_1654933_00000' returned with exitcode 1.
Error message: cannot create 'nvme3/Disk_1654933_00000': dataset already exists
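
Before retrying a failed clone, this is roughly how we check both failure modes above (missing template snapshot and half-created target); dataset names are the ones from the errors:

# which CF_ snapshots actually exist on the template image?
zfs list -r -t snapshot nvme3/ubuntu-24_04-x86_64_img_00000

# preview removal of the partially received / already existing target
zfs destroy -nv nvme3/Disk_1654933_00000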

DRBD Timeout Warnings:

We also see DRBD timeout configuration warnings that may be contributing:

ERROR REPORT 69356261-33AF4-000002
Error message: The external command 'drbdsetup' exited with error code 5

Details: The full command line executed was:
drbdsetup wait-connect-resource --wait-after-sb=yes --wfc-timeout=10 Disk_1654579

The external command sent the following error information:
degr-wfc-timeout has to be shorter than wfc-timeout
degr-wfc-timeout implicitly set to wfc-timeout (10s)
outdated-wfc-timeout has to be shorter than degr-wfc-timeout
outdated-wfc-timeout implicitly set to degr-wfc-timeout (10s)
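
For completeness, this is how we look at the DRBD configuration and state for the affected resource (LINSTOR writes the generated .res files under /var/lib/linstor.d/ on our satellites; the path may differ on other setups):

# generated resource file for the affected resource
cat /var/lib/linstor.d/Disk_1654579.res

# runtime connection state
drbdsetup status Disk_1654579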

What We’ve Tried:

  1. Standard deletion commands: linstor resource-definition delete <resource> - Results in 300s+ timeout

  2. Async deletion: linstor resource-definition delete --async <resource> - Same timeout

  3. Per-node deletion: linstor resource delete <node> <resource> - Hangs

  4. Checking ZFS volumes directly:

    zfs list -t all | grep Disk_1654933
    nvme2/Disk_1654933_00000    56K  8.58T  56K  -

    • Volumes exist and can be destroyed manually with zfs destroy -r

  5. Restarting satellite services - No improvement

  6. Checking for blocking processes - None found
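
For item 6, this is roughly how we looked for processes holding the zvol device nodes open (using the standard /dev/zvol/<pool>/<dataset> device paths; adjust names as needed):

# check for processes holding the zvol device node open
fuser -v /dev/zvol/nvme3/Disk_1654933_00000
lsof /dev/zvol/nvme3/Disk_1654933_00000

# DRBD can also hold the backing device open
drbdadm status Disk_1654933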

Questions:

  1. Why is LINSTOR using a RENAME strategy instead of direct zfs destroy for deletions?

  2. How do we handle resources stuck in “CloningFailed” state on one node and “Unknown” on another?

  3. Is there a way to set StorDriver/Zfs/DeleteStrategy property in LINSTOR 1.26.1/1.32.1?

    • We get “Invalid property key: StorDriver/Zfs/DeleteStrategy” / “not whitelisted” error (the exact invocation is shown after this list)

    • Available properties only show: StorDriver/StorPoolName, StorDriver/internal/AllocationGranularity

  4. How can we safely recover from this state without losing production data on other resources?

  5. Should we upgrade LINSTOR to a newer version (1.31.1+) that supports the DeleteStrategy property?

  6. What’s the proper way to clean up resources when clone operations fail mid-process?

  7. Why do cloned resources leave dependent snapshots that prevent deletion?
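
For question 3, this is the invocation we tried (property key taken from the error message; the value shown is a guess, since we never got past the whitelist check):

# rejected with "Invalid property key ... not whitelisted"
linstor storage-pool set-property haserver62 sp_nvme3 StorDriver/Zfs/DeleteStrategy DELETE

# only the two keys mentioned above are listed for the pool
linstor storage-pool list-properties haserver62 sp_nvme3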

Current State:

linstor r l | grep -E 'DELETING|CloningFailed|Unknown' | wc -l
176+
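
A rough per-state breakdown, for what it is worth:

# counts per stuck state
linstor r l | grep -oE 'DELETING|CloningFailed|Unknown' | sort | uniq -c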

Affected Nodes and Pools:

  • Primary nodes: haserver62, haserver66, haserver64

  • Storage pools: sp_nvme0, sp_nvme1, sp_nvme2, sp_nvme3 (ZFS_THIN)

  • ZFS pools: nvme0, nvme1, nvme2, nvme3

Manual ZFS Verification:

Physical volumes exist but LINSTOR cannot delete them:

# On haserver62:
zfs list -t all | grep Disk_1654933
nvme3/Disk_1654933_00000                          56K  8.87T  56K  -
nvme3/ubuntu-24_04-x86_64_img_00000            3.52G  8.87T  3.52G  -
nvme3/ubuntu-24_04-x86_64_img_00000@CF_*      (multiple snapshots)

# Dry-run deletion succeeds (zfs destroy -n only checks, it does not actually destroy):
zfs destroy -n nvme3/Disk_1654933_00000  # returns no errors
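
The complementary check on the LINSTOR side, confirming the controller still tracks the resource even though the backing data is trivial to remove:

linstor resource list | grep Disk_1654933
linstor volume list | grep Disk_1654933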

Workaround Attempts:

  1. We can manually delete ZFS volumes with zfs destroy -r, but LINSTOR metadata remains stuck (see the cleanup sketch after this list)

  2. Using linstor node lost/restore seems drastic for a production environment with active VMs

  3. Cannot set ZFS delete strategy due to LINSTOR version limitations
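
To make workaround 1 concrete, this is the manual ZFS-side cleanup order we are considering. It is only a sketch (dataset names are examples from above, and the clone name is hypothetical); we would much prefer a supported procedure that also clears the LINSTOR metadata:

# 1. find snapshots of the volume and any clones depending on them
zfs list -r -t snapshot -o name,clones nvme3/Disk_1654933_00000

# 2. destroy dependent clones first, or promote one of them to detach the origin
zfs promote nvme3/<clone-dataset>        # hypothetical clone name

# 3. preview, then destroy the volume and its remaining snapshots
zfs destroy -rnv nvme3/Disk_1654933_00000
zfs destroy -r  nvme3/Disk_1654933_00000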

Request for Help:

  1. What’s the recommended approach to clear these stuck resources without impacting production?

  2. Is there a known issue with ZFS rename-based deletion in LINSTOR 1.26.1/1.32.1?

  3. Should we upgrade to LINSTOR 1.31.1+ before attempting cleanup?

  4. How do we resolve “CloningFailed” and “Unknown” states?

  5. Are there any database-level cleanup procedures for stuck resource metadata?

  6. What’s the best practice for handling failed clone operations?

Any guidance would be greatly appreciated!


Additional Information Available:

  • Full error reports from /var/log/linstor-satellite/ErrorReport*

  • Complete resource listings (176+ stuck resources)

  • ZFS pool configurations and volume lists

  • DRBD status outputs

  • Controller and satellite logs