Environment:
- LINSTOR Version: 1.32.1 (Satellite), 1.26.1 (Client)
- Storage Backend: ZFS (ZFS_THIN pools)
- Setup: Multiple nodes (haserver62, haserver66, haserver64, etc.) with nvme0-nvme3 ZFS pools
- DRBD Version: 9.x
- OS: Ubuntu 24.04
Problem Summary:
We have 176+ resources stuck in various problematic states (DELETING, CloningFailed, Unknown), some for several months. These resources cannot be deleted through normal LINSTOR commands and cause socket timeouts. The issue started after cloning operations and appears to be related to ZFS snapshot management and LINSTOR’s rename-based deletion strategy.
Symptoms:
- Resources stuck in DELETING state since September 2025
- Resources showing CloningFailed on one node, Unknown on peer node
- linstor resource-definition delete commands time out after 300+ seconds
- Resources show “Ok” status but remain in DELETING state
- Physical ZFS volumes exist but LINSTOR cannot delete them
- Inconsistent states between replicas on different nodes
Example of Stuck Resources:
┊ Disk_1654911 ┊ haserver62 ┊ DRBD,STORAGE ┊ ┊ Ok ┊ CloningFailed ┊ 2025-12-03 10:20:55 ┊
┊ Disk_1654911 ┊ haserver66 ┊ DRBD,STORAGE ┊ ┊ ┊ Unknown ┊ 2025-12-03 10:15:33 ┊
┊ Disk_1654927 ┊ haserver62 ┊ DRBD,STORAGE ┊ ┊ Ok ┊ DELETING ┊ 2025-12-03 12:57:03 ┊
┊ Disk_1654927 ┊ haserver66 ┊ DRBD,STORAGE ┊ ┊ Ok ┊ DELETING ┊ 2025-12-03 12:52:03 ┊
Note: Resources show different states on different nodes (CloningFailed vs Unknown), making recovery difficult.
Error Patterns from Satellite Logs:
1. ZFS Rename Failures:
Dec 08 03:05:43 HASERVER62 Satellite[1244981]: 2025-12-08 03:05:43.563 [DeviceManager] ERROR LINSTOR/Satellite/f0a2ba SYSTEM - Failed to rename zfs volume from 'nvme3/Disk_1654901_00000' to 'nvme3/_deleted_Disk_1654901_00000_2025-12-03T08-27-41-285' [Report number 69363C80-33AF4-000303]
2. ZFS Volume Already Deleted (but LINSTOR unaware):
Dec 07 11:22:54 haserver62 Satellite[2741089]: 2025-12-07 11:22:54.394 [DeviceManager] INFO LINSTOR/Satellite/09ee5a SYSTEM - Volume number 0 of resource 'Disk_1654933' [ZFS-Thin] deleted
3. Clone Operations Creating Safety Snapshots:
Dec 08 02:47:23 haserver66 Satellite[3770293]: 2025-12-08 02:47:23.382 [DeviceManager] INFO LINSTOR/Satellite/4a9b3b SYSTEM - Clone base snapshot Disk_1654733_00000@CF_2b68d0a already found, reusing.
Dec 08 02:47:23 haserver66 Satellite[3770293]: 2025-12-08 02:47:23.382 [DeviceManager] INFO LINSTOR/Satellite/4a9b3b SYSTEM - Lv snapshot created Disk_1654733_00000/Disk_1654732
Dec 08 02:47:23 haserver66 Satellite[3770293]: 2025-12-08 02:47:23.510 [DeviceManager] ERROR LINSTOR/Satellite/4a9b3b SYSTEM - Failed to rename zfs volume from 'nvme2/Disk_1654733_00000' to 'nvme2/_deleted_Disk_1654733_00000_2025-12-03T04-25-04-339' [Report number 69363C16-5D9E6-000003]
Root Cause Analysis:
From error reports, we identified that LINSTOR is:
- Creating safety snapshots during deletion: Disk_1654733_00000@CF_2b68d0a
- Creating clones from snapshots: Disk_1654733_00000/Disk_1654732
- Attempting to RENAME volumes instead of destroying them: nvme2/Disk_1654733_00000 → nvme2/_deleted_Disk_1654733_00000_<timestamp>
- Rename operations failing due to dependent snapshots/clones (see the dependency check sketched below)
- Leaving orphaned clones and snapshots that block future operations
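If the renames are indeed blocked by dependent snapshots/clones, a minimal dry-run check per stuck volume would look like this (dataset and snapshot names taken from the log excerpts above; adjust per pool and resource):
# Snapshots still present under the volume that failed to rename:
zfs list -t snapshot -r -o name nvme2/Disk_1654733_00000
# Clones (if any) that still depend on the safety snapshot:
zfs get -H -o value clones nvme2/Disk_1654733_00000@CF_2b68d0a
# Dry run of a recursive destroy, listing everything ZFS would have to remove:
zfs destroy -nv -r nvme2/Disk_1654733_00000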
Clone Operation Errors:
Missing Template Snapshot:
ErrorContext:
Description: Clone command failed
Cause: cannot open 'nvme3/ubuntu-24_04-x86_64_img_00000@CF_2d2bfcd': dataset does not exist
cannot receive: failed to read from stream
Error message: None 0 exit from: [timeout, 0, bash, -c, set -o pipefail; zfs send --embed --large-block nvme3/ubuntu-24_04-x86_64_img_00000@CF_2d2bfcd | zfs receive -F nvme3/Disk_1654933_00000]
Dataset Already Exists (partial clone):
Details: Command 'zfs create -s -V 3670840KB nvme3/Disk_1654933_00000' returned with exitcode 1.
Error message: cannot create 'nvme3/Disk_1654933_00000': dataset already exists
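Both clone failure modes can be verified directly on the affected node; a quick check using the names from the error reports above:
# Does the template snapshot the clone expects still exist?
zfs list -t snapshot -o name nvme3/ubuntu-24_04-x86_64_img_00000@CF_2d2bfcd
# Was a partial target dataset left behind by the failed attempt?
zfs list nvme3/Disk_1654933_00000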
DRBD Timeout Warnings:
We also see DRBD timeout configuration warnings that may be contributing:
ERROR REPORT 69356261-33AF4-000002
Error message: The external command 'drbdsetup' exited with error code 5
Details: The full command line executed was:
drbdsetup wait-connect-resource --wait-after-sb=yes --wfc-timeout=10 Disk_1654579
The external command sent the following error information:
degr-wfc-timeout has to be shorter than wfc-timeout
degr-wfc-timeout implicitly set to wfc-timeout (10s)
outdated-wfc-timeout has to be shorter than degr-wfc-timeout
outdated-wfc-timeout implicitly set to degr-wfc-timeout (10s)
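As far as we can tell, these warnings only describe the constraint wfc-timeout > degr-wfc-timeout > outdated-wfc-timeout, with unset values being clamped to the next one up. Purely as an illustration (LINSTOR generates this command itself; the values below are placeholders, not a recommendation):
# Hypothetical timeouts satisfying wfc > degr-wfc > outdated-wfc
drbdsetup wait-connect-resource --wait-after-sb=yes --wfc-timeout=10 --degr-wfc-timeout=8 --outdated-wfc-timeout=5 Disk_1654579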
What We’ve Tried:
- Standard deletion commands: linstor resource-definition delete <resource> - results in 300s+ timeout
- Async deletion: linstor resource-definition delete --async <resource> - same timeout
- Per-node deletion: linstor resource delete <node> <resource> - hangs
- Checking ZFS volumes directly:
  zfs list -t all | grep Disk_1654933
  nvme2/Disk_1654933_00000   56K  8.58T  56K  -
  Volumes exist and can be destroyed manually with zfs destroy -r
- Restarting satellite services - no improvement
- Checking for blocking processes - none found
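For completeness, the full error reports referenced above can also be pulled from the controller (report ID taken from the rename failure log):
linstor error-reports list
linstor error-reports show 69363C80-33AF4-000303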
Questions:
- Why is LINSTOR using a RENAME strategy instead of a direct zfs destroy for deletions?
- How do we handle resources stuck in “CloningFailed” state on one node and “Unknown” on another?
- Is there a way to set the StorDriver/Zfs/DeleteStrategy property in LINSTOR 1.26.1/1.32.1?
  - We get an “Invalid property key: StorDriver/Zfs/DeleteStrategy” / “not whitelisted” error
  - Available properties only show: StorDriver/StorPoolName, StorDriver/internal/AllocationGranularity
- How can we safely recover from this state without losing production data on other resources?
- Should we upgrade LINSTOR to a newer version (1.31.1+) that supports the DeleteStrategy property?
- What’s the proper way to clean up resources when clone operations fail mid-process?
- Why do cloned resources leave dependent snapshots that prevent deletion?
Current State:
linstor r l | grep -E 'DELETING|CloningFailed|Unknown' | wc -l
176+
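The same listing broken down per state (same filter, just counted per match):
linstor r l | grep -oE 'DELETING|CloningFailed|Unknown' | sort | uniq -c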
Affected Nodes and Pools:
- Primary nodes: haserver62, haserver66, haserver64
- Storage pools: sp_nvme0, sp_nvme1, sp_nvme2, sp_nvme3 (ZFS_THIN)
- ZFS pools: nvme0, nvme1, nvme2, nvme3
Manual ZFS Verification:
Physical volumes exist but LINSTOR cannot delete them:
# On haserver62:
zfs list -t all | grep Disk_1654933
nvme3/Disk_1654933_00000 56K 8.87T 56K -
nvme3/ubuntu-24_04-x86_64_img_00000 3.52G 8.87T 3.52G -
nvme3/ubuntu-24_04-x86_64_img_00000@CF_* (multiple snapshots)
# Manual deletion works:
zfs destroy -n nvme3/Disk_1654933_00000 # Returns no errors
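If manual cleanup turns out to be the only way forward, the ordering ZFS enforces appears to be the following (hedged sketch with placeholder names; every destroy dry-run with -n first):
# 1. Find clones still depending on a snapshot of the stuck volume/template
zfs get -H -o value clones <pool>/<dataset>@<snapshot>
# 2. Destroy the dependent clone, or 'zfs promote' it if it has to be kept
zfs destroy -nv <pool>/<clone>
zfs promote <pool>/<clone>
# 3. Then the snapshots and the volume itself can be removed
zfs destroy -nv -r <pool>/<dataset>
zfs destroy -r <pool>/<dataset>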
Workaround Attempts:
- We can manually delete ZFS volumes with zfs destroy -r, but LINSTOR metadata remains stuck
- Using linstor node lost/restore seems drastic for a production environment with active VMs
- Cannot set the ZFS delete strategy due to LINSTOR version limitations
Request for Help:
- What’s the recommended approach to clear these stuck resources without impacting production?
- Is there a known issue with ZFS rename-based deletion in LINSTOR 1.26.1/1.32.1?
- Should we upgrade to LINSTOR 1.31.1+ before attempting cleanup?
- How do we resolve “CloningFailed” and “Unknown” states?
- Are there any database-level cleanup procedures for stuck resource metadata?
- What’s the best practice for handling failed clone operations?
Any guidance would be greatly appreciated!
Additional Information Available:
- Full error reports from /var/log/linstor-satellite/ErrorReport*
- Complete resource listings (176+ stuck resources)
- ZFS pool configurations and volume lists
- DRBD status outputs
- Controller and satellite logs