Resources Stuck in DELETING/UNKNOWN/CloningFailed States - Cannot Delete

Environment:

  • LINSTOR Version: 1.32.1 (Satellite), 1.26.1 (Client)

  • Storage Backend: ZFS (ZFS_THIN pools)

  • Setup: Multiple nodes (haserver62, haserver66, haserver64, etc.) with nvme0-nvme3 ZFS pools

  • DRBD Version: 9.x

  • OS: Ubuntu 24.04

Problem Summary:

We have 176+ resources stuck in various problematic states (DELETING, CloningFailed, Unknown), some for several months. These resources cannot be deleted through normal LINSTOR commands; deletion attempts end in client socket timeouts. The problem started after cloning operations and appears to be related to ZFS snapshot management and LINSTOR’s rename-based deletion strategy.

Symptoms:

  1. Resources stuck in DELETING state since September 2025

  2. Resources showing CloningFailed on one node, Unknown on peer node

  3. linstor resource-definition delete commands time out after 300+ seconds

  4. Resources show “Ok” status but remain in DELETING state

  5. Physical ZFS volumes exist but LINSTOR cannot delete them

  6. Inconsistent states between replicas on different nodes
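
The report numbers and satellite log excerpts below were collected with roughly the following commands (standard journalctl and LINSTOR client calls; the report ID is one example from our environment):

# satellite-side log excerpts
journalctl -u linstor-satellite | grep -E 'Failed to rename|Clone'

# full error reports referenced by the satellite
linstor error-reports list
linstor error-reports show 69363C80-33AF4-000303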

Example of Stuck Resources:

┊ Disk_1654911  ┊ haserver62  ┊ DRBD,STORAGE ┊  ┊ Ok  ┊ CloningFailed ┊ 2025-12-03 10:20:55 ┊
┊ Disk_1654911  ┊ haserver66  ┊ DRBD,STORAGE ┊  ┊     ┊ Unknown       ┊ 2025-12-03 10:15:33 ┊
┊ Disk_1654927  ┊ haserver62  ┊ DRBD,STORAGE ┊  ┊ Ok  ┊ DELETING      ┊ 2025-12-03 12:57:03 ┊
┊ Disk_1654927  ┊ haserver66  ┊ DRBD,STORAGE ┊  ┊ Ok  ┊ DELETING      ┊ 2025-12-03 12:52:03 ┊

Note: Resources show different states on different nodes (CloningFailed vs Unknown), making recovery difficult.

Error Patterns from Satellite Logs:

1. ZFS Rename Failures:

Dec 08 03:05:43 HASERVER62 Satellite[1244981]: 2025-12-08 03:05:43.563 [DeviceManager] ERROR LINSTOR/Satellite/f0a2ba SYSTEM - Failed to rename zfs volume from 'nvme3/Disk_1654901_00000' to 'nvme3/_deleted_Disk_1654901_00000_2025-12-03T08-27-41-285' [Report number 69363C80-33AF4-000303]

2. ZFS Volume Already Deleted (but LINSTOR unaware):

Dec 07 11:22:54 haserver62 Satellite[2741089]: 2025-12-07 11:22:54.394 [DeviceManager] INFO LINSTOR/Satellite/09ee5a SYSTEM - Volume number 0 of resource 'Disk_1654933' [ZFS-Thin] deleted

3. Clone Operations Creating Safety Snapshots:

Dec 08 02:47:23 haserver66 Satellite[3770293]: 2025-12-08 02:47:23.382 [DeviceManager] INFO LINSTOR/Satellite/4a9b3b SYSTEM - Clone base snapshot Disk_1654733_00000@CF_2b68d0a already found, reusing.
Dec 08 02:47:23 haserver66 Satellite[3770293]: 2025-12-08 02:47:23.382 [DeviceManager] INFO LINSTOR/Satellite/4a9b3b SYSTEM - Lv snapshot created Disk_1654733_00000/Disk_1654732
Dec 08 02:47:23 haserver66 Satellite[3770293]: 2025-12-08 02:47:23.510 [DeviceManager] ERROR LINSTOR/Satellite/4a9b3b SYSTEM - Failed to rename zfs volume from 'nvme2/Disk_1654733_00000' to 'nvme2/_deleted_Disk_1654733_00000_2025-12-03T04-25-04-339' [Report number 69363C16-5D9E6-000003]
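
For reference, this is how we inspect the snapshot/clone dependencies on a dataset named in a failed rename (dataset name taken from the log above; the clones property is only populated on snapshots):

# does the volume still have snapshots, and do any of them have clones?
zfs list -r -t snapshot -o name,clones nvme2/Disk_1654733_00000

# is the volume itself a clone of some other snapshot?
zfs get origin nvme2/Disk_1654733_00000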

Root Cause Analysis:

From error reports, we identified that LINSTOR is:

  1. Creating safety snapshots during deletion: Disk_1654733_00000@CF_2b68d0a

  2. Creating clones from snapshots: Disk_1654733_00000/Disk_1654732

  3. Attempting to RENAME volumes instead of destroying them: nvme2/Disk_1654733_00000 -> nvme2/_deleted_Disk_1654733_00000_<timestamp>

  4. Rename operations failing due to dependent snapshots/clones

  5. Leaving orphaned clones and snapshots that block future operations
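
As far as we can reconstruct from the logs, the ZFS-level equivalent of what the satellite attempts is roughly the following. This is our reading of the log excerpts, not taken from LINSTOR source; the clone target name in step 2 is an assumption based on the usual _00000 suffix convention:

# 1. safety snapshot on the clone source
zfs snapshot nvme2/Disk_1654733_00000@CF_2b68d0a

# 2. clone for the new resource (target name assumed)
zfs clone nvme2/Disk_1654733_00000@CF_2b68d0a nvme2/Disk_1654732_00000

# 3. rename instead of destroy on deletion (this is the step that fails for us)
zfs rename nvme2/Disk_1654733_00000 nvme2/_deleted_Disk_1654733_00000_<timestamp>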

Clone Operation Errors:

Missing Template Snapshot:

ErrorContext:
  Description: Clone command failed
  Cause: cannot open 'nvme3/ubuntu-24_04-x86_64_img_00000@CF_2d2bfcd': dataset does not exist
         cannot receive: failed to read from stream

Error message: None 0 exit from: [timeout, 0, bash, -c, set -o pipefail; zfs send --embed --large-block nvme3/ubuntu-24_04-x86_64_img_00000@CF_2d2bfcd | zfs receive -F nvme3/Disk_1654933_00000]

Dataset Already Exists (partial clone):

Details: Command 'zfs create -s -V 3670840KB nvme3/Disk_1654933_00000' returned with exitcode 1.
Error message: cannot create 'nvme3/Disk_1654933_00000': dataset already exists
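
Before retrying a failed clone, this is roughly how we check both failure modes above (missing template snapshot and half-created target); dataset names are the ones from the errors:

# which CF_ snapshots actually exist on the template image?
zfs list -r -t snapshot nvme3/ubuntu-24_04-x86_64_img_00000

# preview removal of the partially received / already existing target
zfs destroy -nv nvme3/Disk_1654933_00000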

DRBD Timeout Warnings:

We also see DRBD timeout configuration warnings that may be contributing:

ERROR REPORT 69356261-33AF4-000002
Error message: The external command 'drbdsetup' exited with error code 5

Details: The full command line executed was:
drbdsetup wait-connect-resource --wait-after-sb=yes --wfc-timeout=10 Disk_1654579

The external command sent the following error information:
degr-wfc-timeout has to be shorter than wfc-timeout
degr-wfc-timeout implicitly set to wfc-timeout (10s)
outdated-wfc-timeout has to be shorter than degr-wfc-timeout
outdated-wfc-timeout implicitly set to degr-wfc-timeout (10s)
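
For completeness, this is how we look at the DRBD configuration and state for the affected resource (LINSTOR writes the generated .res files under /var/lib/linstor.d/ on our satellites; the path may differ on other setups):

# generated resource file for the affected resource
cat /var/lib/linstor.d/Disk_1654579.res

# runtime connection state
drbdsetup status Disk_1654579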

What We’ve Tried:

  1. Standard deletion commands: linstor resource-definition delete <resource> - Results in 300s+ timeout

  2. Async deletion: linstor resource-definition delete --async <resource> - Same timeout

  3. Per-node deletion: linstor resource delete <node> <resource> - Hangs

  4. Checking ZFS volumes directly:

    zfs list -t all | grep Disk_1654933
    nvme2/Disk_1654933_00000    56K  8.58T  56K  -

    • Volumes exist and can be destroyed manually with zfs destroy -r

  5. Restarting satellite services - No improvement

  6. Checking for blocking processes - None found
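
For item 6, this is roughly how we looked for processes holding the zvol device nodes open (using the standard /dev/zvol/<pool>/<dataset> device paths; adjust names as needed):

# check for processes holding the zvol device node open
fuser -v /dev/zvol/nvme3/Disk_1654933_00000
lsof /dev/zvol/nvme3/Disk_1654933_00000

# DRBD can also hold the backing device open
drbdadm status Disk_1654933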

Questions:

  1. Why is LINSTOR using a RENAME strategy instead of direct zfs destroy for deletions?

  2. How do we handle resources stuck in “CloningFailed” state on one node and “Unknown” on another?

  3. Is there a way to set StorDriver/Zfs/DeleteStrategy property in LINSTOR 1.26.1/1.32.1?

    • We get “Invalid property key: StorDriver/Zfs/DeleteStrategy” / “not whitelisted” error (the exact invocation is shown after this list)

    • Available properties only show: StorDriver/StorPoolName, StorDriver/internal/AllocationGranularity

  4. How can we safely recover from this state without losing production data on other resources?

  5. Should we upgrade LINSTOR to a newer version (1.31.1+) that supports the DeleteStrategy property?

  6. What’s the proper way to clean up resources when clone operations fail mid-process?

  7. Why do cloned resources leave dependent snapshots that prevent deletion?
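
For question 3, this is the invocation we tried (property key taken from the error message; the value shown is a guess, since we never got past the whitelist check):

# rejected with "Invalid property key ... not whitelisted"
linstor storage-pool set-property haserver62 sp_nvme3 StorDriver/Zfs/DeleteStrategy DELETE

# only the two keys mentioned above are listed for the pool
linstor storage-pool list-properties haserver62 sp_nvme3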

Current State:

linstor r l | grep -E 'DELETING|CloningFailed|Unknown' | wc -l
176+
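
A rough per-state breakdown, for what it is worth:

# counts per stuck state
linstor r l | grep -oE 'DELETING|CloningFailed|Unknown' | sort | uniq -c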

Affected Nodes and Pools:

  • Primary nodes: haserver62, haserver66, haserver64

  • Storage pools: sp_nvme0, sp_nvme1, sp_nvme2, sp_nvme3 (ZFS_THIN)

  • ZFS pools: nvme0, nvme1, nvme2, nvme3

Manual ZFS Verification:

Physical volumes exist but LINSTOR cannot delete them:

# On haserver62:
zfs list -t all | grep Disk_1654933
nvme3/Disk_1654933_00000                          56K  8.87T  56K  -
nvme3/ubuntu-24_04-x86_64_img_00000            3.52G  8.87T  3.52G  -
nvme3/ubuntu-24_04-x86_64_img_00000@CF_*      (multiple snapshots)

# Dry-run deletion succeeds (zfs destroy -n only checks, it does not actually destroy):
zfs destroy -n nvme3/Disk_1654933_00000  # returns no errors
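
The complementary check on the LINSTOR side, confirming the controller still tracks the resource even though the backing data is trivial to remove:

linstor resource list | grep Disk_1654933
linstor volume list | grep Disk_1654933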

Workaround Attempts:

  1. We can manually delete ZFS volumes with zfs destroy -r, but LINSTOR metadata remains stuck (see the cleanup sketch after this list)

  2. Using linstor node lost/restore seems drastic for a production environment with active VMs

  3. Cannot set ZFS delete strategy due to LINSTOR version limitations
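
To make workaround 1 concrete, this is the manual ZFS-side cleanup order we are considering. It is only a sketch (dataset names are examples from above, and the clone name is hypothetical); we would much prefer a supported procedure that also clears the LINSTOR metadata:

# 1. find snapshots of the volume and any clones depending on them
zfs list -r -t snapshot -o name,clones nvme3/Disk_1654933_00000

# 2. destroy dependent clones first, or promote one of them to detach the origin
zfs promote nvme3/<clone-dataset>        # hypothetical clone name

# 3. preview, then destroy the volume and its remaining snapshots
zfs destroy -rnv nvme3/Disk_1654933_00000
zfs destroy -r  nvme3/Disk_1654933_00000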

Request for Help:

  1. What’s the recommended approach to clear these stuck resources without impacting production?

  2. Is there a known issue with ZFS rename-based deletion in LINSTOR 1.26.1/1.32.1?

  3. Should we upgrade to LINSTOR 1.31.1+ before attempting cleanup?

  4. How do we resolve “CloningFailed” and “Unknown” states?

  5. Are there any database-level cleanup procedures for stuck resource metadata?

  6. What’s the best practice for handling failed clone operations?

Any guidance would be greatly appreciated!


Additional Information Available:

  • Full error reports from /var/log/linstor-satellite/ErrorReport*

  • Complete resource listings (176+ stuck resources)

  • ZFS pool configurations and volume lists

  • DRBD status outputs

  • Controller and satellite logs