Linstor-controller just went down

boedy · August 27, 2024, 6:00pm

Hello, I hope someone can help me.

Our Linstor production cluster just went down and I’m not able to get it back up. The controller is in a crashloop spitting out this error message every time:

ERROR REPORT 66CE118B-00000-000000

============================================================

Application:                        LINBIT® LINSTOR
Module:                             Controller
Version:                            1.28.0
Build ID:                           959382f7b4fb9436fefdd21dfa262e90318edaed
Build time:                         2024-07-11T10:21:06+00:00
Error time:                         2024-08-27 17:49:13
Node:                               linstor-controller-7b9c4ccd45-xk4lf
Thread:                             Main

============================================================

Reported error:
===============

Category:                           Error
Class name:                         ImplementationError
Class canonical name:               com.linbit.ImplementationError
Generated at:                       Method 'loadCoreObjects', Source file 'DatabaseLoader.java', Line #680

Error message:                      Unknown error during loading data from DB

Call backtrace:

    Method                                   Native Class:Line number
    loadCoreObjects                          N      com.linbit.linstor.dbdrivers.DatabaseLoader:680
    loadCoreObjects                          N      com.linbit.linstor.core.DbDataInitializer:169
    initialize                               N      com.linbit.linstor.core.DbDataInitializer:101
    startSystemServices                      N      com.linbit.linstor.core.ApplicationLifecycleManager:87
    start                                    N      com.linbit.linstor.core.Controller:374
    main                                     N      com.linbit.linstor.core.Controller:625

Caused by:
==========

Category:                           LinStorException
Class name:                         DatabaseException
Class canonical name:               com.linbit.linstor.dbdrivers.DatabaseException
Generated at:                       Method 'loadAll', Source file 'AbsDatabaseDriver.java', Line #190

Error message:                      Failed to restore data

ErrorContext:


Call backtrace:

    Method                                   Native Class:Line number
    loadAll                                  N      com.linbit.linstor.dbdrivers.AbsDatabaseDriver:190
    cacheAll                                 N      com.linbit.linstor.core.objects.AbsLayerRscDfnDataDbDriver:55
    cacheAll                                 N      com.linbit.linstor.core.objects.AbsLayerRscDataDbDriver:261
    loadLayerObects                          N      com.linbit.linstor.dbdrivers.DatabaseLoader:728
    loadCoreObjects                          N      com.linbit.linstor.dbdrivers.DatabaseLoader:640
    loadCoreObjects                          N      com.linbit.linstor.core.DbDataInitializer:169
    initialize                               N      com.linbit.linstor.core.DbDataInitializer:101
    startSystemServices                      N      com.linbit.linstor.core.ApplicationLifecycleManager:87
    start                                    N      com.linbit.linstor.core.Controller:374
    main                                     N      com.linbit.linstor.core.Controller:625

Caused by:
==========

Category:                           Exception
Class name:                         ValueInUseException
Class canonical name:               com.linbit.ValueInUseException
Generated at:                       Method 'allocate', Source file 'DynamicNumberPoolImpl.java', Line #124

Error message:                      TCP port 7077 is already in use

Call backtrace:

    Method                                   Native Class:Line number
    allocate                                 N      com.linbit.linstor.numberpool.DynamicNumberPoolImpl:124
    <init>                                   N      com.linbit.linstor.storage.data.adapter.drbd.DrbdRscDfnData:110
    genericCreate                            N      com.linbit.linstor.core.objects.LayerDrbdRscDfnDbDriver:268
    load                                     N      com.linbit.linstor.core.objects.LayerDrbdRscDfnDbDriver:231
    load                                     N      com.linbit.linstor.core.objects.LayerDrbdRscDfnDbDriver:49
    loadAll                                  N      com.linbit.linstor.dbdrivers.k8s.crd.K8sCrdEngine:238
    loadAll                                  N      com.linbit.linstor.dbdrivers.AbsDatabaseDriver:180
    cacheAll                                 N      com.linbit.linstor.core.objects.AbsLayerRscDfnDataDbDriver:55
    cacheAll                                 N      com.linbit.linstor.core.objects.AbsLayerRscDataDbDriver:261
    loadLayerObects                          N      com.linbit.linstor.dbdrivers.DatabaseLoader:728
    loadCoreObjects                          N      com.linbit.linstor.dbdrivers.DatabaseLoader:640
    loadCoreObjects                          N      com.linbit.linstor.core.DbDataInitializer:169
    initialize                               N      com.linbit.linstor.core.DbDataInitializer:101
    startSystemServices                      N      com.linbit.linstor.core.ApplicationLifecycleManager:87
    start                                    N      com.linbit.linstor.core.Controller:374
    main                                     N      com.linbit.linstor.core.Controller:625


END OF ERROR REPORT.

liniac · August 27, 2024, 8:53pm

LINSTOR’s complaint appears to be that it cannot allocate port 7077 because it is already in use. By default, LINSTOR allocates ports for resources from the range 7000-7999, but if something is already running on a port and LINSTOR attempts to assign it, the result is an error.

Is that the case in your situation?

If so, you can change TcpPortAutoRange to a different range that works better for your stack. You mention that the controller isn’t staying up, though. Can you describe the behavior you are seeing in more detail? Is this a LINSTOR in Kubernetes deployment? If not, what does your environment look like (number of controllers and satellites)?
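
The trace above ends in DynamicNumberPoolImpl.allocate while DRBD resource definitions are being loaded, so one possible reading is that the collision happens in LINSTOR’s own bookkeeping of assigned ports. Below is a minimal Python sketch of that idea. It is not LINSTOR’s implementation: the class and exception names are hypothetical, and only the 7000-7999 range and port 7077 come from this thread.

```python
class ValueInUseError(Exception):
    """Raised when a number is claimed twice (hypothetical name)."""


class NumberPool:
    """Tracks which numbers in an inclusive range are already claimed."""

    def __init__(self, low, high):
        self.low = low
        self.high = high
        self._used = set()

    def allocate(self, number):
        """Claim one specific number, as would happen for each stored record at load time."""
        if not self.low <= number <= self.high:
            raise ValueError(f"{number} is outside {self.low}-{self.high}")
        if number in self._used:
            raise ValueInUseError(f"TCP port {number} is already in use")
        self._used.add(number)

    def auto_allocate(self):
        """Claim and return the lowest free number in the range."""
        for candidate in range(self.low, self.high + 1):
            if candidate not in self._used:
                self._used.add(candidate)
                return candidate
        raise ValueInUseError("no free number left in the range")


pool = NumberPool(7000, 7999)
pool.allocate(7077)        # the first record claiming 7077 loads fine
try:
    pool.allocate(7077)    # a second record claiming the same port fails
except ValueInUseError as err:
    print(err)
```

In this model, two stored records that carry the same port are enough to trigger the error at load time, even if nothing else on the host listens on that port.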

boedy · August 27, 2024, 10:06pm

Thanks for your prompt reply.

Fortunately, I was able to restore the LINSTOR controllers within 2 hours of the outage.

Upon further investigation, I found that there were multiple DRBD resource definitions attempting to use the same TCP port (7077), which led to the controller crash. Specifically, the following resource definitions were involved:

{
   "apiVersion":"internal.linstor.linbit.com/v1-27-1",
   "kind":"LayerDrbdResourceDefinitions",
   "metadata":{
      "creationTimestamp":"2024-08-27T16:48:18Z",
      "generation":1,
      "name":"c1c9d57513a2ed77a149efe7aa74b2a0155954b5c0c0da0cc0e82d6a1d6e7126",
      "uid":"a9cee388-eb77-4b0a-b421-7861ff475eb9"
   },
   "spec":{
      "al_stripe_size":32,
      "al_stripes":1,
      "peer_slots":7,
      "resource_name":"PVC-00A767E6-18BF-477D-8595-CDB9BA606B36",
      "resource_name_suffix":"",
      "secret":"yui4uCFI4Ke4wD8QIpvI",
      "snapshot_name":"",
      "tcp_port":7077,
      "transport_type":"IP"
   }
}

{
   "apiVersion":"internal.linstor.linbit.com/v1-27-1",
   "kind":"LayerDrbdResourceDefinitions",
   "metadata":{
      "creationTimestamp":"2024-08-27T16:47:37Z",
      "generation":1,
      "name":"54a30c138618a3fe355253cae97291b0d4457f365b9754278759789bd630518c",
      "uid":"6385ca3a-8e4f-46c3-b285-34332e3779fb"
   },
   "spec":{
      "al_stripe_size":32,
      "al_stripes":1,
      "peer_slots":7,
      "resource_name":"PVC-90F08CA1-DDE8-4A82-9794-D50F411F8712",
      "resource_name_suffix":"",
      "secret":"63zoFVtp1aZkLHGNVVLv",
      "snapshot_name":"",
      "tcp_port":7077,
      "transport_type":"IP"
   }
}
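
A conflict like this can be spotted without starting the controller by grouping the stored definitions by port. The sketch below is a minimal Python example that works on already-parsed objects shaped like the two definitions above. Skipping entries with a non-empty snapshot_name is an assumption (that only live resource definitions claim a port), and PVC-EXAMPLE is a made-up entry for illustration.

```python
from collections import defaultdict


def find_duplicate_ports(items):
    """Map each tcp_port claimed more than once to the resource names claiming it."""
    names_by_port = defaultdict(list)
    for item in items:
        spec = item.get("spec", {})
        # Assumption: snapshot entries (non-empty snapshot_name) do not claim a port.
        if spec.get("snapshot_name"):
            continue
        port = spec.get("tcp_port")
        if port is not None:
            names_by_port[port].append(spec.get("resource_name", "<unknown>"))
    return {port: names for port, names in names_by_port.items() if len(names) > 1}


# Trimmed copies of the two definitions above, plus one hypothetical entry.
items = [
    {"spec": {"resource_name": "PVC-00A767E6-18BF-477D-8595-CDB9BA606B36",
              "snapshot_name": "", "tcp_port": 7077}},
    {"spec": {"resource_name": "PVC-90F08CA1-DDE8-4A82-9794-D50F411F8712",
              "snapshot_name": "", "tcp_port": 7077}},
    {"spec": {"resource_name": "PVC-EXAMPLE", "snapshot_name": "", "tcp_port": 7078}},
]

for port, names in find_duplicate_ports(items).items():
    print(port, names)
```

On a live cluster the input would presumably be the "items" list from `kubectl get layerdrbdresourcedefinitions.internal.linstor.linbit.com -o json`. That resource name is inferred from the kind shown above and has not been verified.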

After detecting these conflicting resource definitions, I decided to delete them to resolve the issue. However, since there was no straightforward way to delete these entries directly via the LINSTOR controller, I had to search through all LINSTOR CRDs matching the PVC names listed above.

I methodically deleted these records, and after each deletion, I attempted to start the LINSTOR controller again. Each time the controller crashed, the new error report provided additional hints about which CRDs needed to be removed (I created backups of the files in case I deleted the wrong ones). This trial-and-error process was time-consuming, but by following the clues from each error report, I was eventually able to delete all the problematic entries and get the controller running successfully.
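
The search-and-backup step described here can be sketched in Python. The example assumes the CRD objects have already been exported and parsed into dictionaries. It matches the PVC name case-insensitively against every string field in spec, which is a guess at where the name might appear, and the object names obj-a, obj-b, and obj-c are hypothetical.

```python
import json


def objects_mentioning(objects, resource_name):
    """Return the objects whose spec has a string field containing resource_name."""
    needle = resource_name.lower()
    return [
        obj for obj in objects
        if any(isinstance(value, str) and needle in value.lower()
               for value in obj.get("spec", {}).values())
    ]


# Hypothetical export: two objects refer to the PVC, one does not.
objects = [
    {"kind": "LayerDrbdResourceDefinitions", "metadata": {"name": "obj-a"},
     "spec": {"resource_name": "PVC-00A767E6-18BF-477D-8595-CDB9BA606B36", "tcp_port": 7077}},
    {"kind": "PropsContainers", "metadata": {"name": "obj-b"},
     "spec": {"props_instance": "/rscdfns/pvc-00a767e6-18bf-477d-8595-cdb9ba606b36"}},
    {"kind": "LayerDrbdResourceDefinitions", "metadata": {"name": "obj-c"},
     "spec": {"resource_name": "PVC-90F08CA1-DDE8-4A82-9794-D50F411F8712", "tcp_port": 7077}},
]

hits = objects_mentioning(objects, "PVC-00A767E6-18BF-477D-8595-CDB9BA606B36")
backup = json.dumps(hits, indent=2, sort_keys=True)  # write this to a file before deleting anything
print([obj["metadata"]["name"] for obj in hits])
```

Keeping the serialized matches makes it possible to re-create an object if it turns out to have been deleted by mistake.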

Reflecting on the cause of the issue, it seems that the data may have been corrupted somehow. Since I’m using the Kubernetes store, all information is stored within Kubernetes itself. I’m concerned about its reliability compared to a dedicated database like PostgreSQL, which I believe offers more robust transactional support in case of a failure like today’s. Ending up in a state like this shouldn’t be possible, or the controller should at least be able to recover from it.

How can I prevent such issues from occurring in the future?

boedy · August 27, 2024, 10:18pm

One thing I forgot to mention is that, prior to this issue, the control plane of the Kubernetes cluster went down. I believe this affected LINSTOR, as it could not communicate with the K8s API. Having said that, I would expect LINSTOR to be able to withstand such an outage without data corruption. Ideally, any transaction in progress during the outage should be rolled back to prevent issues like this.

Would the outcome have been different if PostgreSQL had been used instead of the Kubernetes store? Specifically, what would happen if PostgreSQL went down during a random outage? Would LINSTOR be more resilient in that scenario, or are there other steps that should be taken to protect against data corruption?

liniac · August 27, 2024, 11:26pm

To confirm, which backend are you currently using for LINSTOR? This can be displayed with:

kubectl exec deploy/linstor-op-cs-controller -- cat /etc/linstor/linstor.toml

A LINSTOR cluster within k8s using an etcd database might run into problems like the ones you’ve encountered, which is why the current guidance is to migrate to using the Kubernetes API directly to persist the cluster state. If that’s what you are already using, the explanation for the controller’s behavior may be different. Still, I see using the k8s API, as opposed to a separate database, as more resilient because of the tighter coupling.

boedy · August 28, 2024, 8:34am

#/etc/linstor/linstor.toml
[db]
  connection_url = "k8s"

We are running LINSTOR using the Piraeus operator on a K3s cluster with MySQL as its backing datastore.

boedy · January 27, 2025, 2:58pm

We’re running into a similar issue again, where the entire LINSTOR cluster is down. The controller seems to be stuck in a boot loop, and from what I can tell, this appears to be caused by a corrupted LINSTOR state.

I’m looking for guidance on:
1. What might be causing this issue?
2. How can the LINSTOR state be restored in such scenarios?
3. How can we prevent this from happening in the future?

Below are two error reports for reference:

ERROR REPORT 67979A9C-00000-000000

============================================================

Application:                        LINBIT® LINSTOR
Module:                             Controller
Version:                            1.29.2
Build ID:                           372c916b7d97fa10e8ea480b66ea3da665ab5849
Build time:                         2024-11-05T11:22:22+00:00
Error time:                         2025-01-27 14:39:46
Node:                               linstor-controller-547876f655-8zspc
Thread:                             Main

============================================================

Reported error:
===============

Category:                           RuntimeException
Class name:                         LinStorDBRuntimeException
Class canonical name:               com.linbit.linstor.LinStorDBRuntimeException
Generated at:                       Method 'loadAll', Source file 'K8sCrdEngine.java', Line #267

Error message:                      Database entry of table LAYER_DRBD_VOLUMES could not be restored.

ErrorContext:   Details:     Primary key: LAYER_RESOURCE_ID = '23232', VLM_NR = '0'


Call backtrace:

    Method                                   Native Class:Line number
    loadAll                                  N      com.linbit.linstor.dbdrivers.k8s.crd.K8sCrdEngine:267
    loadAll                                  N      com.linbit.linstor.dbdrivers.AbsDatabaseDriver:180
    loadAll                                  N      com.linbit.linstor.core.objects.AbsLayerVlmDataDbDriver:96
    loadAllLayerVlmData                      N      com.linbit.linstor.core.objects.AbsLayerRscDataDbDriver:317
    loadLayerObects                          N      com.linbit.linstor.dbdrivers.DatabaseLoader:773
    loadCoreObjects                          N      com.linbit.linstor.dbdrivers.DatabaseLoader:640
    loadCoreObjects                          N      com.linbit.linstor.core.DbDataInitializer:169
    initialize                               N      com.linbit.linstor.core.DbDataInitializer:101
    startSystemServices                      N      com.linbit.linstor.core.ApplicationLifecycleManager:88
    start                                    N      com.linbit.linstor.core.Controller:375
    main                                     N      com.linbit.linstor.core.Controller:627

Caused by:
==========

Category:                           RuntimeException
Class name:                         NullPointerException
Class canonical name:               java.lang.NullPointerException
Generated at:                       Method 'getRscData', Source file 'AbsLayerVlmDataDbDriver.java', Line #66

Error message:                      Cannot read field "rscData" because the return value of "java.util.Map.get(Object)" is null

Call backtrace:

    Method                                   Native Class:Line number
    getRscData                               N      com.linbit.linstor.core.objects.AbsLayerVlmDataDbDriver$VlmParentObjects:66
    load                                     N      com.linbit.linstor.core.objects.LayerDrbdVlmDbDriver:153
    load                                     N      com.linbit.linstor.core.objects.LayerDrbdVlmDbDriver:39
    loadAll                                  N      com.linbit.linstor.dbdrivers.k8s.crd.K8sCrdEngine:238
    loadAll                                  N      com.linbit.linstor.dbdrivers.AbsDatabaseDriver:180
    loadAll                                  N      com.linbit.linstor.core.objects.AbsLayerVlmDataDbDriver:96
    loadAllLayerVlmData                      N      com.linbit.linstor.core.objects.AbsLayerRscDataDbDriver:317
    loadLayerObects                          N      com.linbit.linstor.dbdrivers.DatabaseLoader:773
    loadCoreObjects                          N      com.linbit.linstor.dbdrivers.DatabaseLoader:640
    loadCoreObjects                          N      com.linbit.linstor.core.DbDataInitializer:169
    initialize                               N      com.linbit.linstor.core.DbDataInitializer:101
    startSystemServices                      N      com.linbit.linstor.core.ApplicationLifecycleManager:88
    start                                    N      com.linbit.linstor.core.Controller:375
    main                                     N      com.linbit.linstor.core.Controller:627


END OF ERROR REPORT.

ERROR REPORT 67979A9C-00000-000001

============================================================

Application:                        LINBIT® LINSTOR
Module:                             Controller
Version:                            1.29.2
Build ID:                           372c916b7d97fa10e8ea480b66ea3da665ab5849
Build time:                         2024-11-05T11:22:22+00:00
Error time:                         2025-01-27 14:39:46
Node:                               linstor-controller-547876f655-8zspc
Thread:                             Main

============================================================

Reported error:
===============

Category:                           LinStorException
Class name:                         SystemServiceStartException
Class canonical name:               com.linbit.SystemServiceStartException
Generated at:                       Method 'startSystemServices', Source file 'ApplicationLifecycleManager.java', Line #104

Error message:                      Unhandled exception

ErrorContext:


Call backtrace:

    Method                                   Native Class:Line number
    startSystemServices                      N      com.linbit.linstor.core.ApplicationLifecycleManager:104
    start                                    N      com.linbit.linstor.core.Controller:375
    main                                     N      com.linbit.linstor.core.Controller:627

Caused by:
==========

Category:                           RuntimeException
Class name:                         LinStorDBRuntimeException
Class canonical name:               com.linbit.linstor.LinStorDBRuntimeException
Generated at:                       Method 'loadAll', Source file 'K8sCrdEngine.java', Line #267

Error message:                      Database entry of table LAYER_DRBD_VOLUMES could not be restored.

ErrorContext:   Details:     Primary key: LAYER_RESOURCE_ID = '23232', VLM_NR = '0'


Call backtrace:

    Method                                   Native Class:Line number
    loadAll                                  N      com.linbit.linstor.dbdrivers.k8s.crd.K8sCrdEngine:267
    loadAll                                  N      com.linbit.linstor.dbdrivers.AbsDatabaseDriver:180
    loadAll                                  N      com.linbit.linstor.core.objects.AbsLayerVlmDataDbDriver:96
    loadAllLayerVlmData                      N      com.linbit.linstor.core.objects.AbsLayerRscDataDbDriver:317
    loadLayerObects                          N      com.linbit.linstor.dbdrivers.DatabaseLoader:773
    loadCoreObjects                          N      com.linbit.linstor.dbdrivers.DatabaseLoader:640
    loadCoreObjects                          N      com.linbit.linstor.core.DbDataInitializer:169
    initialize                               N      com.linbit.linstor.core.DbDataInitializer:101
    startSystemServices                      N      com.linbit.linstor.core.ApplicationLifecycleManager:88
    start                                    N      com.linbit.linstor.core.Controller:375
    main                                     N      com.linbit.linstor.core.Controller:627

Caused by:
==========

Category:                           RuntimeException
Class name:                         NullPointerException
Class canonical name:               java.lang.NullPointerException
Generated at:                       Method 'getRscData', Source file 'AbsLayerVlmDataDbDriver.java', Line #66

Error message:                      Cannot read field "rscData" because the return value of "java.util.Map.get(Object)" is null

Call backtrace:

    Method                                   Native Class:Line number
    getRscData                               N      com.linbit.linstor.core.objects.AbsLayerVlmDataDbDriver$VlmParentObjects:66
    load                                     N      com.linbit.linstor.core.objects.LayerDrbdVlmDbDriver:153
    load                                     N      com.linbit.linstor.core.objects.LayerDrbdVlmDbDriver:39
    loadAll                                  N      com.linbit.linstor.dbdrivers.k8s.crd.K8sCrdEngine:238
    loadAll                                  N      com.linbit.linstor.dbdrivers.AbsDatabaseDriver:180
    loadAll                                  N      com.linbit.linstor.core.objects.AbsLayerVlmDataDbDriver:96
    loadAllLayerVlmData                      N      com.linbit.linstor.core.objects.AbsLayerRscDataDbDriver:317
    loadLayerObects                          N      com.linbit.linstor.dbdrivers.DatabaseLoader:773
    loadCoreObjects                          N      com.linbit.linstor.dbdrivers.DatabaseLoader:640
    loadCoreObjects                          N      com.linbit.linstor.core.DbDataInitializer:169
    initialize                               N      com.linbit.linstor.core.DbDataInitializer:101
    startSystemServices                      N      com.linbit.linstor.core.ApplicationLifecycleManager:88
    start                                    N      com.linbit.linstor.core.Controller:375
    main                                     N      com.linbit.linstor.core.Controller:627


END OF ERROR REPORT.
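
Both reports point at a LAYER_DRBD_VOLUMES entry (LAYER_RESOURCE_ID 23232, VLM_NR 0) whose parent lookup returned null, which reads like a volume record left behind without its resource record. A minimal Python sketch of a check for such orphans follows. The spec field names layer_resource_id and vlm_nr are assumptions derived from the column names in the report, which table holds the parent is also an assumption, and the sample data is invented apart from the ID 23232.

```python
def orphaned_volumes(resources, volumes):
    """Return (layer_resource_id, vlm_nr) pairs whose parent resource record is missing."""
    known_ids = {res["spec"]["layer_resource_id"] for res in resources}
    return sorted(
        (vol["spec"]["layer_resource_id"], vol["spec"]["vlm_nr"])
        for vol in volumes
        if vol["spec"]["layer_resource_id"] not in known_ids
    )


# Invented sample: resource 23231 exists, resource 23232 does not.
resources = [{"spec": {"layer_resource_id": 23231}}]
volumes = [
    {"spec": {"layer_resource_id": 23231, "vlm_nr": 0}},
    {"spec": {"layer_resource_id": 23232, "vlm_nr": 0}},
]

print(orphaned_volumes(resources, volumes))
```

Running a check like this over an export of the stored objects would list all dangling volume records at once, instead of revealing one per controller restart.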

boedy · January 29, 2025, 1:51pm

I tried fixing the database by deleting the records it was complaining about, but now I’m no longer getting any hints on what to do next:

ERROR REPORT 6798A7C6-00000-000001

============================================================

Application:                        LINBIT® LINSTOR
Module:                             Controller
Version:                            1.29.2
Build ID:                           372c916b7d97fa10e8ea480b66ea3da665ab5849
Build time:                         2024-11-05T11:22:22+00:00
Error time:                         2025-01-28 09:48:12
Node:                               linstor-controller-547876f655-zjgkf
Thread:                             Main

============================================================

Reported error:
===============

Category:                           LinStorException
Class name:                         SystemServiceStartException
Class canonical name:               com.linbit.SystemServiceStartException
Generated at:                       Method 'startSystemServices', Source file 'ApplicationLifecycleManager.java', Line #104

Error message:                      Unhandled exception

ErrorContext:


Call backtrace:

    Method                                   Native Class:Line number
    startSystemServices                      N      com.linbit.linstor.core.ApplicationLifecycleManager:104
    start                                    N      com.linbit.linstor.core.Controller:375
    main                                     N      com.linbit.linstor.core.Controller:627

Caused by:
==========

Category:                           RuntimeException
Class name:                         NullPointerException
Class canonical name:               java.lang.NullPointerException
Generated at:                       Method 'allocateAfterDbLoad', Source file 'ExosMappingManager.java', Line #88

Error message:                      Cannot invoke "com.linbit.linstor.core.objects.StorPool.getDeviceProviderKind()" because "storPool" is null

Call backtrace:

    Method                                   Native Class:Line number
    allocateAfterDbLoad                      N      com.linbit.linstor.storage.utils.ExosMappingManager:88
    loadCoreObjects                          N      com.linbit.linstor.dbdrivers.DatabaseLoader:666
    loadCoreObjects                          N      com.linbit.linstor.core.DbDataInitializer:169
    initialize                               N      com.linbit.linstor.core.DbDataInitializer:101
    startSystemServices                      N      com.linbit.linstor.core.ApplicationLifecycleManager:88
    start                                    N      com.linbit.linstor.core.Controller:375
    main                                     N      com.linbit.linstor.core.Controller:627


END OF ERROR REPORT.

liniac · January 29, 2025, 10:48pm

There isn’t much you can do in this case besides restoring the database from backup. It appears that LINSTOR is looking for a reference to a storage pool that is not specified within these records.

I see you’ve also commented on the GitHub issue here, so I’ll link it, as it provides some relevant discussion that might help future forum-goers:

https://github.com/LINBIT/linstor-server/issues/433

I won’t rehash what’s already stated in the linked GitHub issue, but I will add this: if you have any insight into the circumstances that lead to this state in your environment, and that might translate into steps to reproduce it, that information would be welcome and may get us closer to a fix. As it stands, the errors shared suggest to me only that the database is corrupted, but not why. You mention that the earlier failure in this thread was preceded by the control plane going down. Is that consistent in your experience? Were there any other factors, such as what you were doing with the cluster or LINSTOR at the time?

For the sake of completeness, here is the Piraeus project documentation on restoring from backup:

https://github.com/piraeusdatastore/piraeus-operator/blob/v2/docs/how-to/restore-linstor-db.md
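
A restore only helps if a reasonably recent backup exists. As a rough illustration, the Python sketch below dumps the objects of every CRD in the internal.linstor.linbit.com API group to one YAML file per CRD. It is not the procedure from the linked documentation. It assumes kubectl is configured for the cluster and that the CRDs are cluster-scoped, and the backup function and output layout are hypothetical. For a consistent dump it is probably safest to run it while the controller is stopped.

```python
import subprocess
from pathlib import Path

API_GROUP = "internal.linstor.linbit.com"


def linstor_crds(crd_names):
    """Keep only the CRD names that belong to the LINSTOR internal API group."""
    return sorted(name for name in crd_names if name.endswith("." + API_GROUP))


def backup(out_dir):
    """Write the objects of each LINSTOR CRD to <out_dir>/<crd>.yaml (needs kubectl access)."""
    listing = subprocess.run(
        ["kubectl", "get", "crds", "-o", "name"],
        check=True, capture_output=True, text=True,
    ).stdout.split()
    # `-o name` prints entries of the form <resource-type>/<crd-name>.
    crds = linstor_crds([entry.split("/", 1)[-1] for entry in listing])
    out_dir.mkdir(parents=True, exist_ok=True)
    for crd in crds:
        dump = subprocess.run(
            ["kubectl", "get", crd, "-o", "yaml"],
            check=True, capture_output=True, text=True,
        ).stdout
        (out_dir / f"{crd}.yaml").write_text(dump)


# The filter can be tried without a cluster:
print(linstor_crds([
    "propscontainers.internal.linstor.linbit.com",
    "volumesnapshotcontents.snapshot.storage.k8s.io",
]))
```

Calling `backup(Path("linstor-backup"))` from a scheduled job would keep a recent copy independent of upgrades.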

boedy · January 30, 2025, 12:51pm

Thanks for your response, @liniac.

Unfortunately, the last backup was made over a year ago, which suggests that backups are only created during upgrades. Reverting to that backup would result in significant data loss, which I’m not ready to accept. Instead, I’m attempting to delete all LINSTOR CRDs that were created x days before the incident, hoping this will allow me to restore a bootable state.

I don’t have concrete evidence, but I suspect this issue occurs when the control plane is either down or overloaded. It seems that certain writes to the Kubernetes API fail, leading to state corruption. Another possible explanation is that as the number of resources managed by the controller increases, timeouts occur when fetching resources. I’ve documented and reported one such occurrence here, which also references this forum post:

github.com/piraeusdatastore/linstor-csi

Snapshots not deleted properly, causing orphaned snapshots and eventual system overload

opened 12:43PM - 04 Sep 24 UTC

boedy

We’ve encountered a persistent issue where snapshots are not being properly deleted from the LINSTOR system, resulting in a large number of orphaned snapshots that are putting significant strain on our Kubernetes cluster. This problem has caused severe performance degradation and may have contributed to recent crashes in our LINSTOR controller.

Last week, our cluster went down, likely due to this issue. When the cluster came back online, the LINSTOR controller was unable to start as the datastore seemed to have been corrupted. This issue has persisted across multiple controller restarts. I initially reported this on the [LINSTOR forum](https://forums.linbit.com/t/linstor-controller-just-went-down/269/6), where I also outlined the steps I took to get the controller running again.

**Context**

We are creating hourly snapshots via Velero which are retained for 7 days. However, many snapshots are not being deleted correctly from LINSTOR, leading to a significant buildup of orphaned snapshots. Despite using a VolumeSnapshotClass with the deletion policy set to Delete, these snapshots remain in the LINSTOR system even after the corresponding VolumeSnapshotContent and PVC objects are deleted in Kubernetes.

Over time, a large number of snapshots (approximately 2500+) accumulated in the LINSTOR system, though the corresponding PVCs and VolumeSnapshotContent objects no longer existed. Upon investigation, I found that our cluster had over 30,000 PropsContainer records related to these orphaned snapshots, which made operations slow and timeouts more frequent. This likely contributed to LINSTOR controller crashes and resource corruption. Running the command kubectl get propscontainers.internal.linstor.linbit.com | wc -l took more than 40 seconds to complete.

I eventually used a script to manually clean up the orphaned snapshots, which reduced the PropsContainer records to around 838. However, the root cause of the snapshot deletion failure persists.

One week later today, the issue has led to the following current state:

```
velero backup get | wc -l --> 28
linstor snapshot list | wc -l --> 912
k get propscontainers.internal.linstor.linbit.com | wc -l --> 12235
k get volumesnapshotcontent | wc -l --> 133
```

**Context**

- Velero version: 1.13.1
- LINSTOR CSI driver version: v1.6.3-24ffba67ea151a0276bb418e65fd795b91779428
- Piraeus v1.28.0

```yaml
apiVersion: snapshot.storage.k8s.io/v1
deletionPolicy: Delete
driver: linstor.csi.linbit.com
kind: VolumeSnapshotClass
metadata:
  annotations:
    snapshot.storage.kubernetes.io/is-default-class: "true"
  name: default
```

**linstor-csi-controller logs and snapshot of resources and snapshots**

Unfortunately the LINSTOR controller restarted, which prevents me from fetching the error reports listed in the logs.

```
time="2024-09-02T09:13:19Z" level=info msg="deleting volume" linstorCSIComponent=client volume=pvc-23f48d11-f801-450e-af00-bc8a3c3174b1
time="2024-09-02T09:13:19Z" level=error msg="method failed" error="rpc error: code = Internal desc = failed to delete volume: Message: 'Node: h-fsn-ded1, Resource: pvc-23f48d11-f801-450e-af00-bc8a3c3174b1 preparing for deletion.'; Details: 'Node: h-fsn-ded1, Resource: pvc-23f48d11-f801-450e-af00-bc8a3c3174b1 UUID is: abf8bedc-375c-41d4-833d-10f33c534e25' next error: Message: 'Preparing deletion of resource on 'h-fsn-ded1'' next error: Message: '(Node: 'h-fsn-ded4') Failed to create meta-data for DRBD volume pvc-23f48d11-f801-450e-af00-bc8a3c3174b1/0'; Reports: '[66B38BF5-0194E-001818]' next error: Message: 'Deletion of resource 'pvc-23f48d11-f801-450e-af00-bc8a3c3174b1' on node 'h-fsn-ded1' failed due to an unknown exception.'; Details: 'Node: h-fsn-ded1, Resource: pvc-23f48d11-f801-450e-af00-bc8a3c3174b1'; Reports: '[66CE38AA-00000-009040]'" linstorCSIComponent=driver method=/csi.v1.Controller/DeleteVolume nodeID= provisioner=linstor.csi.linbit.com req="volume_id:\"pvc-23f48d11-f801-450e-af00-bc8a3c3174b1\" " resp="<nil>" version=v1.6.3-24ffba67ea151a0276bb418e65fd795b91779428
time="2024-09-02T09:13:38Z" level=error msg="method failed" error="rpc error: code = Internal desc = failed to delete temporary snapshot ID: Message: 'Exception thrown.'; Details: 'com.linbit.linstor.transaction.TransactionException: Error creating rollback entry'; Reports: '[66CE38AA-00000-009041]'" linstorCSIComponent=driver method=/csi.v1.Controller/CreateSnapshot nodeID= provisioner=linstor.csi.linbit.com req="source_volume_id:\"pvc-6608a2a1-0d6a-4548-95ad-ee02facd1a88\" name:\"snapshot-e3432c8d-cfd0-4a5d-9546-1c9d21cf628e\" " resp="<nil>" version=v1.6.3-24ffba67ea151a0276bb418e65fd795b91779428
time="2024-09-02T09:15:04Z" level=error msg="method failed" error="rpc error: code = Internal desc = failed to delete temporary snapshot ID: Message: 'Exception thrown.'; Details: 'com.linbit.linstor.transaction.TransactionException: Error creating rollback entry'; Reports: '[66CE38AA-00000-009042]'" linstorCSIComponent=driver method=/csi.v1.Controller/CreateSnapshot nodeID= provisioner=linstor.csi.linbit.com req="source_volume_id:\"pvc-6608a2a1-0d6a-4548-95ad-ee02facd1a88\" name:\"snapshot-e3432c8d-cfd0-4a5d-9546-1c9d21cf628e\" " resp="<nil>" version=v1.6.3-24ffba67ea151a0276bb418e65fd795b91779428
time="2024-09-02T09:15:06Z" level=error msg="method failed" error="rpc error: code = Internal desc = failed to delete temporary snapshot ID: Message: 'Exception thrown.'; Details: 'com.linbit.linstor.transaction.TransactionException: Error creating rollback entry'; Reports: '[66CE38AA-00000-009043]'" linstorCSIComponent=driver method=/csi.v1.Controller/CreateSnapshot nodeID= provisioner=linstor.csi.linbit.com req="source_volume_id:\"pvc-9e20d899-1b94-46fb-80bc-f7c0df1801ea\" name:\"snapshot-ea77054e-1912-4846-af23-221edca35b78\" " resp="<nil>" version=v1.6.3-24ffba67ea151a0276bb418e65fd795b91779428
time="2024-09-02T09:15:07Z" level=error msg="method failed" error="rpc error: code = Internal desc = failed to delete temporary snapshot ID: Message: 'Exception thrown.'; Details: 'com.linbit.linstor.transaction.TransactionException: Error creating rollback entry'; Reports: '[66CE38AA-00000-009044]'" linstorCSIComponent=driver method=/csi.v1.Controller/CreateSnapshot nodeID= provisioner=linstor.csi.linbit.com req="source_volume_id:\"pvc-9e20d899-1b94-46fb-80bc-f7c0df1801ea\" name:\"snapshot-ea77054e-1912-4846-af23-221edca35b78\" " resp="<nil>" version=v1.6.3-24ffba67ea151a0276bb418e65fd795b91779428
time="2024-09-02T09:15:09Z" level=error msg="method failed" error="rpc error: code = Internal desc = failed to delete temporary snapshot ID: Message: 'Exception thrown.'; Details: 'com.linbit.linstor.transaction.TransactionException: Error creating rollback entry'; Reports: '[66CE38AA-00000-009045]'" linstorCSIComponent=driver method=/csi.v1.Controller/CreateSnapshot nodeID= provisioner=linstor.csi.linbit.com req="source_volume_id:\"pvc-9e20d899-1b94-46fb-80bc-f7c0df1801ea\" name:\"snapshot-ea77054e-1912-4846-af23-221edca35b78\" " resp="<nil>" version=v1.6.3-24ffba67ea151a0276bb418e65fd795b91779428
time="2024-09-02T09:15:13Z" level=error msg="method failed" error="rpc error: code = Internal desc = failed to delete temporary snapshot ID: Message: 'Exception thrown.'; Details: 'com.linbit.linstor.transaction.TransactionException: Error creating rollback entry'; Reports: '[66CE38AA-00000-009046]'" linstorCSIComponent=driver method=/csi.v1.Controller/CreateSnapshot nodeID= provisioner=linstor.csi.linbit.com req="source_volume_id:\"pvc-9e20d899-1b94-46fb-80bc-f7c0df1801ea\" name:\"snapshot-ea77054e-1912-4846-af23-221edca35b78\" " resp="<nil>" version=v1.6.3-24ffba67ea151a0276bb418e65fd795b91779428
```

[linstor-csi.log](https://github.com/user-attachments/files/16870345/linstor-csi.log) [resources.txt](https://github.com/user-attachments/files/16870347/resources.txt) [snapshots.txt](https://github.com/user-attachments/files/16870349/snapshots.txt)
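
The gap between the LINSTOR snapshot count and the VolumeSnapshotContent count in the quoted issue can be turned into a concrete list of candidates with a set difference. This is a minimal Python sketch, not the cleanup script mentioned in the issue. It assumes that a LINSTOR snapshot name equals the status.snapshotHandle of its VolumeSnapshotContent, which has not been verified here. The two snapshot names are taken from the log excerpt, and the pairing is invented.

```python
def snapshot_handles(contents):
    """Collect status.snapshotHandle from VolumeSnapshotContent objects that have one."""
    return {
        content["status"]["snapshotHandle"]
        for content in contents
        if content.get("status", {}).get("snapshotHandle")
    }


def orphaned_snapshots(linstor_names, contents):
    """Return LINSTOR snapshot names that no VolumeSnapshotContent refers to."""
    return sorted(set(linstor_names) - snapshot_handles(contents))


linstor_names = [
    "snapshot-e3432c8d-cfd0-4a5d-9546-1c9d21cf628e",
    "snapshot-ea77054e-1912-4846-af23-221edca35b78",
]
contents = [
    {"status": {"snapshotHandle": "snapshot-ea77054e-1912-4846-af23-221edca35b78"}},
    {"metadata": {"name": "not-ready-yet"}},  # no status yet, so it contributes no handle
]

print(orphaned_snapshots(linstor_names, contents))
```

The result is only a list of candidates. Snapshots created outside the CSI driver would also show up, so it should be reviewed before anything is deleted.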

boedy · January 30, 2025, 2:18pm

To my surprise, the deletion strategy actually worked! The LINSTOR controller is now able to boot again, and the cluster is back online. Below I’ll share more context and the steps I took to resolve the issue.

Context
We run LINSTOR in a K3s cluster, backed by an external MySQL database that stores all Kubernetes state data. While we hadn’t recently deployed new workloads, I noticed a significant number of manifests created in the last 43 days. Upon inspection, most of them were snapshot-related resources.
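For anyone who wants to repeat this kind of inspection, here is a minimal sketch of tallying stored keys by resource type. It is not the exact procedure used above. It assumes keys follow a `/registry/<group>/<kind>/<name>` layout, and the function name `count_by_kind`, the kind names, and the sample keys are all hypothetical.

```python
from collections import Counter

# API group taken from the key prefix used in this thread.
GROUP = "internal.linstor.linbit.com"


def count_by_kind(names, group=GROUP):
    """Count keys per resource kind for one API group.

    Assumes keys look like /registry/<group>/<kind>/<name>; keys from
    other groups are ignored.
    """
    marker = f"/registry/{group}/"
    counts = Counter()
    for name in names:
        _, found, rest = name.partition(marker)
        if found:
            # The first path segment after the group is treated as the kind.
            counts[rest.split("/", 1)[0]] += 1
    return counts


# Hypothetical sample keys; real kind names in a cluster may differ.
sample_names = [
    "/registry/internal.linstor.linbit.com/snapshots/s1",
    "/registry/internal.linstor.linbit.com/snapshots/s2",
    "/registry/internal.linstor.linbit.com/snapshots/s3",
    "/registry/internal.linstor.linbit.com/resources/r1",
    "/registry/pods/default/web-0",
]

counts = count_by_kind(sample_names)
for kind, n in counts.most_common():
    print(kind, n)
```

In practice the `names` list would come from the `name` column of the `kine` table, ideally read from a database dump rather than the live datastore.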

We use Velero for automated backups, and I suspected that the accumulation of these snapshots was the leading cause of the database corruption. That would make it the same issue I reported in GitHub Issue #290.

Resolution
To restore the cluster to a functional state, I used a MySQL query to identify the LINSTOR-related resources created within the last 42 days, backed up those records, deleted them, and restarted the LINSTOR controller. This restored the cluster’s functionality.

Here’s the SQL query I used:

DELETE FROM `defaultdb`.`kine`
WHERE (
  `value` LIKE '%"creationTimestamp":"2025-01-%'
  OR `value` LIKE '%"creationTimestamp":"2024-12-2%'
  OR `value` LIKE '%"creationTimestamp":"2024-12-19%'
  OR `value` LIKE '%"creationTimestamp":"2024-12-18%'
  OR `value` LIKE '%"creationTimestamp":"2024-12-17%'
) 
AND (`name` LIKE '%/registry/internal.linstor.linbit.com%');

This action removed approximately 62% of all stored Kubernetes data: 63,016 of 101,327 rows. After this cleanup and a restart, the LINSTOR controller booted successfully.
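The back-up-then-delete sequence can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually run against production. It uses an in-memory SQLite database as a stand-in for MySQL, a simplified `kine` table with only the `name` and `value` columns, and hypothetical sample rows. The `kine_backup` table and the `backup_and_delete` helper are made-up names.

```python
import sqlite3

# Patterns copied from the query above.
KEY_PATTERN = "%/registry/internal.linstor.linbit.com%"
DATE_PATTERNS = [
    '%"creationTimestamp":"2025-01-%',
    '%"creationTimestamp":"2024-12-2%',
    '%"creationTimestamp":"2024-12-19%',
    '%"creationTimestamp":"2024-12-18%',
    '%"creationTimestamp":"2024-12-17%',
]


def backup_and_delete(conn, date_patterns, key_pattern=KEY_PATTERN):
    """Copy matching rows into kine_backup, then delete them from kine."""
    date_clause = " OR ".join(["value LIKE ?"] * len(date_patterns))
    where = f"({date_clause}) AND name LIKE ?"
    params = [*date_patterns, key_pattern]
    total = conn.execute("SELECT COUNT(*) FROM kine").fetchone()[0]
    with conn:  # commits on success, rolls back if anything raises
        # Empty copy of the table layout (hypothetical backup table name).
        conn.execute(
            "CREATE TABLE IF NOT EXISTS kine_backup AS SELECT * FROM kine WHERE 0"
        )
        conn.execute(
            f"INSERT INTO kine_backup SELECT * FROM kine WHERE {where}", params
        )
        removed = conn.execute(f"DELETE FROM kine WHERE {where}", params).rowcount
    return removed, total


# Simplified stand-in table with hypothetical rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kine (id INTEGER PRIMARY KEY, name TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO kine (name, value) VALUES (?, ?)",
    [
        ("/registry/internal.linstor.linbit.com/resources/a",
         '{"metadata":{"creationTimestamp":"2025-01-10T08:00:00Z"}}'),
        ("/registry/internal.linstor.linbit.com/resources/b",
         '{"metadata":{"creationTimestamp":"2024-11-02T08:00:00Z"}}'),
        ("/registry/pods/default/c",
         '{"metadata":{"creationTimestamp":"2025-01-11T08:00:00Z"}}'),
        ("/registry/internal.linstor.linbit.com/resources/d",
         '{"metadata":{"creationTimestamp":"2024-12-18T08:00:00Z"}}'),
    ],
)
conn.commit()

removed, total = backup_and_delete(conn, DATE_PATTERNS)
print(f"removed {removed} of {total} rows ({removed / total:.1%})")
```

Running the copy and the delete in one transaction means a failure in either step leaves the table unchanged. The same percentage calculation applied to the real figures gives 63,016 / 101,327, or about 62.2%.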

Observations
Based on this, I believe the sheer number of snapshot-related resources contributed to the instability of the LINSTOR cluster, eventually leading to corruption. As the total number of resources grows, the Kubernetes API becomes more sluggish, increasing timeouts and making the system unresponsive. This aligns with a previous issue I reported regarding snapshot cleanup.

For now, I’m considering disabling the current backup strategy so this doesn’t happen again while the underlying problem remains unresolved. I’ll also need to evaluate alternative backup approaches or adjust retention policies to avoid overwhelming the cluster.
