Linstor-controller just went down

Hello I hope someone can help me :pray:

Our LINSTOR production cluster just went down and I’m not able to get it back up. The controller is in a crash loop, emitting this error report every time:

ERROR REPORT 66CE118B-00000-000000

============================================================

Application:                        LINBIT® LINSTOR
Module:                             Controller
Version:                            1.28.0
Build ID:                           959382f7b4fb9436fefdd21dfa262e90318edaed
Build time:                         2024-07-11T10:21:06+00:00
Error time:                         2024-08-27 17:49:13
Node:                               linstor-controller-7b9c4ccd45-xk4lf
Thread:                             Main

============================================================

Reported error:
===============

Category:                           Error
Class name:                         ImplementationError
Class canonical name:               com.linbit.ImplementationError
Generated at:                       Method 'loadCoreObjects', Source file 'DatabaseLoader.java', Line #680

Error message:                      Unknown error during loading data from DB

Call backtrace:

    Method                                   Native Class:Line number
    loadCoreObjects                          N      com.linbit.linstor.dbdrivers.DatabaseLoader:680
    loadCoreObjects                          N      com.linbit.linstor.core.DbDataInitializer:169
    initialize                               N      com.linbit.linstor.core.DbDataInitializer:101
    startSystemServices                      N      com.linbit.linstor.core.ApplicationLifecycleManager:87
    start                                    N      com.linbit.linstor.core.Controller:374
    main                                     N      com.linbit.linstor.core.Controller:625

Caused by:
==========

Category:                           LinStorException
Class name:                         DatabaseException
Class canonical name:               com.linbit.linstor.dbdrivers.DatabaseException
Generated at:                       Method 'loadAll', Source file 'AbsDatabaseDriver.java', Line #190

Error message:                      Failed to restore data

ErrorContext:


Call backtrace:

    Method                                   Native Class:Line number
    loadAll                                  N      com.linbit.linstor.dbdrivers.AbsDatabaseDriver:190
    cacheAll                                 N      com.linbit.linstor.core.objects.AbsLayerRscDfnDataDbDriver:55
    cacheAll                                 N      com.linbit.linstor.core.objects.AbsLayerRscDataDbDriver:261
    loadLayerObects                          N      com.linbit.linstor.dbdrivers.DatabaseLoader:728
    loadCoreObjects                          N      com.linbit.linstor.dbdrivers.DatabaseLoader:640
    loadCoreObjects                          N      com.linbit.linstor.core.DbDataInitializer:169
    initialize                               N      com.linbit.linstor.core.DbDataInitializer:101
    startSystemServices                      N      com.linbit.linstor.core.ApplicationLifecycleManager:87
    start                                    N      com.linbit.linstor.core.Controller:374
    main                                     N      com.linbit.linstor.core.Controller:625

Caused by:
==========

Category:                           Exception
Class name:                         ValueInUseException
Class canonical name:               com.linbit.ValueInUseException
Generated at:                       Method 'allocate', Source file 'DynamicNumberPoolImpl.java', Line #124

Error message:                      TCP port 7077 is already in use

Call backtrace:

    Method                                   Native Class:Line number
    allocate                                 N      com.linbit.linstor.numberpool.DynamicNumberPoolImpl:124
    <init>                                   N      com.linbit.linstor.storage.data.adapter.drbd.DrbdRscDfnData:110
    genericCreate                            N      com.linbit.linstor.core.objects.LayerDrbdRscDfnDbDriver:268
    load                                     N      com.linbit.linstor.core.objects.LayerDrbdRscDfnDbDriver:231
    load                                     N      com.linbit.linstor.core.objects.LayerDrbdRscDfnDbDriver:49
    loadAll                                  N      com.linbit.linstor.dbdrivers.k8s.crd.K8sCrdEngine:238
    loadAll                                  N      com.linbit.linstor.dbdrivers.AbsDatabaseDriver:180
    cacheAll                                 N      com.linbit.linstor.core.objects.AbsLayerRscDfnDataDbDriver:55
    cacheAll                                 N      com.linbit.linstor.core.objects.AbsLayerRscDataDbDriver:261
    loadLayerObects                          N      com.linbit.linstor.dbdrivers.DatabaseLoader:728
    loadCoreObjects                          N      com.linbit.linstor.dbdrivers.DatabaseLoader:640
    loadCoreObjects                          N      com.linbit.linstor.core.DbDataInitializer:169
    initialize                               N      com.linbit.linstor.core.DbDataInitializer:101
    startSystemServices                      N      com.linbit.linstor.core.ApplicationLifecycleManager:87
    start                                    N      com.linbit.linstor.core.Controller:374
    main                                     N      com.linbit.linstor.core.Controller:625


END OF ERROR REPORT.

LINSTOR’s complaint appears to be that it cannot allocate TCP port 7077 because the port is already in use. By default, LINSTOR allocates resource ports from the range 7000-7999; if it attempts to assign a port that something else is already using, the allocation fails with an error.

Is that the case in your situation?

If so, you can change TcpPortAutoRange to a different range that works better for your stack. You mention that the controller isn’t staying up, though. Can you describe the behavior you are seeing in more detail? Is this a LINSTOR in Kubernetes deployment? If not, what does your environment look like (number of controllers/satellites)?

Thanks for your prompt reply.

Fortunately, I was able to restore the LINSTOR controllers within 2 hours of the outage.

Upon further investigation, I found that there were multiple DRBD resource definitions attempting to use the same TCP port (7077), which led to the controller crash. Specifically, the following resource definitions were involved:

{
   "apiVersion":"internal.linstor.linbit.com/v1-27-1",
   "kind":"LayerDrbdResourceDefinitions",
   "metadata":{
      "creationTimestamp":"2024-08-27T16:48:18Z",
      "generation":1,
      "name":"c1c9d57513a2ed77a149efe7aa74b2a0155954b5c0c0da0cc0e82d6a1d6e7126",
      "uid":"a9cee388-eb77-4b0a-b421-7861ff475eb9"
   },
   "spec":{
      "al_stripe_size":32,
      "al_stripes":1,
      "peer_slots":7,
      "resource_name":"PVC-00A767E6-18BF-477D-8595-CDB9BA606B36",
      "resource_name_suffix":"",
      "secret":"yui4uCFI4Ke4wD8QIpvI",
      "snapshot_name":"",
      "tcp_port":7077,
      "transport_type":"IP"
   }
}
{
   "apiVersion":"internal.linstor.linbit.com/v1-27-1",
   "kind":"LayerDrbdResourceDefinitions",
   "metadata":{
      "creationTimestamp":"2024-08-27T16:47:37Z",
      "generation":1,
      "name":"54a30c138618a3fe355253cae97291b0d4457f365b9754278759789bd630518c",
      "uid":"6385ca3a-8e4f-46c3-b285-34332e3779fb"
   },
   "spec":{
      "al_stripe_size":32,
      "al_stripes":1,
      "peer_slots":7,
      "resource_name":"PVC-90F08CA1-DDE8-4A82-9794-D50F411F8712",
      "resource_name_suffix":"",
      "secret":"63zoFVtp1aZkLHGNVVLv",
      "snapshot_name":"",
      "tcp_port":7077,
      "transport_type":"IP"
   }
}

After detecting these conflicting resource definitions, I decided to delete them to resolve the issue. However, since there was no straightforward way to delete these entries via the LINSTOR controller, I had to search through all LINSTOR CRDs for entries matching the PVC names listed above.

I methodically deleted these records, attempting to start the LINSTOR controller again after each deletion. Each time the controller crashed, the new error report provided additional hints about which CRDs needed to be removed. (I created a backup of each file first, in case I deleted the wrong one.) This trial-and-error process was time-consuming, but by following the clues from each error report, I was eventually able to delete all the problematic entries and get the controller running successfully.
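The duplicate-port condition described above can be checked for in one pass instead of being discovered one crash at a time. The following is a minimal sketch, assuming the LayerDrbdResourceDefinitions objects have already been exported and parsed into a list of dictionaries shaped like the two entries shown earlier. The function name `find_duplicate_tcp_ports` and the third sample entry are hypothetical, and skipping entries with a non-empty `snapshot_name` is an assumption rather than documented LINSTOR behavior.

```python
from collections import defaultdict


def find_duplicate_tcp_ports(definitions):
    """Return {tcp_port: [resource_name, ...]} for ports claimed more than once."""
    by_port = defaultdict(list)
    for item in definitions:
        spec = item.get("spec", {})
        # Assumption: only non-snapshot entries are compared with each other.
        if spec.get("snapshot_name"):
            continue
        port = spec.get("tcp_port")
        if port is not None:
            by_port[port].append(spec.get("resource_name"))
    # Keep only the ports that appear in more than one definition.
    return {port: names for port, names in by_port.items() if len(names) > 1}


# The first two entries mirror the objects from this thread; the third is made up.
sample = [
    {"spec": {"resource_name": "PVC-00A767E6-18BF-477D-8595-CDB9BA606B36",
              "snapshot_name": "", "tcp_port": 7077}},
    {"spec": {"resource_name": "PVC-90F08CA1-DDE8-4A82-9794-D50F411F8712",
              "snapshot_name": "", "tcp_port": 7077}},
    {"spec": {"resource_name": "PVC-EXAMPLE",
              "snapshot_name": "", "tcp_port": 7078}},
]

duplicates = find_duplicate_tcp_ports(sample)
print(duplicates)
```

Running a check like this against a full export before restarting the controller would list every conflicting pair at once, which may shorten the delete-and-retry loop.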

Reflecting on the cause of the issue, it seems that the data may have been corrupted somehow. Since I’m using the Kubernetes store, all information is stored within Kubernetes itself. I’m concerned about its reliability compared to using a dedicated database like PostgreSQL, which I believe offers more robust transactional support in the event of a failure like today’s. Ending up in a state like this shouldn’t be possible, or the controller should at least be able to recover from it.

How can I prevent such issues from occurring in the future?

One thing I forgot to mention is that, prior to this issue, the control plane of the Kubernetes cluster went down. I believe this affected LINSTOR as it could not communicate with the K8S API. Having said that, I would expect LINSTOR to be able to resist such an outage without resulting in data corruption. Ideally, any transaction in progress during the outage should be rolled back to prevent issues like this.

Would the outcome have been different if PostgreSQL had been used instead of the Kubernetes store? Specifically, what would happen if PostgreSQL went down during a random outage? Would LINSTOR be more resilient in that scenario, or are there other steps that should be taken to protect against data corruption?

To confirm, which backend are you currently using for LINSTOR? You can display it with:

kubectl exec deploy/linstor-op-cs-controller -- cat /etc/linstor/linstor.toml

A LINSTOR cluster within k8s using an etcd database might run into problems like the ones you’ve encountered, which is why the current guidance is to migrate to using the Kubernetes API directly to persist the cluster state. If that’s what you are already using, the explanation for the controller’s behavior may be different. Still, I see using the k8s API as more resilient than a separate database because of the tighter coupling.

#/etc/linstor/linstor.toml
[db]
  connection_url = "k8s"

We are running LINSTOR using the Piraeus operator on a K3s cluster that uses MySQL as its backing datastore.

We’re running into a similar issue again, where the entire LINSTOR cluster is down. The controller seems to be stuck in a boot loop, and from what I can tell, this appears to be caused by a corrupted LINSTOR state.

I’m looking for guidance on:
1. What might be causing this issue?
2. How can the LINSTOR state be restored in such scenarios?
3. How can we prevent this from happening in the future?

Below are two error reports for reference:

ERROR REPORT 67979A9C-00000-000000

============================================================

Application:                        LINBIT® LINSTOR
Module:                             Controller
Version:                            1.29.2
Build ID:                           372c916b7d97fa10e8ea480b66ea3da665ab5849
Build time:                         2024-11-05T11:22:22+00:00
Error time:                         2025-01-27 14:39:46
Node:                               linstor-controller-547876f655-8zspc
Thread:                             Main

============================================================

Reported error:
===============

Category:                           RuntimeException
Class name:                         LinStorDBRuntimeException
Class canonical name:               com.linbit.linstor.LinStorDBRuntimeException
Generated at:                       Method 'loadAll', Source file 'K8sCrdEngine.java', Line #267

Error message:                      Database entry of table LAYER_DRBD_VOLUMES could not be restored.

ErrorContext:   Details:     Primary key: LAYER_RESOURCE_ID = '23232', VLM_NR = '0'


Call backtrace:

    Method                                   Native Class:Line number
    loadAll                                  N      com.linbit.linstor.dbdrivers.k8s.crd.K8sCrdEngine:267
    loadAll                                  N      com.linbit.linstor.dbdrivers.AbsDatabaseDriver:180
    loadAll                                  N      com.linbit.linstor.core.objects.AbsLayerVlmDataDbDriver:96
    loadAllLayerVlmData                      N      com.linbit.linstor.core.objects.AbsLayerRscDataDbDriver:317
    loadLayerObects                          N      com.linbit.linstor.dbdrivers.DatabaseLoader:773
    loadCoreObjects                          N      com.linbit.linstor.dbdrivers.DatabaseLoader:640
    loadCoreObjects                          N      com.linbit.linstor.core.DbDataInitializer:169
    initialize                               N      com.linbit.linstor.core.DbDataInitializer:101
    startSystemServices                      N      com.linbit.linstor.core.ApplicationLifecycleManager:88
    start                                    N      com.linbit.linstor.core.Controller:375
    main                                     N      com.linbit.linstor.core.Controller:627

Caused by:
==========

Category:                           RuntimeException
Class name:                         NullPointerException
Class canonical name:               java.lang.NullPointerException
Generated at:                       Method 'getRscData', Source file 'AbsLayerVlmDataDbDriver.java', Line #66

Error message:                      Cannot read field "rscData" because the return value of "java.util.Map.get(Object)" is null

Call backtrace:

    Method                                   Native Class:Line number
    getRscData                               N      com.linbit.linstor.core.objects.AbsLayerVlmDataDbDriver$VlmParentObjects:66
    load                                     N      com.linbit.linstor.core.objects.LayerDrbdVlmDbDriver:153
    load                                     N      com.linbit.linstor.core.objects.LayerDrbdVlmDbDriver:39
    loadAll                                  N      com.linbit.linstor.dbdrivers.k8s.crd.K8sCrdEngine:238
    loadAll                                  N      com.linbit.linstor.dbdrivers.AbsDatabaseDriver:180
    loadAll                                  N      com.linbit.linstor.core.objects.AbsLayerVlmDataDbDriver:96
    loadAllLayerVlmData                      N      com.linbit.linstor.core.objects.AbsLayerRscDataDbDriver:317
    loadLayerObects                          N      com.linbit.linstor.dbdrivers.DatabaseLoader:773
    loadCoreObjects                          N      com.linbit.linstor.dbdrivers.DatabaseLoader:640
    loadCoreObjects                          N      com.linbit.linstor.core.DbDataInitializer:169
    initialize                               N      com.linbit.linstor.core.DbDataInitializer:101
    startSystemServices                      N      com.linbit.linstor.core.ApplicationLifecycleManager:88
    start                                    N      com.linbit.linstor.core.Controller:375
    main                                     N      com.linbit.linstor.core.Controller:627


END OF ERROR REPORT.

ERROR REPORT 67979A9C-00000-000001

============================================================

Application:                        LINBIT® LINSTOR
Module:                             Controller
Version:                            1.29.2
Build ID:                           372c916b7d97fa10e8ea480b66ea3da665ab5849
Build time:                         2024-11-05T11:22:22+00:00
Error time:                         2025-01-27 14:39:46
Node:                               linstor-controller-547876f655-8zspc
Thread:                             Main

============================================================

Reported error:
===============

Category:                           LinStorException
Class name:                         SystemServiceStartException
Class canonical name:               com.linbit.SystemServiceStartException
Generated at:                       Method 'startSystemServices', Source file 'ApplicationLifecycleManager.java', Line #104

Error message:                      Unhandled exception

ErrorContext:


Call backtrace:

    Method                                   Native Class:Line number
    startSystemServices                      N      com.linbit.linstor.core.ApplicationLifecycleManager:104
    start                                    N      com.linbit.linstor.core.Controller:375
    main                                     N      com.linbit.linstor.core.Controller:627

Caused by:
==========

Category:                           RuntimeException
Class name:                         LinStorDBRuntimeException
Class canonical name:               com.linbit.linstor.LinStorDBRuntimeException
Generated at:                       Method 'loadAll', Source file 'K8sCrdEngine.java', Line #267

Error message:                      Database entry of table LAYER_DRBD_VOLUMES could not be restored.

ErrorContext:   Details:     Primary key: LAYER_RESOURCE_ID = '23232', VLM_NR = '0'


Call backtrace:

    Method                                   Native Class:Line number
    loadAll                                  N      com.linbit.linstor.dbdrivers.k8s.crd.K8sCrdEngine:267
    loadAll                                  N      com.linbit.linstor.dbdrivers.AbsDatabaseDriver:180
    loadAll                                  N      com.linbit.linstor.core.objects.AbsLayerVlmDataDbDriver:96
    loadAllLayerVlmData                      N      com.linbit.linstor.core.objects.AbsLayerRscDataDbDriver:317
    loadLayerObects                          N      com.linbit.linstor.dbdrivers.DatabaseLoader:773
    loadCoreObjects                          N      com.linbit.linstor.dbdrivers.DatabaseLoader:640
    loadCoreObjects                          N      com.linbit.linstor.core.DbDataInitializer:169
    initialize                               N      com.linbit.linstor.core.DbDataInitializer:101
    startSystemServices                      N      com.linbit.linstor.core.ApplicationLifecycleManager:88
    start                                    N      com.linbit.linstor.core.Controller:375
    main                                     N      com.linbit.linstor.core.Controller:627

Caused by:
==========

Category:                           RuntimeException
Class name:                         NullPointerException
Class canonical name:               java.lang.NullPointerException
Generated at:                       Method 'getRscData', Source file 'AbsLayerVlmDataDbDriver.java', Line #66

Error message:                      Cannot read field "rscData" because the return value of "java.util.Map.get(Object)" is null

Call backtrace:

    Method                                   Native Class:Line number
    getRscData                               N      com.linbit.linstor.core.objects.AbsLayerVlmDataDbDriver$VlmParentObjects:66
    load                                     N      com.linbit.linstor.core.objects.LayerDrbdVlmDbDriver:153
    load                                     N      com.linbit.linstor.core.objects.LayerDrbdVlmDbDriver:39
    loadAll                                  N      com.linbit.linstor.dbdrivers.k8s.crd.K8sCrdEngine:238
    loadAll                                  N      com.linbit.linstor.dbdrivers.AbsDatabaseDriver:180
    loadAll                                  N      com.linbit.linstor.core.objects.AbsLayerVlmDataDbDriver:96
    loadAllLayerVlmData                      N      com.linbit.linstor.core.objects.AbsLayerRscDataDbDriver:317
    loadLayerObects                          N      com.linbit.linstor.dbdrivers.DatabaseLoader:773
    loadCoreObjects                          N      com.linbit.linstor.dbdrivers.DatabaseLoader:640
    loadCoreObjects                          N      com.linbit.linstor.core.DbDataInitializer:169
    initialize                               N      com.linbit.linstor.core.DbDataInitializer:101
    startSystemServices                      N      com.linbit.linstor.core.ApplicationLifecycleManager:88
    start                                    N      com.linbit.linstor.core.Controller:375
    main                                     N      com.linbit.linstor.core.Controller:627


END OF ERROR REPORT.

I tried fixing the database by deleting the records it was complaining about, but now I’m no longer getting any hints on what to do next:

ERROR REPORT 6798A7C6-00000-000001

============================================================

Application:                        LINBIT® LINSTOR
Module:                             Controller
Version:                            1.29.2
Build ID:                           372c916b7d97fa10e8ea480b66ea3da665ab5849
Build time:                         2024-11-05T11:22:22+00:00
Error time:                         2025-01-28 09:48:12
Node:                               linstor-controller-547876f655-zjgkf
Thread:                             Main

============================================================

Reported error:
===============

Category:                           LinStorException
Class name:                         SystemServiceStartException
Class canonical name:               com.linbit.SystemServiceStartException
Generated at:                       Method 'startSystemServices', Source file 'ApplicationLifecycleManager.java', Line #104

Error message:                      Unhandled exception

ErrorContext:


Call backtrace:

    Method                                   Native Class:Line number
    startSystemServices                      N      com.linbit.linstor.core.ApplicationLifecycleManager:104
    start                                    N      com.linbit.linstor.core.Controller:375
    main                                     N      com.linbit.linstor.core.Controller:627

Caused by:
==========

Category:                           RuntimeException
Class name:                         NullPointerException
Class canonical name:               java.lang.NullPointerException
Generated at:                       Method 'allocateAfterDbLoad', Source file 'ExosMappingManager.java', Line #88

Error message:                      Cannot invoke "com.linbit.linstor.core.objects.StorPool.getDeviceProviderKind()" because "storPool" is null

Call backtrace:

    Method                                   Native Class:Line number
    allocateAfterDbLoad                      N      com.linbit.linstor.storage.utils.ExosMappingManager:88
    loadCoreObjects                          N      com.linbit.linstor.dbdrivers.DatabaseLoader:666
    loadCoreObjects                          N      com.linbit.linstor.core.DbDataInitializer:169
    initialize                               N      com.linbit.linstor.core.DbDataInitializer:101
    startSystemServices                      N      com.linbit.linstor.core.ApplicationLifecycleManager:88
    start                                    N      com.linbit.linstor.core.Controller:375
    main                                     N      com.linbit.linstor.core.Controller:627


END OF ERROR REPORT.

There isn’t much you can do in this case besides restoring the database from backup. It appears that LINSTOR is looking for a reference to a storage pool that is not present in these records.

I see you’ve also commented on the GitHub issue here, so I’ll link it, as it provides some relevant discussion that might help future forum-goers:

https://github.com/LINBIT/linstor-server/issues/433

I won’t rehash what’s already stated in the linked GitHub issue, but I will add this: if you have any insight into what circumstances seem to lead to this state in your environment, especially anything that could translate into steps to reproduce it, that information would be welcome and may get us closer to a fix. As it stands, the errors shared suggest to me only that the database is corrupted, not why. You mentioned that the earlier failure in this thread was preceded by the control plane going down. Is that consistent in your experience? Were there any other factors, such as what you were doing with the cluster or LINSTOR at the time?

For the sake of completeness, here are the instructions from the Piraeus project documentation for restoring from a backup:

https://github.com/piraeusdatastore/piraeus-operator/blob/v2/docs/how-to/restore-linstor-db.md

Thanks for your response, @liniac.

Unfortunately, the last backup was made over a year ago, which suggests that backups are only created during upgrades. Reverting to that backup would result in significant data loss, which I’m not ready to accept. Instead, I’m attempting to delete all LINSTOR CRDs that were created x days before the incident, hoping this will allow me to restore a bootable state.

I don’t have concrete evidence, but I suspect this issue occurs when the control plane is either down or overloaded. It seems that certain writes to the Kubernetes API fail, leading to state corruption. Another possible explanation is that as the number of resources managed by the controller increases, timeouts occur when fetching resources. I’ve documented and reported one such occurrence here, which also references this forum post:

To my surprise, the deletion strategy actually worked! The LINSTOR controller is now able to boot again, and the cluster is back online. Below I’ll share more context and the steps I took to resolve the issue.

Context
We run LINSTOR in a K3s cluster, backed by an external MySQL database that stores all Kubernetes state data. While we hadn’t recently deployed new workloads, I noticed a significant number of manifests created in the last 43 days. Upon inspection, most of them were snapshot-related resources.

We use Velero for automated backups, and I suspected that the accumulation of these snapshots could be the leading cause of the database corruption. That would be the same issue I reported here: GitHub Issue #290.

Resolution
To restore the cluster to a functional state, I executed a MySQL query to identify and delete LINSTOR-related resources created within the last 42 days. After backing up these records, I deleted them and restarted the LINSTOR controller. This successfully restored the cluster’s functionality.

Here’s the SQL query I used:

DELETE FROM `defaultdb`.`kine`
WHERE (
  `value` LIKE '%"creationTimestamp":"2025-01-%'
  OR `value` LIKE '%"creationTimestamp":"2024-12-2%'
  OR `value` LIKE '%"creationTimestamp":"2024-12-19%'
  OR `value` LIKE '%"creationTimestamp":"2024-12-18%'
  OR `value` LIKE '%"creationTimestamp":"2024-12-17%'
) 
AND (`name` LIKE '%/registry/internal.linstor.linbit.com%');

This action removed approximately 62% of all stored k8s data: 63,016 of 101,327 rows. After this cleanup and a restart, the LINSTOR controller booted successfully.
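The back-up-then-delete step described above can be sketched as a single transaction, so that rows are never removed unless a copy of them was written first. This is only an illustration under stated assumptions: it uses an in-memory SQLite database as a stand-in for the MySQL datastore, a simplified two-column `kine` table, and a hypothetical helper `backup_and_delete` with a hypothetical `kine_backup` table. The sample rows are made up.

```python
import sqlite3


def backup_and_delete(conn, name_pattern, value_patterns):
    """Copy matching rows into kine_backup, then delete them, in one transaction."""
    if not value_patterns:
        raise ValueError("at least one value pattern is required")
    where = "name LIKE ? AND (" + " OR ".join("value LIKE ?" for _ in value_patterns) + ")"
    params = [name_pattern, *value_patterns]
    with conn:  # commits on success, rolls back if any statement raises
        # Empty copy of the table layout (hypothetical backup table).
        conn.execute("CREATE TABLE IF NOT EXISTS kine_backup AS SELECT * FROM kine WHERE 0")
        conn.execute(f"INSERT INTO kine_backup SELECT * FROM kine WHERE {where}", params)
        deleted = conn.execute(f"DELETE FROM kine WHERE {where}", params).rowcount
    return deleted


# Simplified stand-in for the real table; only the two columns the query uses.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kine (name TEXT, value TEXT)")
conn.executemany("INSERT INTO kine VALUES (?, ?)", [
    ("/registry/internal.linstor.linbit.com/a",
     '{"metadata":{"creationTimestamp":"2025-01-05T00:00:00Z"}}'),
    ("/registry/internal.linstor.linbit.com/b",
     '{"metadata":{"creationTimestamp":"2024-11-01T00:00:00Z"}}'),
    ("/registry/pods/c",
     '{"metadata":{"creationTimestamp":"2025-01-05T00:00:00Z"}}'),
])
conn.commit()

removed = backup_and_delete(
    conn,
    "%/registry/internal.linstor.linbit.com%",
    ['%"creationTimestamp":"2025-01-%'],
)
remaining = conn.execute("SELECT COUNT(*) FROM kine").fetchone()[0]
backed_up = conn.execute("SELECT COUNT(*) FROM kine_backup").fetchone()[0]
print(removed, remaining, backed_up)
```

Only the LINSTOR row with a January 2025 timestamp is moved to the backup table; the older LINSTOR row and the unrelated pod row stay in place. Against the real datastore the same two statements would need to be adapted to MySQL and to the full table layout.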

Observations
Based on this, I believe the sheer number of snapshot-related resources contributed to the instability of the LINSTOR cluster, eventually leading to corruption. As the total number of resources grows, the Kubernetes API becomes more sluggish, increasing timeouts and making the system unresponsive. This aligns with a previous issue I reported regarding snapshot cleanup.

For now, I’m considering disabling the current backup strategy to prevent this issue from happening again while the underlying problem remains unresolved. Given the high number of snapshot-related resources causing instability, I’ll need to evaluate alternative backup approaches or adjust retention policies to avoid overwhelming the cluster.
