Linstor-controller just went down

Hello I hope someone can help me :pray:

Our Linstor production cluster just went down and I’m not able to get it back up. The controller is in a crashloop spitting out this error message every time:

ERROR REPORT 66CE118B-00000-000000

============================================================

Application:                        LINBIT® LINSTOR
Module:                             Controller
Version:                            1.28.0
Build ID:                           959382f7b4fb9436fefdd21dfa262e90318edaed
Build time:                         2024-07-11T10:21:06+00:00
Error time:                         2024-08-27 17:49:13
Node:                               linstor-controller-7b9c4ccd45-xk4lf
Thread:                             Main

============================================================

Reported error:
===============

Category:                           Error
Class name:                         ImplementationError
Class canonical name:               com.linbit.ImplementationError
Generated at:                       Method 'loadCoreObjects', Source file 'DatabaseLoader.java', Line #680

Error message:                      Unknown error during loading data from DB

Call backtrace:

    Method                                   Native Class:Line number
    loadCoreObjects                          N      com.linbit.linstor.dbdrivers.DatabaseLoader:680
    loadCoreObjects                          N      com.linbit.linstor.core.DbDataInitializer:169
    initialize                               N      com.linbit.linstor.core.DbDataInitializer:101
    startSystemServices                      N      com.linbit.linstor.core.ApplicationLifecycleManager:87
    start                                    N      com.linbit.linstor.core.Controller:374
    main                                     N      com.linbit.linstor.core.Controller:625

Caused by:
==========

Category:                           LinStorException
Class name:                         DatabaseException
Class canonical name:               com.linbit.linstor.dbdrivers.DatabaseException
Generated at:                       Method 'loadAll', Source file 'AbsDatabaseDriver.java', Line #190

Error message:                      Failed to restore data

ErrorContext:


Call backtrace:

    Method                                   Native Class:Line number
    loadAll                                  N      com.linbit.linstor.dbdrivers.AbsDatabaseDriver:190
    cacheAll                                 N      com.linbit.linstor.core.objects.AbsLayerRscDfnDataDbDriver:55
    cacheAll                                 N      com.linbit.linstor.core.objects.AbsLayerRscDataDbDriver:261
    loadLayerObects                          N      com.linbit.linstor.dbdrivers.DatabaseLoader:728
    loadCoreObjects                          N      com.linbit.linstor.dbdrivers.DatabaseLoader:640
    loadCoreObjects                          N      com.linbit.linstor.core.DbDataInitializer:169
    initialize                               N      com.linbit.linstor.core.DbDataInitializer:101
    startSystemServices                      N      com.linbit.linstor.core.ApplicationLifecycleManager:87
    start                                    N      com.linbit.linstor.core.Controller:374
    main                                     N      com.linbit.linstor.core.Controller:625

Caused by:
==========

Category:                           Exception
Class name:                         ValueInUseException
Class canonical name:               com.linbit.ValueInUseException
Generated at:                       Method 'allocate', Source file 'DynamicNumberPoolImpl.java', Line #124

Error message:                      TCP port 7077 is already in use

Call backtrace:

    Method                                   Native Class:Line number
    allocate                                 N      com.linbit.linstor.numberpool.DynamicNumberPoolImpl:124
    <init>                                   N      com.linbit.linstor.storage.data.adapter.drbd.DrbdRscDfnData:110
    genericCreate                            N      com.linbit.linstor.core.objects.LayerDrbdRscDfnDbDriver:268
    load                                     N      com.linbit.linstor.core.objects.LayerDrbdRscDfnDbDriver:231
    load                                     N      com.linbit.linstor.core.objects.LayerDrbdRscDfnDbDriver:49
    loadAll                                  N      com.linbit.linstor.dbdrivers.k8s.crd.K8sCrdEngine:238
    loadAll                                  N      com.linbit.linstor.dbdrivers.AbsDatabaseDriver:180
    cacheAll                                 N      com.linbit.linstor.core.objects.AbsLayerRscDfnDataDbDriver:55
    cacheAll                                 N      com.linbit.linstor.core.objects.AbsLayerRscDataDbDriver:261
    loadLayerObects                          N      com.linbit.linstor.dbdrivers.DatabaseLoader:728
    loadCoreObjects                          N      com.linbit.linstor.dbdrivers.DatabaseLoader:640
    loadCoreObjects                          N      com.linbit.linstor.core.DbDataInitializer:169
    initialize                               N      com.linbit.linstor.core.DbDataInitializer:101
    startSystemServices                      N      com.linbit.linstor.core.ApplicationLifecycleManager:87
    start                                    N      com.linbit.linstor.core.Controller:374
    main                                     N      com.linbit.linstor.core.Controller:625


END OF ERROR REPORT.

LINSTOR’s complaint is that it cannot allocate TCP port 7077 because it is already in use. By default, LINSTOR allocates resource ports from the range 7000 - 7999, but if something is already occupying a port when LINSTOR attempts to assign it, the result is an error like this one.

Is that the case in your situation?

If so, you can change the TcpPortAutoRange controller property to a different range that works better for your stack. You mention the controller isn’t staying up, though; can you describe the behavior you are seeing in more detail? Is this a LINSTOR in Kubernetes deployment? If not, what does your environment look like (number of controllers/satellites)?
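For the port range change, a minimal sketch using the LINSTOR client (the new range shown is just an example; pick one that is free in your environment):

# move DRBD's auto-allocated TCP ports to a different range (example values)
linstor controller set-property TcpPortAutoRange 10000-10999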

Thanks for your prompt reply.

Fortunately, I was able to restore the LINSTOR controllers within 2 hours of the outage.

Upon further investigation, I found that there were multiple DRBD resource definitions attempting to use the same TCP port (7077), which led to the controller crash. Specifically, the following resource definitions were involved:

{
   "apiVersion":"internal.linstor.linbit.com/v1-27-1",
   "kind":"LayerDrbdResourceDefinitions",
   "metadata":{
      "creationTimestamp":"2024-08-27T16:48:18Z",
      "generation":1,
      "name":"c1c9d57513a2ed77a149efe7aa74b2a0155954b5c0c0da0cc0e82d6a1d6e7126",
      "uid":"a9cee388-eb77-4b0a-b421-7861ff475eb9"
   },
   "spec":{
      "al_stripe_size":32,
      "al_stripes":1,
      "peer_slots":7,
      "resource_name":"PVC-00A767E6-18BF-477D-8595-CDB9BA606B36",
      "resource_name_suffix":"",
      "secret":"yui4uCFI4Ke4wD8QIpvI",
      "snapshot_name":"",
      "tcp_port":7077,
      "transport_type":"IP"
   }
}
{
   "apiVersion":"internal.linstor.linbit.com/v1-27-1",
   "kind":"LayerDrbdResourceDefinitions",
   "metadata":{
      "creationTimestamp":"2024-08-27T16:47:37Z",
      "generation":1,
      "name":"54a30c138618a3fe355253cae97291b0d4457f365b9754278759789bd630518c",
      "uid":"6385ca3a-8e4f-46c3-b285-34332e3779fb"
   },
   "spec":{
      "al_stripe_size":32,
      "al_stripes":1,
      "peer_slots":7,
      "resource_name":"PVC-90F08CA1-DDE8-4A82-9794-D50F411F8712",
      "resource_name_suffix":"",
      "secret":"63zoFVtp1aZkLHGNVVLv",
      "snapshot_name":"",
      "tcp_port":7077,
      "transport_type":"IP"
   }
}
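Both definitions claim tcp_port 7077. Duplicates like this can be surfaced by dumping the resource definition CRDs and sorting by port; a rough sketch, assuming the CRD is registered as layerdrbdresourcedefinitions.internal.linstor.linbit.com and jq is available:

# list TCP port, resource name and CRD object name for every DRBD resource definition,
# sorted by port so duplicates end up on adjacent lines
kubectl get layerdrbdresourcedefinitions.internal.linstor.linbit.com -o json \
  | jq -r '.items[] | "\(.spec.tcp_port)\t\(.spec.resource_name)\t\(.metadata.name)"' \
  | sort -n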

After detecting these conflicting resource definitions, I decided to delete them to resolve the issue. However, since there was no straightforward way to delete these entries directly via the LINSTOR controller, I had to search through all LINSTOR CRDs matching the PVC names listed above.

I methodically deleted these records, and after each deletion I attempted to start the LINSTOR controller again. Each time the controller crashed, the new error report provided additional hints about which CRDs needed to be removed (I kept backups of each object in case I deleted the wrong ones). This trial-and-error process was time-consuming, but by following the clues from each error report, I was eventually able to delete all the problematic entries and get the controller running successfully.
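Roughly, each backup-and-delete step looked like this (the object name here is the first one from the JSON above; the CRD resource name may differ between LINSTOR versions):

# save the object before deleting it, so it can be restored if it turns out to be the wrong one
kubectl get layerdrbdresourcedefinitions.internal.linstor.linbit.com \
  c1c9d57513a2ed77a149efe7aa74b2a0155954b5c0c0da0cc0e82d6a1d6e7126 \
  -o yaml > layerdrbdrscdfn-backup.yaml
kubectl delete layerdrbdresourcedefinitions.internal.linstor.linbit.com \
  c1c9d57513a2ed77a149efe7aa74b2a0155954b5c0c0da0cc0e82d6a1d6e7126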

Reflecting on the cause of the issue, it seems that the data may have been corrupted somehow. Since I’m using the Kubernetes store, all information is stored within Kubernetes itself. I’m concerned about its reliability compared to using a dedicated database like PostgreSQL, which I believe offers more robust transactional support in the case of a failure like today’s. Ending up in a state like this shouldn’t be possible, or the controller should be able to recover from it.

How can I prevent such issues from occurring in the future?

One thing I forgot to mention is that, prior to this issue, the control plane of the Kubernetes cluster went down. I believe this affected LINSTOR as it could not communicate with the K8S API. Having said that, I would expect LINSTOR to be able to resist such an outage without resulting in data corruption. Ideally, any transaction in progress during the outage should be rolled back to prevent issues like this.

Would the outcome have been different if PostgreSQL had been used instead of the Kubernetes store? Specifically, what would happen if PostgreSQL went down during a random outage? Would LINSTOR be more resilient in that scenario, or are there other steps that should be taken to protect against data corruption?

To confirm, which backend are you currently using for LINSTOR? You can display it with:

kubectl exec deploy/linstor-op-cs-controller -- cat /etc/linstor/linstor.toml

A LINSTOR cluster within k8s using an etcd database might run into problems like the ones you’ve encountered, which is why the current guidance is to migrate to using the Kubernetes API directly to persist the cluster state. If that’s what you are already using, the explanation for why the controller behaved the way it did may be different. That said, I consider the k8s API more resilient than a separate database, due to the tighter coupling.

# /etc/linstor/linstor.toml
[db]
  connection_url = "k8s"

We are running LINSTOR using the Piraeus Operator on a K3s cluster with MySQL backing its datastore.