LINSTOR Controller Crashing

Hi,
The datastore in our lab environment has become corrupted, and the LINSTOR controller is crashing at startup.

This began when the “LINBIT Software Download Page For LINSTOR And DRBD Linux Driver” taint kept being added to the nodes, which prevented workloads from being scheduled.

I attempted to restart all pods in the piraeus-datastore namespace, but the LINSTOR controller wouldn’t start again and began producing error reports about missing resource definitions.

I attempted to delete the affected custom resources through the K8s API, but the LINSTOR controller still fails to start and now complains about a conflicting port for volumes:

============================================================

Application:                        LINBIT® LINSTOR
Module:                             Controller
Version:                            1.29.2
Build ID:                           372c916b7d97fa10e8ea480b66ea3da665ab5849
Build time:                         2024-11-05T11:22:22+00:00
Error time:                         2025-05-14 14:19:14
Node:                               linstor-controller-5dbb969895-mlgcq
Thread:                             Main

============================================================

Reported error:
===============

Category:                           Error
Class name:                         ImplementationError
Class canonical name:               com.linbit.ImplementationError
Generated at:                       Method 'loadCoreObjects', Source file 'DatabaseLoader.java', Line #680

Error message:                      Unknown error during loading data from DB

Call backtrace:

    Method                                   Native Class:Line number
    loadCoreObjects                          N      com.linbit.linstor.dbdrivers.DatabaseLoader:680
    loadCoreObjects                          N      com.linbit.linstor.core.DbDataInitializer:169
    initialize                               N      com.linbit.linstor.core.DbDataInitializer:101
    startSystemServices                      N      com.linbit.linstor.core.ApplicationLifecycleManager:88
    start                                    N      com.linbit.linstor.core.Controller:375
    main                                     N      com.linbit.linstor.core.Controller:627

Caused by:
==========

Category:                           LinStorException
Class name:                         DatabaseException
Class canonical name:               com.linbit.linstor.dbdrivers.DatabaseException
Generated at:                       Method 'loadAll', Source file 'AbsDatabaseDriver.java', Line #190

Error message:                      Failed to restore data

ErrorContext:


Call backtrace:

    Method                                   Native Class:Line number
    loadAll                                  N      com.linbit.linstor.dbdrivers.AbsDatabaseDriver:190
    cacheAll                                 N      com.linbit.linstor.core.objects.AbsLayerRscDfnDataDbDriver:55
    cacheAll                                 N      com.linbit.linstor.core.objects.AbsLayerRscDataDbDriver:261
    loadLayerObects                          N      com.linbit.linstor.dbdrivers.DatabaseLoader:728
    loadCoreObjects                          N      com.linbit.linstor.dbdrivers.DatabaseLoader:640
    loadCoreObjects                          N      com.linbit.linstor.core.DbDataInitializer:169
    initialize                               N      com.linbit.linstor.core.DbDataInitializer:101
    startSystemServices                      N      com.linbit.linstor.core.ApplicationLifecycleManager:88
    start                                    N      com.linbit.linstor.core.Controller:375
    main                                     N      com.linbit.linstor.core.Controller:627

Caused by:
==========

Category:                           Exception
Class name:                         ValueInUseException
Class canonical name:               com.linbit.ValueInUseException
Generated at:                       Method 'allocate', Source file 'DynamicNumberPoolImpl.java', Line #124

Error message:                      TCP port 7037 is already in use

Call backtrace:

    Method                                   Native Class:Line number
    allocate                                 N      com.linbit.linstor.numberpool.DynamicNumberPoolImpl:124
    <init>                                   N      com.linbit.linstor.storage.data.adapter.drbd.DrbdRscDfnData:110
    genericCreate                            N      com.linbit.linstor.core.objects.LayerDrbdRscDfnDbDriver:268
    load                                     N      com.linbit.linstor.core.objects.LayerDrbdRscDfnDbDriver:231
    load                                     N      com.linbit.linstor.core.objects.LayerDrbdRscDfnDbDriver:49
    loadAll                                  N      com.linbit.linstor.dbdrivers.k8s.crd.K8sCrdEngine:238
    loadAll                                  N      com.linbit.linstor.dbdrivers.AbsDatabaseDriver:180
    cacheAll                                 N      com.linbit.linstor.core.objects.AbsLayerRscDfnDataDbDriver:55
    cacheAll                                 N      com.linbit.linstor.core.objects.AbsLayerRscDataDbDriver:261
    loadLayerObects                          N      com.linbit.linstor.dbdrivers.DatabaseLoader:728
    loadCoreObjects                          N      com.linbit.linstor.dbdrivers.DatabaseLoader:640
    loadCoreObjects                          N      com.linbit.linstor.core.DbDataInitializer:169
    initialize                               N      com.linbit.linstor.core.DbDataInitializer:101
    startSystemServices                      N      com.linbit.linstor.core.ApplicationLifecycleManager:88
    start                                    N      com.linbit.linstor.core.Controller:375
    main                                     N      com.linbit.linstor.core.Controller:627


END OF ERROR REPORT.

I’m curious to know how to restore the state so that the LINSTOR controller can run again.

I can provide the drbdadm dump output if it helps to see what’s in there.

Ideally, the LINSTOR controller should only log warnings rather than fail hard when datastore resources are not mapped properly, so that the API stays accessible. Otherwise, this becomes a real problem in production and causes long-lasting outages. I’d rather have a few volumes not working than the whole system down.

Small update: deleting a few additional Kubernetes LINBIT resources associated with the problematic PVC solved the issue, and the LINSTOR controller is running again.
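
For anyone who hits the same "TCP port ... is already in use" error, here is a minimal sketch of how the conflicting entries could be located before deleting anything. It assumes the JSON comes from something like `kubectl get layerdrbdresourcedefinitions.internal.linstor.linbit.com -o json` and that each item's `spec` carries `tcp_port` and `resource_name` fields. The CRD name and both field names are assumptions on my part, so check them against your cluster first. The sample data and resource names below are hypothetical.

```python
import json
from collections import defaultdict


def find_duplicate_ports(crd_list, port_field="tcp_port", name_field="resource_name"):
    """Return {port: [names, ...]} for every port claimed by more than one entry.

    crd_list is the parsed JSON of a Kubernetes list object ({"items": [...]}).
    port_field and name_field are ASSUMED spec field names; adjust as needed.
    """
    by_port = defaultdict(list)
    for item in crd_list.get("items", []):
        spec = item.get("spec", {})
        port = spec.get(port_field)
        if port is None:
            continue  # entry has no port recorded, nothing to compare
        # Fall back to the Kubernetes object name if the spec has no resource name.
        by_port[port].append(spec.get(name_field) or item["metadata"]["name"])
    return {port: names for port, names in by_port.items() if len(names) > 1}


# Hypothetical sample standing in for the kubectl JSON output.
sample = json.loads("""
{"items": [
  {"metadata": {"name": "obj-a"}, "spec": {"tcp_port": 7037, "resource_name": "PVC-AAAA"}},
  {"metadata": {"name": "obj-b"}, "spec": {"tcp_port": 7037, "resource_name": "PVC-BBBB"}},
  {"metadata": {"name": "obj-c"}, "spec": {"tcp_port": 7038, "resource_name": "PVC-CCCC"}}
]}
""")

for port, names in find_duplicate_ports(sample).items():
    print(f"TCP port {port} is claimed by: {', '.join(names)}")
```

To use it against a real cluster, replace `sample` with the parsed output of the kubectl command. Taking a backup of the custom resources before deleting any of them seems prudent.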

I’d like to stress the need for the controller to handle resource integrity check failures gracefully and keep the API available instead of failing hard.
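
To illustrate the behaviour I have in mind, here is a small sketch. It is not how LINSTOR's loader works; the function name and the (name, port) input shape are made up for the example. The idea is that a port conflict during loading skips the offending entry and records a warning instead of aborting startup.

```python
def load_tolerantly(entries):
    """Load (name, port) pairs, skipping port conflicts instead of raising.

    Hypothetical illustration only. Returns (loaded_names, warnings).
    """
    used = {}       # port -> name of the entry that claimed it first
    loaded = []
    warnings = []
    for name, port in entries:
        if port in used:
            # A hard failure here would take the whole controller down;
            # record the problem and carry on with the remaining entries.
            warnings.append(
                f"{name}: TCP port {port} is already in use by {used[port]}; skipped"
            )
            continue
        used[port] = name
        loaded.append(name)
    return loaded, warnings


loaded, warnings = load_tolerantly([("rsc-a", 7037), ("rsc-b", 7037), ("rsc-c", 7038)])
print("loaded:", loaded)
for w in warnings:
    print("WARNING:", w)
```

With something like this, the API would come up with the two healthy entries available, and the conflicting one could be repaired afterwards through the normal tooling.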