Hi,
Our lab environment's datastore has become corrupted and the LINSTOR controller is crashing at startup.
This began when the “LINBIT Software Download Page For LINSTOR And DRBD Linux Driver” taint kept being added to the nodes, preventing workloads from being scheduled.
I restarted all pods in the piraeus-datastore namespace, but the LINSTOR controller would not come back up and began producing error reports about missing resource definitions.
I then tried deleting the affected custom resources through the K8s API, but the LINSTOR controller still fails to start and now complains about a conflicting port for volumes.
============================================================
Application: LINBIT® LINSTOR
Module: Controller
Version: 1.29.2
Build ID: 372c916b7d97fa10e8ea480b66ea3da665ab5849
Build time: 2024-11-05T11:22:22+00:00
Error time: 2025-05-14 14:19:14
Node: linstor-controller-5dbb969895-mlgcq
Thread: Main
============================================================
Reported error:
===============
Category: Error
Class name: ImplementationError
Class canonical name: com.linbit.ImplementationError
Generated at: Method 'loadCoreObjects', Source file 'DatabaseLoader.java', Line #680
Error message: Unknown error during loading data from DB
Call backtrace:
Method Native Class:Line number
loadCoreObjects N com.linbit.linstor.dbdrivers.DatabaseLoader:680
loadCoreObjects N com.linbit.linstor.core.DbDataInitializer:169
initialize N com.linbit.linstor.core.DbDataInitializer:101
startSystemServices N com.linbit.linstor.core.ApplicationLifecycleManager:88
start N com.linbit.linstor.core.Controller:375
main N com.linbit.linstor.core.Controller:627
Caused by:
==========
Category: LinStorException
Class name: DatabaseException
Class canonical name: com.linbit.linstor.dbdrivers.DatabaseException
Generated at: Method 'loadAll', Source file 'AbsDatabaseDriver.java', Line #190
Error message: Failed to restore data
ErrorContext:
Call backtrace:
Method Native Class:Line number
loadAll N com.linbit.linstor.dbdrivers.AbsDatabaseDriver:190
cacheAll N com.linbit.linstor.core.objects.AbsLayerRscDfnDataDbDriver:55
cacheAll N com.linbit.linstor.core.objects.AbsLayerRscDataDbDriver:261
loadLayerObects N com.linbit.linstor.dbdrivers.DatabaseLoader:728
loadCoreObjects N com.linbit.linstor.dbdrivers.DatabaseLoader:640
loadCoreObjects N com.linbit.linstor.core.DbDataInitializer:169
initialize N com.linbit.linstor.core.DbDataInitializer:101
startSystemServices N com.linbit.linstor.core.ApplicationLifecycleManager:88
start N com.linbit.linstor.core.Controller:375
main N com.linbit.linstor.core.Controller:627
Caused by:
==========
Category: Exception
Class name: ValueInUseException
Class canonical name: com.linbit.ValueInUseException
Generated at: Method 'allocate', Source file 'DynamicNumberPoolImpl.java', Line #124
Error message: TCP port 7037 is already in use
Call backtrace:
Method Native Class:Line number
allocate N com.linbit.linstor.numberpool.DynamicNumberPoolImpl:124
<init> N com.linbit.linstor.storage.data.adapter.drbd.DrbdRscDfnData:110
genericCreate N com.linbit.linstor.core.objects.LayerDrbdRscDfnDbDriver:268
load N com.linbit.linstor.core.objects.LayerDrbdRscDfnDbDriver:231
load N com.linbit.linstor.core.objects.LayerDrbdRscDfnDbDriver:49
loadAll N com.linbit.linstor.dbdrivers.k8s.crd.K8sCrdEngine:238
loadAll N com.linbit.linstor.dbdrivers.AbsDatabaseDriver:180
cacheAll N com.linbit.linstor.core.objects.AbsLayerRscDfnDataDbDriver:55
cacheAll N com.linbit.linstor.core.objects.AbsLayerRscDataDbDriver:261
loadLayerObects N com.linbit.linstor.dbdrivers.DatabaseLoader:728
loadCoreObjects N com.linbit.linstor.dbdrivers.DatabaseLoader:640
loadCoreObjects N com.linbit.linstor.core.DbDataInitializer:169
initialize N com.linbit.linstor.core.DbDataInitializer:101
startSystemServices N com.linbit.linstor.core.ApplicationLifecycleManager:88
start N com.linbit.linstor.core.Controller:375
main N com.linbit.linstor.core.Controller:627
END OF ERROR REPORT.
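Since the innermost cause is a port being allocated twice while the DRBD resource definitions are loaded from the K8s CRD backend, one way to narrow this down might be to look for entries that claim the same port. The following is only a minimal sketch, not a LINSTOR tool: it assumes a JSON export of those entries in the usual Kubernetes list shape (`items`, each with a `spec`), and the field names `tcp_port` and `resource_name` are assumptions about how the datastore stores them. The sample entries and their names are hypothetical.

```python
from collections import defaultdict


def find_port_conflicts(items, port_field="tcp_port", name_field="resource_name"):
    """Return {port: [names]} for every port claimed by more than one entry.

    The field names are assumptions about the stored spec layout; adjust
    them to whatever the exported entries actually contain.
    """
    by_port = defaultdict(list)
    for item in items:
        spec = item.get("spec", {})
        port = spec.get(port_field)
        if port is None:
            continue  # an entry without a port cannot collide
        by_port[port].append(spec.get(name_field, "<unknown>"))
    return {port: names for port, names in by_port.items() if len(names) > 1}


# Hypothetical sample data standing in for a parsed JSON export.
sample = {
    "items": [
        {"spec": {"resource_name": "pvc-example-a", "tcp_port": 7037}},
        {"spec": {"resource_name": "pvc-example-b", "tcp_port": 7037}},
        {"spec": {"resource_name": "pvc-example-c", "tcp_port": 7038}},
    ]
}

for port, names in sorted(find_port_conflicts(sample["items"]).items()):
    print(f"port {port}: {', '.join(names)}")
```

With a real export, `sample` would be replaced by the result of `json.load` on the exported file. Any port listed with more than one name would be a candidate for the conflict reported above.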
How can I restore the datastore to a state where the LINSTOR controller can start again?
I can provide the output of drbdadm dump if it helps to see what is configured on the nodes.
Ideally, the LINSTOR controller should only log warnings, rather than fail hard, when datastore resources are not mapped properly, so that the API stays accessible. Otherwise, this becomes a real problem in production, where it can cause long-lasting outages. I’d rather have a few volumes not working than the whole system down.