Linstor controller crashes

Hi,
my LINSTOR controller suddenly went down after working fine in my lab environment for the past 4 weeks.

I deployed LINSTOR on Kubernetes using Piraeus.

I’m not sure what needs to be done to restore it at this point.

The logs are:

$ kubectl logs -n piraeus-datastore linstor-controller-65b6659b8f-8q77z
time="2024-12-06T02:42:30Z" level=info msg="running k8s-await-election" version=refs/tags/v0.4.1
time="2024-12-06T02:42:30Z" level=info msg="no status endpoint specified, will not be created"
I1206 02:42:30.421123       1 leaderelection.go:250] attempting to acquire leader lease piraeus-datastore/linstor-controller...
I1206 02:42:30.801558       1 leaderelection.go:260] successfully acquired lease piraeus-datastore/linstor-controller
time="2024-12-06T02:42:30Z" level=info msg="long live our new leader: 'linstor-controller-65b6659b8f-8q77z'!"
time="2024-12-06T02:42:30Z" level=info msg="starting command '/usr/bin/piraeus-entry.sh' with arguments: '[startController]'"
LINSTOR, Module Controller
Version:            1.29.2 (372c916b7d97fa10e8ea480b66ea3da665ab5849)
Build time:         2024-11-05T11:22:22+00:00 Log v2
Java Version:       17
Java VM:            Debian, Version 17.0.13+11-Debian-2deb12u1
Operating system:   Linux, Version 6.6.58-talos
Environment:        amd64, 96 processors, 30688 MiB memory reserved for allocations


System components initialization in progress

Loading configuration file "/etc/linstor/linstor.toml"
2024-12-06 02:42:31.454 [main] INFO  LINSTOR/Controller/ffffff SYSTEM - ErrorReporter DB version 1 found.
2024-12-06 02:42:31.455 [main] INFO  LINSTOR/Controller/ffffff SYSTEM - Log directory set to: '/var/log/linstor-controller'
2024-12-06 02:42:31.480 [main] INFO  LINSTOR/Controller/ffffff SYSTEM - Database type is Kubernetes-CRD
2024-12-06 02:42:31.480 [Main] INFO  LINSTOR/Controller/ffffff SYSTEM - Loading API classes started.
2024-12-06 02:42:31.834 [Main] INFO  LINSTOR/Controller/ffffff SYSTEM - API classes loading finished: 354ms
2024-12-06 02:42:31.834 [Main] INFO  LINSTOR/Controller/ffffff SYSTEM - Dependency injection started.
2024-12-06 02:42:31.845 [Main] INFO  LINSTOR/Controller/ffffff SYSTEM - Attempting dynamic load of extension module "com.linbit.linstor.modularcrypto.FipsCryptoModule"
2024-12-06 02:42:31.845 [Main] INFO  LINSTOR/Controller/ffffff SYSTEM - Extension module "com.linbit.linstor.modularcrypto.FipsCryptoModule" is not installed
2024-12-06 02:42:31.845 [Main] INFO  LINSTOR/Controller/ffffff SYSTEM - Attempting dynamic load of extension module "com.linbit.linstor.modularcrypto.JclCryptoModule"
2024-12-06 02:42:31.855 [Main] INFO  LINSTOR/Controller/ffffff SYSTEM - Dynamic load of extension module "com.linbit.linstor.modularcrypto.JclCryptoModule" was successful
2024-12-06 02:42:31.855 [Main] INFO  LINSTOR/Controller/ffffff SYSTEM - Attempting dynamic load of extension module "com.linbit.linstor.spacetracking.ControllerSpaceTrackingModule"
2024-12-06 02:42:31.856 [Main] INFO  LINSTOR/Controller/ffffff SYSTEM - Dynamic load of extension module "com.linbit.linstor.spacetracking.ControllerSpaceTrackingModule" was successful
2024-12-06 02:42:32.535 [Main] INFO  LINSTOR/Controller/ffffff SYSTEM - Dependency injection finished: 701ms
2024-12-06 02:42:32.535 [Main] INFO  LINSTOR/Controller/ffffff SYSTEM - Cryptography provider: Using default cryptography module
2024-12-06 02:42:32.765 [Main] INFO  LINSTOR/Controller/ffffff SYSTEM - Initializing authentication subsystem
2024-12-06 02:42:32.992 [Main] INFO  LINSTOR/Controller/ffffff SYSTEM - SpaceTrackingService: Instance added as a system service
2024-12-06 02:42:32.993 [Main] INFO  LINSTOR/Controller/ffffff SYSTEM - Starting service instance 'TimerEventService' of type TimerEventService
2024-12-06 02:42:32.993 [Main] INFO  LINSTOR/Controller/ffffff SYSTEM - Initializing the k8s crd database connector
2024-12-06 02:42:32.993 [Main] INFO  LINSTOR/Controller/ffffff SYSTEM - Kubernetes-CRD connection URL is "k8s"
2024-12-06 02:42:35.257 [Main] INFO  LINSTOR/Controller/ffffff SYSTEM - Starting service instance 'K8sCrdDatabaseService' of type K8sCrdDatabaseService
2024-12-06 02:42:35.268 [Main] INFO  LINSTOR/Controller/ffffff SYSTEM - Security objects load from database is in progress
2024-12-06 02:42:36.741 [Main] INFO  LINSTOR/Controller/ffffff SYSTEM - Security objects load from database completed
2024-12-06 02:42:36.741 [Main] INFO  LINSTOR/Controller/ffffff SYSTEM - Core objects load from database is in progress
2024-12-06 02:42:37.028 [Main] ERROR LINSTOR/Controller/ffffff SYSTEM - Unknown error during loading data from DB [Report number 67526497-00000-000000]

2024-12-06 02:42:37.030 [Thread-2] INFO  LINSTOR/Controller/e2f234 SYSTEM - Shutdown in progress
2024-12-06 02:42:37.031 [Thread-2] INFO  LINSTOR/Controller/e2f234 SYSTEM - Shutting down service instance 'EbsStatusPoll' of type EbsStatusPoll
2024-12-06 02:42:37.031 [Thread-2] INFO  LINSTOR/Controller/e2f234 SYSTEM - Waiting for service instance 'EbsStatusPoll' to complete shutdown
2024-12-06 02:42:37.031 [Thread-2] INFO  LINSTOR/Controller/e2f234 SYSTEM - Shutting down service instance 'ScheduleBackupService' of type ScheduleBackupService
2024-12-06 02:42:37.031 [Thread-2] INFO  LINSTOR/Controller/e2f234 SYSTEM - Waiting for service instance 'ScheduleBackupService' to complete shutdown
2024-12-06 02:42:37.031 [Thread-2] INFO  LINSTOR/Controller/e2f234 SYSTEM - Shutting down service instance 'SpaceTrackingService' of type SpaceTrackingService
2024-12-06 02:42:37.032 [Thread-2] INFO  LINSTOR/Controller/e2f234 SYSTEM - Waiting for service instance 'SpaceTrackingService' to complete shutdown
2024-12-06 02:42:37.032 [Thread-2] INFO  LINSTOR/Controller/e2f234 SYSTEM - Shutting down service instance 'TaskScheduleService' of type TaskScheduleService
2024-12-06 02:42:37.032 [Thread-2] INFO  LINSTOR/Controller/e2f234 SYSTEM - Waiting for service instance 'TaskScheduleService' to complete shutdown
2024-12-06 02:42:37.032 [Thread-2] INFO  LINSTOR/Controller/e2f234 SYSTEM - Shutting down service instance 'K8sCrdDatabaseService' of type K8sCrdDatabaseService
2024-12-06 02:42:37.035 [Thread-2] INFO  LINSTOR/Controller/e2f234 SYSTEM - Waiting for service instance 'K8sCrdDatabaseService' to complete shutdown
2024-12-06 02:42:37.035 [Thread-2] INFO  LINSTOR/Controller/e2f234 SYSTEM - Shutting down service instance 'TimerEventService' of type TimerEventService
2024-12-06 02:42:37.035 [Thread-2] INFO  LINSTOR/Controller/e2f234 SYSTEM - Waiting for service instance 'TimerEventService' to complete shutdown
2024-12-06 02:42:37.035 [Thread-2] INFO  LINSTOR/Controller/e2f234 SYSTEM - Shutdown complete
time="2024-12-06T02:42:37Z" level=fatal msg="failed to run" err="exit status 199"

That sounds like there is corruption in the CRDs LINSTOR uses to store its state.

You might be able to find that error report on the system that hosted the LINSTOR controller container:

find /var/lib/kubelet/ -iname '*67526497-00000-000000*' -exec cat {} \;

If you can find that, it might contain clues as to what is going on, but it sounds like this is a “restore CRDs from backup” situation.

Yeah, I had to clean everything on the K8s cluster and reinstall LINSTOR from scratch. I tried deleting a few K8s CRs but couldn’t find exactly which one was problematic.

I didn’t have the time to look on the host to see if there was a report file available. I’ll do that next time.

Fortunately, this was only a lab environment, but obviously this won’t be acceptable for any production workloads. I really hope we can get to the root of it if it happens again.

Is this the first time you’ve heard of this? How production ready is the Kubernetes datastore backend for LINSTOR?

It’s not the first time I’ve heard of this, but it’s one of maybe three cases that come to mind.

A larger-scale production user encountered a similar situation around the time LINSTOR first moved from an etcd-backed to a CRD-backed database. That situation, if I recall correctly, was due to conflicting entries in LINSTOR’s database after restoring a k8s cluster backup, and was ultimately resolvable without resorting to a LINSTOR database backup.

LINSTOR does take and store backups of its database in Kubernetes secrets before upgrading, but we definitely recommend backing up LINSTOR’s CRDs more regularly than that.

If you don’t have some dedicated backup solution in Kubernetes in mind, there are some helpful commands in the upstream Piraeus project’s docs that could be used in a “manual” backup / restore process.
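
As a rough illustration of what such a “manual” backup could look like, here is a minimal sketch. The function names and the backup directory are made up for this example, and it assumes LINSTOR’s internal resources live in the `internal.linstor.linbit.com` API group, so check the output of `kubectl get crds` on your cluster before relying on it.

```shell
#!/bin/sh
# Sketch of a manual backup of LINSTOR's CRD-backed database.
# Assumption: LINSTOR's internal CRDs belong to this API group.
LINSTOR_GROUP="internal.linstor.linbit.com"

# Hypothetical helper: reads CRD names on stdin (one per line) and
# prints only those that belong to LINSTOR's internal API group.
filter_linstor_crds() {
    grep -o "[^/ ]*\.${LINSTOR_GROUP}"
}

# Hypothetical helper: dumps the CRD definitions and every resource of
# each LINSTOR CRD as YAML files into the directory given as $1.
backup_linstor_crds() {
    backup_dir="$1"
    mkdir -p "$backup_dir" || return 1

    crds=$(kubectl get crds -o name | filter_linstor_crds)
    if [ -z "$crds" ]; then
        echo "no LINSTOR CRDs found" >&2
        return 1
    fi

    # Save the CRD definitions themselves (word splitting is intended).
    kubectl get crds $crds -o yaml > "$backup_dir/crds.yaml" || return 1

    # Save the resources stored under each CRD, one file per CRD.
    for crd in $crds; do
        kubectl get "$crd" -o yaml > "$backup_dir/$crd.yaml" || return 1
    done
}
```

After sourcing this, something like `backup_linstor_crds ./linstor-backup` would write one YAML file per CRD. For a consistent snapshot you would probably want the LINSTOR controller stopped (for example, scaled down) while it runs, and restoring is a separate step that the Piraeus docs describe.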

Hopefully you don’t encounter this again, but if you do, definitely report back.