Thanks for your prompt reply.
Fortunately, I was able to restore the LINSTOR controllers within 2 hours of the outage.
Upon further investigation, I found that there were multiple DRBD resource definitions attempting to use the same TCP port (7077), which led to the controller crash. Specifically, the following resource definitions were involved:
{
  "apiVersion": "internal.linstor.linbit.com/v1-27-1",
  "kind": "LayerDrbdResourceDefinitions",
  "metadata": {
    "creationTimestamp": "2024-08-27T16:48:18Z",
    "generation": 1,
    "name": "c1c9d57513a2ed77a149efe7aa74b2a0155954b5c0c0da0cc0e82d6a1d6e7126",
    "uid": "a9cee388-eb77-4b0a-b421-7861ff475eb9"
  },
  "spec": {
    "al_stripe_size": 32,
    "al_stripes": 1,
    "peer_slots": 7,
    "resource_name": "PVC-00A767E6-18BF-477D-8595-CDB9BA606B36",
    "resource_name_suffix": "",
    "secret": "yui4uCFI4Ke4wD8QIpvI",
    "snapshot_name": "",
    "tcp_port": 7077,
    "transport_type": "IP"
  }
}
{
  "apiVersion": "internal.linstor.linbit.com/v1-27-1",
  "kind": "LayerDrbdResourceDefinitions",
  "metadata": {
    "creationTimestamp": "2024-08-27T16:47:37Z",
    "generation": 1,
    "name": "54a30c138618a3fe355253cae97291b0d4457f365b9754278759789bd630518c",
    "uid": "6385ca3a-8e4f-46c3-b285-34332e3779fb"
  },
  "spec": {
    "al_stripe_size": 32,
    "al_stripes": 1,
    "peer_slots": 7,
    "resource_name": "PVC-90F08CA1-DDE8-4A82-9794-D50F411F8712",
    "resource_name_suffix": "",
    "secret": "63zoFVtp1aZkLHGNVVLv",
    "snapshot_name": "",
    "tcp_port": 7077,
    "transport_type": "IP"
  }
}
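For anyone hitting the same crash, here is a small Python sketch (my own illustration, not part of any LINSTOR tooling) of how such conflicts can be detected: given the exported LayerDrbdResourceDefinitions documents (e.g. from kubectl get ... -o json), it groups them by tcp_port and reports any port claimed by more than one resource:

```python
from collections import defaultdict

def find_port_conflicts(docs):
    """Return {tcp_port: [resource names]} for ports used by more than one definition."""
    by_port = defaultdict(list)
    for doc in docs:
        spec = doc.get("spec", {})
        by_port[spec.get("tcp_port")].append(spec.get("resource_name"))
    return {port: names for port, names in by_port.items() if len(names) > 1}

# The two definitions from the crash above, reduced to the relevant fields.
docs = [
    {"spec": {"resource_name": "PVC-00A767E6-18BF-477D-8595-CDB9BA606B36", "tcp_port": 7077}},
    {"spec": {"resource_name": "PVC-90F08CA1-DDE8-4A82-9794-D50F411F8712", "tcp_port": 7077}},
]
print(find_port_conflicts(docs))
```

Running this against the full CRD export would have flagged port 7077 immediately, rather than requiring the controller to crash first.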
After detecting these conflicting resource definitions, I decided to delete them to resolve the issue. However, since there was no straightforward way to delete these entries directly via the LINSTOR controller, I had to search through all LINSTOR CRDs matching the PVC names listed above.
I methodically deleted these records, and after each deletion I attempted to start the LINSTOR controller again. Each time the controller crashed, the new error reports provided additional hints about which CRDs needed to be removed (I created backups of each file in case I deleted the wrong one). This trial-and-error process was time-consuming, but by following the clues from each error report, I was eventually able to delete all the problematic entries and get the controller running again.
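The backup step was what made the trial-and-error safe. As an illustration (again my own sketch, with a hypothetical backup directory, not LINSTOR functionality), the pattern was simply: dump each CRD document to a file before removing it, so any mistaken deletion can be re-applied:

```python
import json
import pathlib
import time

def backup_crd(doc, backup_dir):
    """Write a CRD document to a timestamped JSON file before deleting it from the cluster."""
    backup_dir = pathlib.Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    name = doc["metadata"]["name"]
    path = backup_dir / f"{name}-{int(time.time())}.json"
    path.write_text(json.dumps(doc, indent=2))
    return path
```

After backing up, the actual deletion was done with kubectl against the internal.linstor.linbit.com CRDs; restoring a wrongly deleted entry is then just re-applying the saved JSON.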
Reflecting on the cause of the issue, it seems the data may have been corrupted somehow. Since I'm using the Kubernetes store, all information is stored within Kubernetes itself. I'm concerned about its reliability compared to a dedicated database like PostgreSQL, which I believe offers more robust transactional guarantees in the event of a failure like today's. Ending up in a state like this shouldn't be possible, or at least the controller should be able to recover from it.
How can I prevent such issues from occurring in the future?