Thanks for your prompt reply.
Fortunately, I was able to restore the LINSTOR controllers within 2 hours of the outage.
Upon further investigation, I found that there were multiple DRBD resource definitions attempting to use the same TCP port (7077), which led to the controller crash. Specifically, the following resource definitions were involved:
{
  "apiVersion": "internal.linstor.linbit.com/v1-27-1",
  "kind": "LayerDrbdResourceDefinitions",
  "metadata": {
    "creationTimestamp": "2024-08-27T16:48:18Z",
    "generation": 1,
    "name": "c1c9d57513a2ed77a149efe7aa74b2a0155954b5c0c0da0cc0e82d6a1d6e7126",
    "uid": "a9cee388-eb77-4b0a-b421-7861ff475eb9"
  },
  "spec": {
    "al_stripe_size": 32,
    "al_stripes": 1,
    "peer_slots": 7,
    "resource_name": "PVC-00A767E6-18BF-477D-8595-CDB9BA606B36",
    "resource_name_suffix": "",
    "secret": "yui4uCFI4Ke4wD8QIpvI",
    "snapshot_name": "",
    "tcp_port": 7077,
    "transport_type": "IP"
  }
}
{
  "apiVersion": "internal.linstor.linbit.com/v1-27-1",
  "kind": "LayerDrbdResourceDefinitions",
  "metadata": {
    "creationTimestamp": "2024-08-27T16:47:37Z",
    "generation": 1,
    "name": "54a30c138618a3fe355253cae97291b0d4457f365b9754278759789bd630518c",
    "uid": "6385ca3a-8e4f-46c3-b285-34332e3779fb"
  },
  "spec": {
    "al_stripe_size": 32,
    "al_stripes": 1,
    "peer_slots": 7,
    "resource_name": "PVC-90F08CA1-DDE8-4A82-9794-D50F411F8712",
    "resource_name_suffix": "",
    "secret": "63zoFVtp1aZkLHGNVVLv",
    "snapshot_name": "",
    "tcp_port": 7077,
    "transport_type": "IP"
  }
}
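For anyone hitting the same crash, here is a small Python sketch (my own illustration, not part of any LINSTOR tooling) of how such conflicts can be detected: given the exported LayerDrbdResourceDefinitions documents (e.g. from kubectl get ... -o json), it groups them by tcp_port and reports any port claimed by more than one resource:

```python
from collections import defaultdict

def find_port_conflicts(docs):
    """Return {tcp_port: [resource names]} for ports used by more than one definition."""
    by_port = defaultdict(list)
    for doc in docs:
        spec = doc.get("spec", {})
        by_port[spec.get("tcp_port")].append(spec.get("resource_name"))
    return {port: names for port, names in by_port.items() if len(names) > 1}

# The two definitions from the crash above, reduced to the relevant fields.
docs = [
    {"spec": {"resource_name": "PVC-00A767E6-18BF-477D-8595-CDB9BA606B36", "tcp_port": 7077}},
    {"spec": {"resource_name": "PVC-90F08CA1-DDE8-4A82-9794-D50F411F8712", "tcp_port": 7077}},
]
print(find_port_conflicts(docs))
```

Running this against the full CRD export would have flagged port 7077 immediately, rather than requiring the controller to crash first.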
After detecting these conflicting resource definitions, I decided to delete them to resolve the issue. However, since there was no straightforward way to delete these entries directly via the LINSTOR controller, I had to search through all LINSTOR CRDs matching the PVC names listed above.
I methodically deleted these records, and after each deletion I attempted to start the LINSTOR controller again. Each time the controller crashed, the new error reports provided additional hints about which CRDs needed to be removed (I created backups of each file in case I deleted the wrong one). This trial-and-error process was time-consuming, but by following the clues from each error report, I was eventually able to delete all the problematic entries and get the controller running again.
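The backup step was what made the trial-and-error safe. As an illustration (again my own sketch, with a hypothetical backup directory, not LINSTOR functionality), the pattern was simply: dump each CRD document to a file before removing it, so any mistaken deletion can be re-applied:

```python
import json
import pathlib
import time

def backup_crd(doc, backup_dir):
    """Write a CRD document to a timestamped JSON file before deleting it from the cluster."""
    backup_dir = pathlib.Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    name = doc["metadata"]["name"]
    path = backup_dir / f"{name}-{int(time.time())}.json"
    path.write_text(json.dumps(doc, indent=2))
    return path
```

After backing up, the actual deletion was done with kubectl against the internal.linstor.linbit.com CRDs; restoring a wrongly deleted entry is then just re-applying the saved JSON.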
Reflecting on the cause of the issue, it seems the data may have been corrupted somehow. Since I'm using the Kubernetes store, all information is stored within Kubernetes itself. I'm concerned about its reliability compared to a dedicated database like PostgreSQL, which I believe offers more robust transactional guarantees in the event of a failure like today's. Ending up in a state like this shouldn't be possible, or at least the controller should be able to recover from it.
How can I prevent such issues from occurring in the future?