Working LINSTOR/DRBD Proxmox cluster fails to create new volumes

I have a Proxmox production cluster with DRBD/LINSTOR that has been working flawlessly for the past year. Suddenly I cannot create new VMs or add disks to existing VMs.

Adding a new disk to a VM fails with:

update VM 121: -scsi1 linstor_storage:40,iothread=on

  Trying to create diskful resource (pm-ca42ac92) on (wirt23a).
  Diskfull assignment on wirt23a failed, let's autoplace it.
TASK ERROR: API Return-Code: 500. Message:
  Could not autoplace resource pm-ca42ac92, because:
     read timeout at /usr/share/perl5/Net/HTTP/ line 274.
     at /usr/share/perl5/PVE/Storage/Custom/ line 433.

I searched this forum and found the thread "Cannot create new VM - fresh install". My resource groups look like this:

wirt23a:~# linstor rg l
┊ ResourceGroup ┊ SelectFilter                ┊ VlmNrs ┊ Description ┊
┊ DfltRscGrp    ┊ PlaceCount: 2               ┊        ┊             ┊
┊ pve-rg        ┊ PlaceCount: 2               ┊ 0      ┊             ┊
┊               ┊ StoragePool(s): pve-storage ┊        ┊             ┊

Spawning a new resource from the resource group fails as well:

wirt23a:~# linstor rg spawn pve-rg testres_0 1GiB
Error: Socket timeout, no data received for more than 300s.

After the 5-minute timeout, the resource shows up in the volume list:

wirt23a:~# linstor v l -r testres_0
┊ Node    ┊ Resource  ┊ StoragePool ┊ VolNr ┊ MinorNr ┊ DeviceName    ┊ Allocated ┊ InUse  ┊    State ┊
┊ wirt23a ┊ testres_0 ┊ pve-storage ┊     0 ┊    1059 ┊ None          ┊           ┊        ┊  Unknown ┊
┊ wirt23b ┊ testres_0 ┊ pve-storage ┊     0 ┊    1059 ┊ /dev/drbd1059 ┊   315 KiB ┊ Unused ┊ UpToDate ┊

On wirt23b I see:

wirt23b:~# linstor  r l -r testres_0
┊ ResourceName ┊ Node    ┊ Port ┊ Usage  ┊ Conns               ┊    State ┊ CreatedOn           ┊
┊ testres_0    ┊ wirt23a ┊ 7059 ┊        ┊                     ┊  Unknown ┊                     ┊
┊ testres_0    ┊ wirt23b ┊ 7059 ┊ Unused ┊ Connecting(wirt23a) ┊ UpToDate ┊ 2024-11-29 14:06:21 ┊

drbdadm status shows the resource as Connecting on wirt23b, but on wirt23a it does not list the resource at all.

Does anyone have any hints on how to solve this issue?
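
To spot replicas in this state without reading the tables by eye, the pasted listings can be filtered by their State column. The following is only a minimal sketch: the helper names are hypothetical, it works on the human-readable table text exactly as shown above (not on any LINSTOR API), and it assumes the columns are separated by the `┊` character.

```python
def parse_linstor_table(text):
    """Parse a pasted `linstor ... list` table into a list of dicts.

    Assumption: every table line contains the '┊' separator and the
    first such line is the header, as in the listings in this post.
    """
    rows, header = [], None
    for line in text.splitlines():
        if "┊" not in line:
            continue  # skip shell prompts, borders and blank lines
        cells = [c.strip() for c in line.strip().strip("┊").split("┊")]
        if header is None:
            header = cells
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def not_up_to_date(rows):
    """Return (node, state) pairs for rows whose State is not UpToDate."""
    return [(r.get("Node"), r.get("State"))
            for r in rows if r.get("State") != "UpToDate"]


# Sample copied from the `linstor v l -r testres_0` output above.
SAMPLE = """\
┊ Node    ┊ Resource  ┊ StoragePool ┊ VolNr ┊ MinorNr ┊ DeviceName    ┊ Allocated ┊ InUse  ┊    State ┊
┊ wirt23a ┊ testres_0 ┊ pve-storage ┊     0 ┊    1059 ┊ None          ┊           ┊        ┊  Unknown ┊
┊ wirt23b ┊ testres_0 ┊ pve-storage ┊     0 ┊    1059 ┊ /dev/drbd1059 ┊   315 KiB ┊ Unused ┊ UpToDate ┊
"""

if __name__ == "__main__":
    print(not_up_to_date(parse_linstor_table(SAMPLE)))
```

For the sample above this reports wirt23a with state Unknown, which matches what the table shows.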

PS: The nodes seem happy:

wirt23b:~# linstor  n l
┊ Node    ┊ NodeType ┊ Addresses                ┊ State  ┊
┊ wirt23a ┊ COMBINED ┊ (PLAIN) ┊ Online ┊
┊ wirt23b ┊ COMBINED ┊ (PLAIN) ┊ Online ┊

I rebooted wirt23a, and after that the resources synced.

I then removed the test resource and recreated it, and now it works…

Sorry for the hassle. I have no idea what was wrong… I hope it stays this way…

The only difference from a test cluster that I can see is that the linstor_db volume is not connected on my prod cluster:

wirt23a:~# linstor v l -r linstor_db 
┊ Node    ┊ Resource   ┊ StoragePool ┊ VolNr ┊ MinorNr ┊ DeviceName    ┊ Allocated ┊ InUse ┊    State ┊
┊ wirt23a ┊ linstor_db ┊ pool_md1    ┊     0 ┊    1000 ┊ /dev/drbd1000 ┊   204 MiB ┊ InUse ┊ UpToDate ┊
┊ wirt23b ┊ linstor_db ┊ pool_md1    ┊     0 ┊    1000 ┊ /dev/drbd1000 ┊   204 MiB ┊ InUse ┊ UpToDate ┊

wirt23a:~# drbdadm status linstor_db
linstor_db role:Primary
  disk:UpToDate open:yes
  wirt23b connection:Connecting

And on wirt23b:

wirt23b:~# drbdadm status linstor_db
linstor_db role:Primary
  disk:UpToDate open:yes
  wirt23a connection:StandAlone
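
The two outputs above can be compared side by side with a small script. This is a hedged sketch, not a diagnosis tool: the function names are hypothetical, and the parsing assumes the `drbdadm status` layout exactly as pasted here (resource line with `role:`, indented peer lines with `connection:`). That both nodes report Primary while neither reports a connected peer may point to a split brain, but that interpretation is an assumption that would need checking against the kernel log.

```python
import re


def local_role(status_text):
    """Return the local role from the unindented 'role:' line, or None."""
    m = re.search(r"^\S+\s+role:(\S+)", status_text, re.MULTILINE)
    return m.group(1) if m else None


def peer_connections(status_text):
    """Return {peer: state} for indented lines carrying 'connection:'."""
    peers = {}
    for line in status_text.splitlines():
        m = re.match(r"\s+(\S+)\s+connection:(\S+)", line)
        if m:
            peers[m.group(1)] = m.group(2)
    return peers


def summarize(status_text):
    """Combine local role and peer connection states into one dict."""
    return {"role": local_role(status_text),
            "peers": peer_connections(status_text)}


# Samples copied from the two `drbdadm status linstor_db` outputs above.
ON_WIRT23A = """\
linstor_db role:Primary
  disk:UpToDate open:yes
  wirt23b connection:Connecting
"""

ON_WIRT23B = """\
linstor_db role:Primary
  disk:UpToDate open:yes
  wirt23a connection:StandAlone
"""

if __name__ == "__main__":
    for node, text in (("wirt23a", ON_WIRT23A), ("wirt23b", ON_WIRT23B)):
        print(node, summarize(text))
```

With the samples above, both nodes come out as Primary, with wirt23a waiting in Connecting and wirt23b in StandAlone.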