Linstor DB/Controller HA

Hi, my test environment consists of:

3 hosts running Ubuntu 22.04 and LINSTOR version 1.30.2…

I am completely new to LINSTOR, and this is my first time configuring Controller HA…

Please help. I had to restart host1, which was the primary. For testing purposes I disconnected the HA LAN port on host1, then connected the HA LAN back, but now I see this message on the host1 console: “drbd linstor_db/0 drbd1001 host2: I shall become SyncTarget, but I am Primary”

On host 1:

Jan 15 15:31:47 host1 kernel: drbd linstor_db/0 drbd1001 host2: uuid_compare()=target-use-bitmap by rule=bitmap-peer
Jan 15 15:31:47 host1 kernel: drbd linstor_db/0 drbd1001 host2: I shall become SyncTarget, but I am primary!
Jan 15 15:31:47 host1 kernel: drbd linstor_db: Aborting cluster-wide state change 3045118974 (20ms) rv = -28
Jan 15 15:31:47 host1 kernel: drbd linstor_db: Preparing cluster-wide state change 2942904539: 0->2 role( Primary ) conn( Connected )
Jan 15 15:31:47 host1 kernel: drbd linstor_db/0 drbd1001 host3: self E2E4AEF6E3609C9E:0000000000000000:0000000000000000:0000000000000000 bits:0 flags:0
Jan 15 15:31:47 host1 kernel: drbd linstor_db/0 drbd1001 host3: peer's exposed UUID: B35C6B4E92FE7D26
Jan 15 15:31:47 host1 kernel: drbd linstor_db: Declined by peer host3 (id: 2), see the kernel log there
Jan 15 15:31:47 host1 kernel: drbd linstor_db: Aborting cluster-wide state change 2942904539 (24ms) rv = -10
Jan 15 15:31:48 host1 kernel: drbd linstor_db: Preparing cluster-wide state change 2679477677: 0->1 role( Primary ) conn( Connected )
Jan 15 15:31:48 host1 kernel: drbd linstor_db/0 drbd1001 host2: drbd_sync_handshake:
Jan 15 15:31:48 host1 kernel: drbd linstor_db/0 drbd1001 host2: self E2E4AEF6E3609C9E:0000000000000000:0000000000000000:0000000000000000 bits:0 flags:520
Jan 15 15:31:48 host1 kernel: drbd linstor_db/0 drbd1001 host2: peer B35C6B4E92FE7D27:E2E4AEF6E3609C9E:0000000000000000:0000000000000000 bits:29 flags:1120

on host2:

Jan 15 15:32:57 host2 kernel: drbd linstor_db: Preparing remote state change 3226928029: 0->1 role( Primary ) conn( Connected )
Jan 15 15:32:57 host2 kernel: drbd linstor_db: State change failed: Multiple primaries not allowed by config (-1)
Jan 15 15:32:57 host2 kernel: drbd linstor_db/0 drbd1001 host1: drbd_sync_handshake:
Jan 15 15:32:57 host2 kernel: drbd linstor_db/0 drbd1001 host1: self B35C6B4E92FE7D27:E2E4AEF6E3609C9E:0000000000000000:0000000000000000 bits:29 flags:120
Jan 15 15:32:57 host2 kernel: drbd linstor_db/0 drbd1001 host1: peer E2E4AEF6E3609C9E:0000000000000000:0000000000000000:0000000000000000 bits:0 flags:1520
Jan 15 15:32:57 host2 kernel: drbd linstor_db/0 drbd1001 host1: uuid_compare()=source-use-bitmap by rule=bitmap-self
Jan 15 15:32:57 host2 kernel: drbd linstor_db host1: Aborting remote state change 3226928029
Jan 15 15:32:58 host2 kernel: drbd linstor_db: Preparing remote state change 2473902529: 0->2 role( Primary ) conn( Connected )
Jan 15 15:32:58 host2 kernel: drbd linstor_db host3: Aborting remote state change 2473902529
Jan 15 15:32:58 host2 kernel: drbd linstor_db: Preparing remote state change 1893362560: 0->1 role( Primary ) conn( Connected )
Jan 15 15:32:58 host2 kernel: drbd linstor_db: State change failed: Multiple primaries not allowed by config (-1)
Jan 15 15:32:58 host2 kernel: drbd linstor_db/0 drbd1001 host1: drbd_sync_handshake:
Jan 15 15:32:58 host2 kernel: drbd linstor_db/0 drbd1001 host1: self B35C6B4E92FE7D27:E2E4AEF6E3609C9E:0000000000000000:0000000000000000 bits:29 flags:120
Jan 15 15:32:58 host2 kernel: drbd linstor_db/0 drbd1001 host1: peer E2E4AEF6E3609C9E:0000000000000000:0000000000000000:0000000000000000 bits:0 flags:1520
Jan 15 15:32:58 host2 kernel: drbd linstor_db/0 drbd1001 host1: uuid_compare()=source-use-bitmap by rule=bitmap-self
Jan 15 15:32:58 host2 kernel: drbd linstor_db host1: Aborting remote state change 1893362560
Jan 15 15:32:59 host2 kernel: drbd linstor_db: Preparing remote state change 1498093589: 0->2 role( Primary ) conn( Connected )
Jan 15 15:32:59 host2 kernel: drbd linstor_db host3: Aborting remote state change 1498093589

on host3:

#journalctl -xe
Jan 15 15:05:41 host3 kernel: drbd linstor_db host1: Aborting remote state change 2661701913
Jan 15 15:05:42 host3 kernel: drbd linstor_db: Preparing remote state change 825716938: 0->1 role( Primary ) conn( Connected )
Jan 15 15:05:42 host3 kernel: drbd linstor_db host2: Aborting remote state change 825716938
Jan 15 15:05:42 host3 kernel: drbd linstor_db: Preparing remote state change 669543369: 0->2 role( Primary ) conn( Connected )
Jan 15 15:05:42 host3 kernel: drbd linstor_db/0 drbd1001 host1: my exposed UUID: B35C6B4E92FE7D26
Jan 15 15:05:42 host3 kernel: drbd linstor_db/0 drbd1001 host1: peer E2E4AEF6E3609C9E:0000000000000000:0000000000000000:0000000000000000 bits:0 flags:1520
Jan 15 15:05:42 host3 kernel: drbd linstor_db/0 drbd1001 host1: Downgrading joining peer's disk as its data is older

root@host3:/# drbdadm status
linstor_db role:Secondary
  disk:Diskless open:no
  host1 connection:Connecting
  host2 role:Primary
    peer-disk:UpToDate

root@host2:~# drbdadm status
linstor_db role:Primary
  disk:UpToDate open:yes
  host1 connection:Connecting
  host3 role:Secondary
    peer-disk:Diskless

root@host1:~# drbdadm status
linstor_db role:Primary
  disk:UpToDate quorum:no open:no
  host2 connection:Connecting
  host3 connection:Connecting

DRBD is not connecting, but I am able to ping between all 3 hosts.

Please help.

Thanks in advance

Looks like you’ve split-brained the linstor_db resource. DRBD’s quorum options should have prevented this. Did you configure the resource-group, before spawning the linstor_db resource, as outlined here? https://linbit.com/drbd-user-guide/linstor-guide-1_0-en/#s-linstor_ha-creating-res-group-for-ha-database-storage
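For reference, the quorum-related setup from that section of the guide looks roughly like the following. This is a sketch from memory with a placeholder storage-pool name and the guide's example group name, so treat the linked guide as authoritative:

# placeholder pool/group names -- adjust to your setup
linstor resource-group create --storage-pool <your-pool> --place-count 3 --diskless-on-remaining true linstor-db-grp
linstor resource-group drbd-options --auto-promote=no --quorum=majority --on-no-quorum=io-error --on-suspended-primary-outdated=force-secondary linstor-db-grp
linstor resource-group spawn-resources linstor-db-grp linstor_db 200M

With quorum=majority and on-no-quorum=io-error set, a node that loses its replication link (host1 in your test) should lose quorum and stop accepting writes rather than diverging from the other two.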

You’ll likely need to manually repair this split-brain before proceeding. If you have drbd-utils installed on the nodes, you can follow the instructions here: https://linbit.com/drbd-user-guide/drbd-guide-9_0-en/#s-resolve-split-brain Please note you may need to use drbdadm secondary --force due to the loss of quorum.
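Very roughly, and assuming host1 is the node whose changes you want to discard (the UUID/bitmap output above suggests host2 has the newer data), those steps would look something like:

# on host1, the split-brain victim:
drbdadm secondary linstor_db        # add --force if demotion is refused due to lost quorum
drbdadm disconnect linstor_db
drbdadm connect --discard-my-data linstor_db

# on host2, the survivor, only if it reports StandAlone:
drbdadm connect linstor_db

host1 should then reconnect as SyncTarget and resync from host2.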
If you don’t wish to manually resolve the split-brain, I would just reboot host1 to force it secondary, then delete and recreate the resource on host1.
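If you go that route, the recreate step would look something like the following, assuming the LINSTOR controller is currently active (for example, failed over to host2 via drbd-reactor) and using a placeholder storage-pool name:

# run from wherever the linstor client can reach the active controller
linstor resource delete host1 linstor_db
linstor resource create host1 linstor_db --storage-pool <your-pool>

LINSTOR should then sync the fresh replica on host1 from host2's up-to-date data.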

Hi, thank you very much for your reply… Yes, I did follow that same doc/link to set up the resource. All was working fine; I just created a possible failure scenario to see what happens if the primary node disconnects. I will go through the repair instructions…

It would be of great help if there is any document covering possible scenarios/situations: how to test the 3-node (Primary/Secondary/Diskless) cluster and understand the troubleshooting process to repair and recover from any situation without data loss.

Thanks in advance…