Is this a recommended setup:
RHEL 9-based servers.
3x Dell servers; the hardware RAID controller presents a 20 TB virtual disk (/dev/sdb) on each server.
DRBD runs on top of that RAID virtual disk, configured with:
resource kvm {
    device minor 1;
    disk "/dev/sdb";    # backing device: the RAID virtual disk; DRBD exposes it as /dev/drbd1
    meta-disk internal;
    net {
        protocol C;
        allow-two-primaries yes;
    }
    options {
        auto-promote no;
        quorum majority;
    }
    on lab-kvm-01 {
        address 10.0.0.1:7788;
        node-id 1;
    }
    on lab-kvm-02 {
        address 10.0.0.2:7788;
        node-id 2;
    }
    on lab-kvm-03 {
        address 10.0.0.3:7788;
        node-id 3;
    }
    connection-mesh {
        hosts lab-kvm-01 lab-kvm-02 lab-kvm-03;
    }
}
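As a sanity check, the parsed resource can be printed back out, which is handy for diffing the config across the three nodes:
sudo drbdadm dump kvm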
sudo drbdadm create-md kvm
sudo drbdadm up kvm
sudo drbdadm primary --force kvm
sudo systemctl enable --now drbd@kvm.target
sudo drbdadm status
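When everything is healthy, status on a primary looks roughly like this (from memory; exact fields vary by DRBD version):
kvm role:Primary
  disk:UpToDate
  lab-kvm-02 role:Secondary
    peer-disk:UpToDate
  lab-kvm-03 role:Secondary
    peer-disk:UpToDate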
LVM PV created on top of /dev/drbd1
sudo pvcreate /dev/drbd1
LVM VG created
sudo vgcreate storage /dev/drbd1
LVM thin-pool LV created
sudo lvcreate -l 100%FREE -T storage/lvthinpool
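The pool plus its hidden data/metadata volumes can be confirmed with:
sudo lvs -a storage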
Finally, each VM is a separate LV within the thinpool.
sudo lvcreate -V 50G -T storage/lvthinpool -n lab-vmserver-01
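For context, each VM consumes its thin LV directly as a virtio block device; a rough sketch of how one gets created (memory, vCPU count, and ISO path are placeholders):
sudo virt-install --name lab-vmserver-01 \
    --memory 8192 --vcpus 4 \
    --disk path=/dev/storage/lab-vmserver-01,bus=virtio \
    --cdrom /var/lib/libvirt/images/rhel9.iso \
    --os-variant rhel9.0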
Live migrations work. Disk space is easily increased as needed for the VMs. Everything “just works”. Great setup; I really like how DRBD keeps the backend block devices synced.
Now comes the problem: rebooting a host. After running for days/weeks without issues, I want to patch and reboot a host. I manually migrate all the VMs over to the -01 node and am then ready to reboot the -03 node. The drain beforehand is nothing fancy; roughly, on -03 it looks like the sketch below (the destination URI is illustrative):
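for vm in $(sudo virsh list --name); do
    sudo virsh migrate --live --persistent "$vm" qemu+ssh://lab-kvm-01/system
done
sudo drbdadm secondary kvm    # demote before the reboot
sudo systemctl reboot
When -03 comes back up, I manually check and bring everything back online: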
sudo drbdadm primary --force kvm
sudo vgchange -ay storage
sudo lvscan
We can see all the LVs, all marked ACTIVE and ready for the VMs to be failed back over from the -01 node. However, here’s the issue: the -03 node isn’t in a Connected state. Instead, DRBD shows:
sudo drbdadm status
kvm role:Primary
  disk:UpToDate open:no
  lab-kvm-01 connection:Connecting
  lab-kvm-02 connection:Connecting
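The obvious first suspicion is the transport itself, so for what it’s worth, checks along these lines cover the basics (7788 being the replication port from the config):
sudo ss -tlnp | grep 7788    # is DRBD listening locally on -03?
nc -zv 10.0.0.1 7788         # are the peers reachable?
sudo firewall-cmd --list-all # did 7788/tcp survive the reboot?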
All the online tricks to get -03 out of the Connecting state and back into the Connected state fail. That is, on the -03 node I’ve tried:
sudo drbdadm secondary kvm
sudo drbdadm disconnect kvm
sudo drbdadm connect --discard-my-data kvm
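The kernel log on -03 is presumably where the real reason for the stalled handshake shows up; e.g.:
sudo dmesg | grep -i drbd    # kernel-side connection messages
sudo journalctl -b | grep -i drbd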
Anybody have any thoughts on all this? Is this a completely wrong way to implement it, even though it worked fine for weeks? Thank you in advance.