KVM libvirt qemu LVM - Recommended Setup

Is this a recommended setup:

RHEL 9-based servers.
3x Dell servers; on each, a hardware RAID controller presents a 20 TB virtual disk as /dev/sdb.
DRBD runs on top of that RAID-backed /dev/sdb, configured with:

resource kvm {
    device      minor 1;
    disk        "/dev/sdb";
    meta-disk   internal;

    net {
        protocol C;
        allow-two-primaries yes;
    }

    options {
        auto-promote    no;
        quorum          majority;
    }

    on lab-kvm-01 {
      address 10.0.0.1:7788;
      node-id 1;
    }
    on lab-kvm-02 {
      address 10.0.0.2:7788;
      node-id 2;
    }
    on lab-kvm-03 {
      address 10.0.0.3:7788;
      node-id 3;
    }
    connection-mesh {
      hosts lab-kvm-01 lab-kvm-02 lab-kvm-03;
    }
}
sudo drbdadm create-md kvm
sudo drbdadm up kvm
sudo drbdadm primary kvm --force
sudo systemctl enable --now drbd@kvm.target
sudo drbdadm status
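
(Note: if firewalld were enabled on these hosts, the replication port from the config above, 7788/tcp, would need to be open on every node; a minimal sketch for the default zone:)

sudo firewall-cmd --permanent --add-port=7788/tcp
sudo firewall-cmd --reload
sudo firewall-cmd --list-ports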

LVM PV created on top of /dev/drbd1
sudo pvcreate /dev/drbd1

LVM VG created
sudo vgcreate storage /dev/drbd1

LVM LV thinpool created
sudo lvcreate -l 100%FREE -T storage/lvthinpool

Finally, each VM is a separate LV within the thinpool.
sudo lvcreate -V 50G -T storage/lvthinpool -n lab-vmserver-01
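
Purely as an illustration of how a guest ends up consuming its LV as a raw block device (the memory/CPU sizing and the install ISO path below are placeholders, not my actual build commands):

sudo virt-install \
  --name lab-vmserver-01 \
  --memory 4096 --vcpus 2 \
  --disk path=/dev/storage/lab-vmserver-01,bus=virtio \
  --network network=default \
  --cdrom /var/lib/libvirt/images/rhel9.iso \
  --os-variant rhel9.0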

Live migrations work. Disk space is easily increased as needed for the VMs. Everything “just works”. Great setup, really like how DRBD keeps the backend block device(s) synced.
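
For reference, those two operations come down to roughly the following (the VM name, sizes, and migration URI are just examples):

# live-migrate a guest from one node to another (both see the same /dev/storage/* devices)
sudo virsh migrate --live --persistent --undefinesource lab-vmserver-01 qemu+ssh://lab-kvm-01/system

# grow a guest's disk: extend the thin LV, then let the running guest see the new size
sudo lvextend -L +20G storage/lab-vmserver-01
sudo virsh blockresize lab-vmserver-01 /dev/storage/lab-vmserver-01 70G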

Now comes the problem: rebooting a host. After running for days/weeks without issues, I’d like to patch and reboot a host. I manually migrate all the VMs over to the -01 node and am ready to reboot the -03 node. When -03 comes back up, I manually check and bring everything back online:

sudo drbdadm primary kvm --force
sudo vgchange -ay storage
sudo lvscan

We can see all the LVs, all marked ACTIVE and ready for VMs to be failed over from the -01 node. However, here’s the issue: the -03 node isn’t in a Connected state. Instead, DRBD shows:

sudo drbdadm status

kvm role:Primary
  disk:UpToDate open:no
  lab-kvm-01 connection:Connecting
  lab-kvm-02 connection:Connecting

All the online tricks to get -03 out of the Connecting state and back into a Connected state fail. Meaning, on the -03 node I’ve tried:

sudo drbdadm secondary kvm
sudo drbdadm disconnect kvm 
sudo drbdadm --discard-my-data connect kvm

Anybody have any thoughts on all this? Is this a completely wrong method of implementation, even though it worked fine for weeks? Thank you in advance.

What do the other two nodes show for drbdadm status?

Have you tried tcpdump to see if lab-kvm-03 is attempting to make outbound connections to the other two?

Is there any firewalling in place?

Any relevant logs in dmesg?
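
For example, something along these lines on lab-kvm-03 would show whether it is even attempting to reach its peers (7788 being the port from your resource file):

# watch for outbound connection attempts to the peer nodes
sudo tcpdump -ni any 'tcp port 7788'
# confirm whether a firewall is active at all
sudo firewall-cmd --state && sudo firewall-cmd --list-all
# DRBD logs its connection state changes to the kernel log
sudo journalctl -k -b | grep -i drbd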

Is this a completely wrong method of implementation, even though it worked fine for weeks?

Effectively this is one big storage pool being accessed concurrently by multiple consumers. Are you using clustered LVM and sanlock or similar?

An alternative approach you might want to evaluate is LINSTOR. It creates a separate LVM volume for each VM’s disk and then runs a separate DRBD resource per volume, meaning you can move the Primary role around for each VM individually.
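
Very roughly, and purely as a sketch (syntax from memory — check the LINSTOR user guide; the node addresses are yours, but the pool, VG, and resource-group names here are made up), the LINSTOR flow looks like:

# register the three nodes (the controller runs on one of them)
linstor node create lab-kvm-01 10.0.0.1
linstor node create lab-kvm-02 10.0.0.2
linstor node create lab-kvm-03 10.0.0.3

# point LINSTOR at a local thin pool on each node's RAID disk
# (vg_raid/thinpool is hypothetical and would replace the DRBD-backed VG)
linstor storage-pool create lvmthin lab-kvm-01 vmpool vg_raid/thinpool
linstor storage-pool create lvmthin lab-kvm-02 vmpool vg_raid/thinpool
linstor storage-pool create lvmthin lab-kvm-03 vmpool vg_raid/thinpool

# one DRBD resource per VM disk, replicated to all three nodes
linstor resource-group create vm-group --storage-pool vmpool --place-count 3
linstor volume-group create vm-group
linstor resource-group spawn-resources vm-group lab-vmserver-01 50G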

Other two nodes show:

  disk:UpToDate open:yes
  lab-kvm-02 role:Primary
    peer-disk:UpToDate
  lab-kvm-03 connection:StandAlone

No FW in the mix, and in fact, even though -03 says “Connecting” I can still bring it online, and even live migrate a VM over to it. Crazy! I just can’t get it out of the Connecting state.

No clustered LVM or sanlock needed, because the one-LV-per-VM setup makes sure that only one KVM node is trying to launch/write to /dev/lvmstorage/lab1-vmserver-01 at a time. This setup is described in this old email thread: DRBD and KVM for a HA-Cluster ?

For now, I would really like to stay with this seemingly very simple setup that has “just worked” for weeks. I have even rebooted these servers in the past and got them to come back online properly. Today is the first time it’s gone Connecting, and won’t come out. Weird.

Non-clustered LVM believes it’s the sole user of the disk.

If you tried to update LVM metadata (e.g. create a new logical volume) on two different nodes, or if you created an LV on one node and the other node had some stale cached information, Bad Things™ could happen.

True, understood, it could happen. Unfortunately, the current LVM setup isn’t what’s causing the issue I’m currently having with DRBD stuck in the Connecting state.

On the nodes where DRBD reports StandAlone, try a drbdadm adjust kvm. That should instruct DRBD to switch back to Connecting and attempt to connect to the peer nodes again. If it then switches back to StandAlone, DRBD should clearly log why in the kernel log.

To be clear here: the -03 node is stuck in the Connecting state because the peer nodes are not even listening for it or attempting to connect to it (the StandAlone state). Connecting means DRBD is listening for its peers and actively trying to reach them, but is not yet connected. StandAlone means it is doing neither. DRBD may switch to StandAlone for a few reasons; common examples are a split-brain, different-sized backing disks, or DRBD believing the peer holds completely unrelated data. The reason for the StandAlone state should be clearly logged, though.
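
In other words, on the StandAlone nodes something like the following, and if the connection drops straight back to StandAlone the kernel log should state why:

sudo drbdadm adjust kvm
sudo journalctl -k -b | grep -i drbd | tail -n 50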

Thank you, Devin. I decided to sort of start over and try to reproduce this issue (even with lvmlockd this time), and I was able to. So, going off my original post, but with just a few additional steps for the clustered LVM part:

sudo dnf -y install lvm2-lockd sanlock
sudo sed -i -e 's/# use_lvmlockd = 0/use_lvmlockd = 1/' /etc/lvm/lvm.conf
sudo sed -i -e 's/# host_id = 0/host_id = 1/' /etc/lvm/lvmlocal.conf
# change host_id above per host (2, 3, etc.)
sudo systemctl enable --now lvmlockd
sudo systemctl enable --now sanlock

Then on just the -01 node, create the shared VG:

sudo vgcreate --shared lvmstorage /dev/drbd1
sudo lvcreate -l 100%FREE -T lvmstorage/lvmpool
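
(If I’m reading the lvmlockd docs right, the other two nodes then also need the shared VG’s lockspace started before they can use it, something like:)

sudo vgchange --lockstart lvmstorage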

So, even with the lvmlockd piece all in place, I brought some VMs online on the -01 server and checked to make sure that -02 and -03 were both in the expected state.

On lab-kvm-02:

kvm role:Secondary
  disk:UpToDate open:no
  lab-kvm-01 role:Primary
    peer-disk:UpToDate
  lab-kvm-03 role:Secondary
    peer-disk:UpToDate

On lab-kvm-03:

kvm role:Secondary
  disk:UpToDate open:no
  lab-kvm-01 role:Primary
    peer-disk:UpToDate
  lab-kvm-02 role:Secondary
    peer-disk:UpToDate

Time to test the reboot again, and that’s… I think… when I figured out what I was doing wrong. With all three nodes up and working as expected, I hopped on the -03 host and simply typed sudo reboot. That’s not the right process. I need to stop DRBD on the node before I reboot it; otherwise I get the same results as above, and it takes some extra effort to get it back in sync. So, instead, I did this:

sudo systemctl stop drbd@kvm
sudo reboot

I wasn’t sure why this was needed. But it is. That, or I can manually bring the resource down:

sudo drbdadm down kvm
sudo reboot

Do either of those before the reboot, and then when the -03 comes back online do:

sudo systemctl restart drbd@kvm
sudo drbdadm status

…and so far, after every reboot, everything comes back as UpToDate. Again, it’s odd that this extra manual step is required, especially since the systemd service doesn’t take care of it during reboot, but either way, I’ve updated my documentation and everything seems good to go from here.
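
For reference, the whole patch-day sequence then looks roughly like this (the VM name and migration URI are just examples):

# 1. move the guests off the node being patched
sudo virsh migrate --live --persistent --undefinesource lab-vmserver-01 qemu+ssh://lab-kvm-01/system

# 2. take the DRBD resource down cleanly, then reboot
sudo systemctl stop drbd@kvm    # or: sudo drbdadm down kvm
sudo reboot

# 3. after it comes back, rejoin the mesh and verify
sudo systemctl restart drbd@kvm
sudo drbdadm status             # peers should show Connected / UpToDate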

Damn. Spoke too soon. Come to find out thin pools don’t really work with lvmlockd for this setup (in a shared VG, a thin pool and its LVs can only be activated exclusively, i.e. on one host at a time), so I started over, again. This time I have a bit more of an idea of what’s going on, but I’m not sure if I’m missing a step.

When the -03 node is rebooted, is there something specific I should be running? Nothing in the logs indicates a split brain. Nothing in the logs indicates any sort of issue. I just can’t get -03 out of the Connecting state. Very frustrating, as everything works perfectly up until a reboot.

Check the logs. There might be some very clear reason why it’s stuck in the Connecting state. Otherwise, I would guess a firewall or something of the sort getting re-activated on reboot. It might be worthwhile to stop DRBD and test connectivity to the specific port via netcat if there is no other clear reason.
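
For example (7788 is the port from the resource file; nc flags differ a little between the OpenBSD and nmap variants):

# on lab-kvm-01, with DRBD stopped so the port is free:
nc -l 7788
# on lab-kvm-03, try to reach it:
nc -v 10.0.0.1 7788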