Running here is a three-node cluster of Proxmox VE 8.2 with the LINSTOR/DRBD backend enabled (great, I like it!).
Live migration of VMs from one node to another was successfully tested a few weeks ago.
Now, when I want to migrate a VM (any of them) to another node, the task produces an error that reads:
[node3] kvm: -drive file=/dev/drbd/by-res/vm-101-disk-1/0,if=none,id=drive-scsi0,format=raw,cache=none,aio=io_uring,detect-zeroes=on: Could not open '/dev/drbd/by-res/vm-101-disk-1/0': Wrong medium type
The system log of the source node says:
Oct 14 14:47:11 ttivs1 pvedaemon[276824]: <root@pam> starting task UPID:ttivs1:00048289:00625DEC:670D12CF:qmigrate:101:root@pam:
Oct 14 14:47:13 ttivs1 pmxcfs[1867]: [status] notice: received log
Oct 14 14:47:14 ttivs1 Controller[4651]: 2024-10-14 14:47:14.013 [grizzly-http-server-16] INFO LINSTOR/Controller/58bb4b SYSTEM - REST/API RestClient(10.10.10.3; 'linstor-proxmox/8.0.4')/ModRscDfn
Oct 14 14:47:14 ttivs1 Controller[4651]: 2024-10-14 14:47:14.014 [grizzly-http-server-16] INFO LINSTOR/Controller/58bb4b SYSTEM - Resource definition modified vm-101-disk-1/false
Oct 14 14:47:14 ttivs1 Controller[4651]: 2024-10-14 14:47:14.017 [grizzly-http-server-17] INFO LINSTOR/Controller/08d41b SYSTEM - REST/API RestClient(10.10.10.3; 'linstor-proxmox/8.0.4')/LstVlm
Oct 14 14:47:14 ttivs1 Controller[4651]: 2024-10-14 14:47:14.305 [grizzly-http-server-19] INFO LINSTOR/Controller/988e30 SYSTEM - REST/API RestClient(10.10.10.3; 'linstor-proxmox/8.0.4')/LstVlm
Oct 14 14:47:14 ttivs1 pmxcfs[1867]: [status] notice: received log
Oct 14 14:47:15 ttivs1 pmxcfs[1867]: [status] notice: received log
Oct 14 14:47:15 ttivs1 Controller[4651]: 2024-10-14 14:47:15.743 [grizzly-http-server-21] INFO LINSTOR/Controller/769f8b SYSTEM - REST/API RestClient(10.10.10.3; 'linstor-proxmox/8.0.4')/LstVlm
Oct 14 14:47:15 ttivs1 pmxcfs[1867]: [status] notice: received log
Oct 14 14:47:15 ttivs1 pvedaemon[295561]: migration problems
Oct 14 14:47:15 ttivs1 pvedaemon[276824]: <root@pam> end task UPID:ttivs1:00048289:00625DEC:670D12CF:qmigrate:101:root@pam: migration problems
Updates have been installed since the last successful migration, so could that be the cause of the problem?
Has anyone experienced the same issue?
What kind of additional information would you need from me?
What does cat /proc/drbd; drbdadm status show from the target node?
I’m wondering if the update installed a new kernel without installing a new DRBD kernel module.
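A quick way to check that on the affected node(s), sketched here under the assumption that the DRBD 9 module comes from the drbd-dkms package:

uname -r                                      # running kernel
cat /proc/drbd                                # version of the loaded DRBD module (must be 9.x for LINSTOR)
modinfo drbd | grep -E '^(version|vermagic)'  # module version and the kernel it was built against
dpkg -l | grep -E 'drbd-dkms|drbd-utils|proxmox-kernel|pve-kernel'

If the DKMS module was not rebuilt for the new kernel, /proc/drbd may be missing entirely or may show an in-tree 8.4 module, which the LINSTOR/DRBD 9 stack cannot use.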
How is your Proxmox storage.cfg configured? Does it use the virtual IP address configured in DRBD Reactor for the HA LINSTOR Controller?
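For comparison, a linstor-proxmox entry in /etc/pve/storage.cfg typically looks roughly like this (storage name, resource group and the controller address 10.10.10.100 are placeholders):

drbd: linstor_storage
        content images, rootdir
        controller 10.10.10.100
        resourcegroup pve_rg

If the controller option points at a single node's address rather than the Reactor-managed VIP (or a list of all controller-capable nodes), the plugin can fail whenever the controller is running elsewhere.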
Can you post the output of the following commands (a small collection sketch follows the list):
linstor resource list from the LINSTOR controller
drbdadm status from the source and destination VM migration nodes
drbdsetup show from the source and destination VM migration nodes
grep -ie satellite -ie controller -ie drbd -ie pve /var/log/syslog from the source and destination VM migration nodes as well as the LINSTOR Controller after an attempted migration.
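Purely as a convenience, something along these lines could collect all of that in one go (hypothetical helper, assuming root SSH access from the controller to both nodes; adjust node names, and use journalctl if there is no /var/log/syslog):

#!/bin/bash
# gather-linstor-debug.sh -- illustrative collection helper, not an official tool
NODES="ttivs1 ttivs3"                     # migration source and destination

linstor resource list > linstor-resource-list.txt

for n in $NODES; do
    ssh "root@$n" drbdadm status  > "drbdadm-status-$n.txt"
    ssh "root@$n" drbdsetup show  > "drbdsetup-show-$n.txt"
    ssh "root@$n" "grep -ie satellite -ie controller -ie drbd -ie pve /var/log/syslog" > "syslog-$n.txt"
done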
vm-100-disk-1 role:Secondary
  disk:UpToDate
  ttivs2 role:Secondary
    peer-disk:UpToDate
  ttivs3 role:Secondary
    peer-disk:UpToDate

... and so on for more disks.
vm-100-disk-1 role:Secondary
  disk:UpToDate
  ttivs1 role:Secondary
    peer-disk:UpToDate
  ttivs2 role:Secondary
    peer-disk:UpToDate

... and so on for more disks.
Maybe(!) I made a mistake when setting up the highly available controller.
The documentation reads:
–
The last but nevertheless important step is to configure the LINSTOR satellite services to not delete (and then regenerate) the resource file for the LINSTOR controller DB at its startup. Do not edit the service files directly, but use systemctl edit. Edit the service file on all nodes that could become a LINSTOR controller and that are also LINSTOR satellites.
systemctl edit linstor-satellite
[Service]
Environment=LS_KEEP_RES=linstor_db
Now it’s configured that way, but maybe some resource files came out of sync?
How can I check this?
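A hedged way to check, assuming the default satellite resource-file directory /var/lib/linstor.d and that the controller resource really is named linstor_db:

systemctl cat linstor-satellite | grep LS_KEEP_RES   # is the override actually active on this node?
ls -l /var/lib/linstor.d/linstor_db.res              # does the kept resource file exist?
md5sum /var/lib/linstor.d/linstor_db.res             # run on every controller-capable node and compare
drbdadm status linstor_db                            # DRBD's view of the controller resource

If the file is missing on one node, or the files reference different peers or device minors, the kept resource files have indeed drifted apart.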
Migration fails with:
2024-11-06 16:05:02 starting migration of VM 104 to node 'ttivs3' (10.10.10.3)
2024-11-06 16:05:02 starting VM 104 on remote node 'ttivs3'
2024-11-06 16:05:03 [ttivs3] kvm: -drive file=/dev/drbd/by-res/vm-104-disk-1/0,if=none,id=drive-scsi0,format=raw,cache=none,aio=io_uring,detect-zeroes=on: Could not open '/dev/drbd/by-res/vm-104-disk-1/0': Wrong medium type
2024-11-06 16:05:04 [ttivs3] start failed: QEMU exited with code 1
2024-11-06 16:05:04 ERROR: online migrate failure - remote command failed with exit code 255
2024-11-06 16:05:04 aborting phase 2 - cleanup resources
2024-11-06 16:05:04 migrate_cancel
2024-11-06 16:05:05 ERROR: migration finished with problems (duration 00:00:05)
TASK ERROR: migration problems
I think the underlying DRBD device doesn't perform the Secondary-to-Primary switch, because the 'Wrong medium type' error seems to happen when the device is accessed while the resource is still Secondary.
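One way to test that theory on the target node ttivs3 (a manual sketch; try the promote/demote only with the VM shut down, so that no node holds the resource Primary):

drbdadm status vm-104-disk-1                                          # current roles and quorum
drbdsetup show --show-defaults vm-104-disk-1 | grep -i auto-promote   # is auto-promote enabled?
drbdadm primary vm-104-disk-1 && echo "manual promotion works"
drbdadm secondary vm-104-disk-1                                       # demote again afterwards

If manual promotion fails even with the VM powered off, the problem sits below Proxmox (module mismatch, lost quorum, or a stale peer still holding the device open) rather than in the migration logic itself.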