"SCSI Bus reset detected" error messages within VM on DRBD

Hi,

I run a 3-node Proxmox cluster with LINSTOR/DRBD. The storage pools are backed by ZFS.

One of my VMs has odd disk issues. Inside the guest, the kernel log shows the following:

[Sat Jul 20 20:13:08 2024] scsi target2:0:0: No MSG IN phase after reselection
[Sat Jul 20 20:13:38 2024] sd 2:0:0:0: [sda] tag#266 ABORT operation started
[Sat Jul 20 20:13:43 2024] sd 2:0:0:0: ABORT operation timed-out.
[Sat Jul 20 20:13:43 2024] sd 2:0:0:0: [sda] tag#245 ABORT operation started
[Sat Jul 20 20:13:48 2024] sd 2:0:0:0: ABORT operation timed-out.
[Sat Jul 20 20:13:48 2024] sd 2:0:0:0: [sda] tag#244 ABORT operation started
[Sat Jul 20 20:13:53 2024] sd 2:0:0:0: ABORT operation timed-out.
[Sat Jul 20 20:13:53 2024] sd 2:0:0:0: [sda] tag#243 ABORT operation started
[Sat Jul 20 20:13:58 2024] sd 2:0:0:0: ABORT operation timed-out.
[Sat Jul 20 20:13:58 2024] sd 2:0:0:0: [sda] tag#242 ABORT operation started
[Sat Jul 20 20:14:03 2024] sd 2:0:0:0: ABORT operation timed-out.
[Sat Jul 20 20:14:03 2024] sd 2:0:0:0: [sda] tag#246 ABORT operation started
[Sat Jul 20 20:14:08 2024] sd 2:0:0:0: ABORT operation timed-out.
[Sat Jul 20 20:14:08 2024] sd 2:0:0:0: [sda] tag#247 ABORT operation started
[Sat Jul 20 20:14:13 2024] sd 2:0:0:0: ABORT operation timed-out.
[Sat Jul 20 20:14:13 2024] sd 2:0:0:0: [sda] tag#249 ABORT operation started
[Sat Jul 20 20:14:19 2024] sd 2:0:0:0: ABORT operation timed-out.
[Sat Jul 20 20:14:19 2024] sd 2:0:0:0: [sda] tag#248 ABORT operation started
[Sat Jul 20 20:14:24 2024] sd 2:0:0:0: ABORT operation timed-out.
[Sat Jul 20 20:14:24 2024] sd 2:0:0:0: [sda] tag#267 ABORT operation started
[Sat Jul 20 20:14:29 2024] sd 2:0:0:0: ABORT operation timed-out.
[Sat Jul 20 20:14:29 2024] sd 2:0:0:0: [sda] tag#268 ABORT operation started
[Sat Jul 20 20:14:34 2024] sd 2:0:0:0: ABORT operation timed-out.
[Sat Jul 20 20:14:34 2024] sd 2:0:0:0: [sda] tag#269 ABORT operation started
[Sat Jul 20 20:14:39 2024] sd 2:0:0:0: ABORT operation timed-out.
[Sat Jul 20 20:14:39 2024] sd 2:0:0:0: [sda] tag#270 ABORT operation started
[Sat Jul 20 20:14:44 2024] sd 2:0:0:0: ABORT operation timed-out.
[Sat Jul 20 20:14:44 2024] sd 2:0:0:0: [sda] tag#271 ABORT operation started
[Sat Jul 20 20:14:49 2024] sd 2:0:0:0: ABORT operation timed-out.
[Sat Jul 20 20:14:49 2024] sd 2:0:0:0: [sda] tag#272 ABORT operation started
[Sat Jul 20 20:14:54 2024] sd 2:0:0:0: ABORT operation timed-out.
[Sat Jul 20 20:14:54 2024] sd 2:0:0:0: [sda] tag#250 ABORT operation started
[Sat Jul 20 20:15:00 2024] sd 2:0:0:0: ABORT operation timed-out.
[Sat Jul 20 20:15:00 2024] sd 2:0:0:0: [sda] tag#266 DEVICE RESET operation started
[Sat Jul 20 20:15:05 2024] sd 2:0:0:0: DEVICE RESET operation timed-out.
[Sat Jul 20 20:15:05 2024] sd 2:0:0:0: [sda] tag#250 BUS RESET operation started
[Sat Jul 20 20:15:05 2024] sym0: SCSI BUS reset detected.
[Sat Jul 20 20:15:05 2024] sd 2:0:0:0: BUS RESET operation complete.
[Sat Jul 20 20:15:05 2024] sym0: SCSI BUS has been reset.
[Sat Jul 20 20:15:15 2024] sd 2:0:0:0: Power-on or device reset occurred
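
The sym0 lines indicate the guest disk sits behind the emulated LSI (sym53c8xx) SCSI controller. If it helps, the controller type and disk attachment for this VM can be checked on the host with something like the following (the VM ID 100 is just a placeholder):

pve2 ~ # qm config 100 | grep -E 'scsihw|scsi[0-9]'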

On the Proxmox host side, the configuration is as follows:

pve2 ~ # linstor n l
╭─────────────────────────────────────────────────────╮
┊ Node ┊ NodeType ┊ Addresses ┊ State ┊
╞═════════════════════════════════════════════════════╡
┊ pve1 ┊ SATELLITE ┊ 10.2.12.11:3366 (PLAIN) ┊ Online ┊
┊ pve2 ┊ SATELLITE ┊ 10.2.12.12:3366 (PLAIN) ┊ Online ┊
┊ pve3 ┊ SATELLITE ┊ 10.2.12.13:3366 (PLAIN) ┊ Online ┊
╰─────────────────────────────────────────────────────╯

pve2 ~ # linstor r l |grep 62e40
| pm-62e407a6 | pve1 | 7040 | Unused | Ok | TieBreaker | 2024-07-20 18:57:49 |
| pm-62e407a6 | pve2 | 7040 | InUse | Ok | UpToDate | 2024-07-20 18:56:04 |
| pm-62e407a6 | pve3 | 7040 | Unused | Ok | UpToDate | 2024-07-20 18:56:10 |

In dmesg on the Proxmox host I see a lot of the messages below, but never at the same timestamps as the problems inside the VM (the clocks are in sync):

[Sat Jul 20 19:55:58 2024] sd 3:0:1:0: [sdb] tag#80 Sense Key : Recovered Error [current]
[Sat Jul 20 19:55:58 2024] sd 3:0:1:0: [sdb] tag#80 Add. Sense: Defect list not found
[Sat Jul 20 19:55:58 2024] sd 3:0:2:0: [sdc] tag#124 Sense Key : Recovered Error [current]
[Sat Jul 20 19:55:58 2024] sd 3:0:2:0: [sdc] tag#124 Add. Sense: Defect list not found
[Sat Jul 20 19:55:59 2024] sd 3:0:3:0: [sdd] tag#267 Sense Key : Recovered Error [current]
[Sat Jul 20 19:55:59 2024] sd 3:0:3:0: [sdd] tag#267 Add. Sense: Defect list not found
[Sat Jul 20 19:55:59 2024] sd 3:0:4:0: [sde] tag#293 Sense Key : Recovered Error [current]
[Sat Jul 20 19:55:59 2024] sd 3:0:4:0: [sde] tag#293 Add. Sense: Defect list not found
[Sat Jul 20 19:56:00 2024] sd 3:0:5:0: [sdf] tag#249 Sense Key : Recovered Error [current]
[Sat Jul 20 19:56:00 2024] sd 3:0:5:0: [sdf] tag#249 Add. Sense: Defect list not found
[Sat Jul 20 20:11:07 2024] sd 3:0:1:0: [sdb] tag#801 Sense Key : Recovered Error [current]
[Sat Jul 20 20:11:07 2024] sd 3:0:1:0: [sdb] tag#801 Add. Sense: Defect list not found
[Sat Jul 20 20:11:08 2024] sd 3:0:2:0: [sdc] tag#997 Sense Key : Recovered Error [current]
[Sat Jul 20 20:11:08 2024] sd 3:0:2:0: [sdc] tag#997 Add. Sense: Defect list not found
[Sat Jul 20 20:11:08 2024] sd 3:0:3:0: [sdd] tag#923 Sense Key : Recovered Error [current]
[Sat Jul 20 20:11:08 2024] sd 3:0:3:0: [sdd] tag#923 Add. Sense: Defect list not found
[Sat Jul 20 20:11:08 2024] sd 3:0:4:0: [sde] tag#805 Sense Key : Recovered Error [current]
[Sat Jul 20 20:11:08 2024] sd 3:0:4:0: [sde] tag#805 Add. Sense: Defect list not found
[Sat Jul 20 20:11:09 2024] sd 3:0:5:0: [sdf] tag#921 Sense Key : Recovered Error [current]
[Sat Jul 20 20:11:09 2024] sd 3:0:5:0: [sdf] tag#921 Add. Sense: Defect list not found
[Sat Jul 20 20:27:07 2024] sd 3:0:1:0: [sdb] tag#904 Sense Key : Recovered Error [current]
[Sat Jul 20 20:27:07 2024] sd 3:0:1:0: [sdb] tag#904 Add. Sense: Defect list not found
[Sat Jul 20 20:27:08 2024] sd 3:0:2:0: [sdc] tag#779 Sense Key : Recovered Error [current]
[Sat Jul 20 20:27:08 2024] sd 3:0:2:0: [sdc] tag#779 Add. Sense: Defect list not found
[Sat Jul 20 20:27:08 2024] sd 3:0:3:0: [sdd] tag#794 Sense Key : Recovered Error [current]
[Sat Jul 20 20:27:08 2024] sd 3:0:3:0: [sdd] tag#794 Add. Sense: Defect list not found
[Sat Jul 20 20:27:09 2024] sd 3:0:4:0: [sde] tag#70 Sense Key : Recovered Error [current]
[Sat Jul 20 20:27:09 2024] sd 3:0:4:0: [sde] tag#70 Add. Sense: Defect list not found
[Sat Jul 20 20:27:09 2024] sd 3:0:5:0: [sdf] tag#802 Sense Key : Recovered Error [current]
[Sat Jul 20 20:27:09 2024] sd 3:0:5:0: [sdf] tag#802 Add. Sense: Defect list not found
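
My guess is that these "Recovered Error / Defect list not found" entries on the host come from periodic health polling of the disks rather than from real media problems, but I have not confirmed that. To rule out the drives themselves, I can pull the SMART data per disk, e.g.:

pve2 ~ # smartctl -a /dev/sdb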

The ZFS pool looks as follows:

pve2 ~ # zpool status -v
pool: zpool_ssd
state: ONLINE
scan: scrub repaired 0B in 00:29:01 with 0 errors on Sun Jul 14 00:53:02 2024
config:

NAME          STATE     READ WRITE CKSUM
zpool_ssd     ONLINE       0     0     0
  raidz1-0    ONLINE       0     0     0
    sdb       ONLINE       0     0     0
    sdc       ONLINE       0     0     0
    sdd       ONLINE       0     0     0
    sde       ONLINE       0     0     0
    sdf       ONLINE       0     0     0

errors: No known data errors
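
ZFS itself reports no errors. To see whether ZED recorded any device-level events around the time of the resets, something like this should work:

pve2 ~ # zpool events -v zpool_ssd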

I do not know whether these two sets of messages are related, or what causes the SCSI bus resets inside the VM.

Any ideas or hints on how to debug this further are very welcome.
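
For the next occurrence I plan to capture the DRBD resource state on the host and, as a workaround attempt, raise the SCSI command timeout inside the guest (the value 180 below is just an example, not a recommendation):

pve2 ~ # drbdadm status pm-62e407a6
vm ~ # cat /sys/block/sda/device/timeout
vm ~ # echo 180 > /sys/block/sda/device/timeout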

Additional info:

pve2 ~ # cat /proc/drbd
version: 9.2.10 (api:2/proto:86-122)
GIT-hash: b92a320cb72a0b85144e742da5930f2d3b6ce30c build by root@pve2, 2024-07-01 20:24:55
Transports (api:21): tcp (9.2.10)

pve2 ~ # dpkg -l |grep linstor
ii linstor-client 1.23.0-1 all Linstor client command line tool
ii linstor-common 1.28.0-1 all DRBD distributed resource management utility
ii linstor-controller 1.28.0-1 all DRBD distributed resource management utility
ii linstor-proxmox 8.0.3-1 all DRBD distributed resource management utility
ii linstor-satellite 1.28.0-1 all DRBD distributed resource management utility
ii python-linstor 1.23.0-1 all Linstor python api library

pve2 ~ # dpkg -l |grep zfs
ii libzfs4linux 2.2.4-pve1 amd64 OpenZFS filesystem library for Linux - general support
ii zfs-initramfs 2.2.4-pve1 all OpenZFS root filesystem capabilities for Linux - initramfs
ii zfs-zed 2.2.4-pve1 amd64 OpenZFS Event Daemon
ii zfsutils-linux 2.2.4-pve1 amd64 command-line tools to manage OpenZFS filesystems