Solved: Linstor HA with Proxmox: systemd "degraded" after reboot of backup node

I’m in the process of building up a production Proxmox+Linstor cluster. I’m up to 2 nodes, which for historical reasons are called virtual1 and virtual5. I’ve added Linstor HA as per section 3.2 of the Linstor Guide.

For testing purposes, I rebooted virtual5 (the backup). When it came back up, systemd says the system is “degraded”:

root@virtual5:~# systemctl status
● virtual5
    State: degraded
    Units: 569 loaded (incl. loaded aliases)
     Jobs: 0 queued
   Failed: 1 units
...

root@virtual5:~# systemctl status --failed
× drbd-promote@linstor_db.service - Promotion of DRBD resource linstor_db
     Loaded: loaded (/lib/systemd/system/drbd-promote@.service; static)
    Drop-In: /run/systemd/system/drbd-promote@linstor_db.service.d
             └─reactor.conf
     Active: failed (Result: exit-code) since Sat 2025-05-10 12:04:26 PDT; 2min 42s ago
       Docs: man:drbd-promote@.service
    Process: 2499 ExecStart=/usr/lib/drbd/scripts/drbd-service-shim.sh primary linstor_db (code=exited, status=17)
   Main PID: 2499 (code=exited, status=17)
        CPU: 2ms

May 10 12:04:26 virtual5 systemd[1]: Starting drbd-promote@linstor_db.service - Promotion of DRBD resource linstor_db...
May 10 12:04:26 virtual5 drbd-linstor_db[2499]: linstor_db: State change failed: (-2) Need access to UpToDate data
May 10 12:04:26 virtual5 systemd[1]: drbd-promote@linstor_db.service: Main process exited, code=exited, status=17/n/a
May 10 12:04:26 virtual5 systemd[1]: drbd-promote@linstor_db.service: Failed with result 'exit-code'.
May 10 12:04:26 virtual5 systemd[1]: Failed to start drbd-promote@linstor_db.service - Promotion of DRBD resource linstor_db.

Is this normal? (In a 2-node setup?)

I’m not sure why it’s trying to promote the linstor_db resource - it should remain as secondary.

This isn’t stopping the controller from working on the primary node, but I’d like to get to the bottom of what’s going on, and I certainly wouldn’t want the secondary to accidentally promote itself.

I’m pretty sure everything is set up right. The linstor_db resource is replicated:

root@virtual1:~# linstor v l --resource linstor_db -p
+------------------------------------------------------------------------------------------------------------------------+
| Resource   | Node     | StoragePool | VolNr | MinorNr | DeviceName    | Allocated | InUse  |    State | Repl           |
|========================================================================================================================|
| linstor_db | virtual1 | sata_ssd    |     0 |    1002 | /dev/drbd1002 |   204 MiB | InUse  | UpToDate | Established(1) |
| linstor_db | virtual5 | sata_ssd    |     0 |    1002 | /dev/drbd1002 |   204 MiB | Unused | UpToDate | Established(1) |
+------------------------------------------------------------------------------------------------------------------------+

root@virtual1:~# linstor rg lp linstor-db-grp -p
+----------------------------------------------------------------------+
| Key                                                | Value           |
|======================================================================|
| DrbdOptions/Resource/auto-promote                  | no              |
| DrbdOptions/Resource/on-no-data-accessible         | io-error        |
| DrbdOptions/Resource/on-no-quorum                  | io-error        |
| DrbdOptions/Resource/on-suspended-primary-outdated | force-secondary |
| DrbdOptions/Resource/quorum                        | majority        |
| Internal/Drbd/QuorumSetBy                          | user            |
+----------------------------------------------------------------------+

The DRBD device node exists on the secondary:

root@virtual5:~# ls -l /dev/drbd/by-res/linstor_db/
total 0
lrwxrwxrwx 1 root root 17 May 10 12:04 0 -> ../../../drbd1002

Both nodes agree which is primary (no split brain AFAICS):

root@virtual1:~# drbdadm status linstor_db
linstor_db role:Primary
  disk:UpToDate open:yes
  virtual5 role:Secondary
    peer-disk:UpToDate
root@virtual5:~# drbdadm status linstor_db
linstor_db role:Secondary
  disk:UpToDate open:no
  virtual1 role:Primary
    peer-disk:UpToDate

This is the status from the reactor on the secondary:

root@virtual5:~# drbd-reactorctl status linstor_db
/etc/drbd-reactor.d/linstor_db.toml:
Promoter: Currently active on node 'virtual1'
○ drbd-services@linstor_db.target
× ├─ drbd-promote@linstor_db.service
○ ├─ var-lib-linstor.mount
○ └─ linstor-controller.service

The configs are identical between primary and secondary:

root@virtual5:~# cat /etc/drbd-reactor.d/linstor_db.toml
[[promoter]]
[promoter.resources.linstor_db]
start = ["var-lib-linstor.mount", "linstor-controller.service"]

root@virtual5:~# cat /etc/systemd/system/var-lib-linstor.mount
[Unit]
Description=Filesystem for the LINSTOR controller

[Mount]
What=/dev/drbd/by-res/linstor_db/0
Where=/var/lib/linstor

root@virtual5:~# cat /etc/systemd/system/linstor-satellite.service.d/override.conf
[Service]
Environment=LS_KEEP_RES=linstor_db

Any clues as to what else I can look for? Thanks!

P.S. A restart of drbd-reactor doesn’t change the system status, nor does it change the output from drbd-reactorctl.

root@virtual5:~# systemctl restart drbd-reactor
root@virtual5:~# systemctl status | head -2
● virtual5
    State: degraded
root@virtual5:~# drbd-reactorctl status linstor_db
/etc/drbd-reactor.d/linstor_db.toml:
Promoter: Currently active on node 'virtual1'
○ drbd-services@linstor_db.target
× ├─ drbd-promote@linstor_db.service
○ ├─ var-lib-linstor.mount
○ └─ linstor-controller.service

I tried setting log level = “debug” in drbd-reactor.toml, and rebooting. This shows:

May 11 09:57:41 virtual5 drbd-reactor[2237]: INFO [drbd_reactor::plugin::promoter] adjust_resources: waiting for backing device '/dev/sata_ssd/linstor_db_00000' to become ready
May 11 09:57:41 virtual5 drbd-reactor[2237]: INFO [drbd_reactor::plugin::promoter] adjust_resources: backing device '/dev/sata_ssd/linstor_db_00000' now ready
May 11 09:57:41 virtual5 drbd-reactor[2237]: INFO [drbd_reactor::plugin::promoter] Checking DRBD options for resource 'linstor_db'
May 11 09:57:41 virtual5 drbd-reactor[2237]: INFO [drbd_reactor::plugin::promoter] systemd_write_unit: creating "/run/systemd/system/drbd-promote@linstor_db.service.d/reactor.conf"
May 11 09:57:41 virtual5 drbd-reactor[2237]: INFO [drbd_reactor::plugin::promoter] systemd_write_unit: creating "/run/systemd/system/var-lib-linstor.mount.d/reactor-50-mount.conf"
May 11 09:57:41 virtual5 drbd-reactor[2237]: INFO [drbd_reactor::plugin::promoter] systemd_write_unit: creating "/run/systemd/system/var-lib-linstor.mount.d/reactor.conf"
May 11 09:57:41 virtual5 drbd-reactor[2237]: INFO [drbd_reactor::plugin::promoter] systemd_write_unit: creating "/run/systemd/system/linstor-controller.service.d/reactor.conf"
May 11 09:57:41 virtual5 drbd-reactor[2237]: INFO [drbd_reactor::plugin::promoter] systemd_write_unit: creating "/run/systemd/system/drbd-services@linstor_db.target.d/reactor.conf"
May 11 09:57:41 virtual5 drbd-reactor[2237]: INFO [drbd_reactor::plugin::promoter] systemd_write_unit: creating "/run/systemd/system/drbd-services@linstor_db.target.d/reactor-50-before.conf"
May 11 09:57:41 virtual5 drbd-reactor[2237]: INFO [drbd_reactor::plugin] systemd_daemon_reload: reloading daemon
May 11 09:57:41 virtual5 systemd[1]: Started drbd-reactor.service - DRBD-Reactor Service.
May 11 09:57:41 virtual5 drbd-reactor[2237]: DEBUG [drbd_reactor] main: started.len()=1
May 11 09:57:41 virtual5 drbd-reactor[2237]: INFO [drbd_reactor::plugin::promoter] systemd_start: systemctl start drbd-services@linstor_db.target
May 11 09:57:41 virtual5 systemctl[2591]: A dependency job for drbd-services@linstor_db.target failed. See 'journalctl -xe' for details.
May 11 09:57:41 virtual5 drbd-reactor[2237]: WARN [drbd_reactor::plugin::promoter] Starting 'linstor_db' failed: Return code not status success
May 11 09:57:41 virtual5 drbd-reactor[2237]: INFO [drbd_reactor::plugin::promoter] stop_actions (could trigger failure actions (e.g., reboot)): linstor_db
May 11 09:57:41 virtual5 drbd-reactor[2237]: INFO [drbd_reactor::plugin::promoter] stop_actions: stopping 'drbd-services@linstor_db.target'
May 11 09:57:41 virtual5 drbd-reactor[2237]: INFO [drbd_reactor::plugin::promoter] systemd_stop: systemctl stop drbd-services@linstor_db.target
May 11 09:58:13 virtual5 drbd-reactor[2237]: DEBUG [drbd_reactor::events] process_events2: could not parse line 'call helper name:linstor_db peer-node-id:0 conn-name:virtual1 volume:0 minor:1002 helper:before-resync-target', because events: parse_events2_line: unknown events type: call
May 11 09:58:13 virtual5 drbd-reactor[2237]: DEBUG [drbd_reactor::events] process_events2: could not parse line 'response helper name:linstor_db peer-node-id:0 conn-name:virtual1 volume:0 minor:1002 helper:before-resync-target status:0', because events: parse_events2_line: unknown events type: response
May 11 09:58:13 virtual5 drbd-reactor[2237]: DEBUG [drbd_reactor::events] process_events2: could not parse line 'call helper name:linstor_db peer-node-id:0 conn-name:virtual1 volume:0 minor:1002 helper:after-resync-target', because events: parse_events2_line: unknown events type: call
May 11 09:58:13 virtual5 drbd-reactor[2237]: DEBUG [drbd_reactor::events] process_events2: could not parse line 'response helper name:linstor_db peer-node-id:0 conn-name:virtual1 volume:0 minor:1002 helper:after-resync-target status:0', because events: parse_events2_line: unknown events type: response
May 11 09:58:15 virtual5 drbd-reactor[2237]: DEBUG [drbd_reactor::events] process_events2: could not parse line 'call helper name:pm-72c2f543 peer-node-id:0 conn-name:virtual1 volume:0 minor:1005 helper:before-resync-target', because events: parse_events2_line: unknown events type: call
May 11 09:58:15 virtual5 drbd-reactor[2237]: DEBUG [drbd_reactor::events] process_events2: could not parse line 'response helper name:pm-72c2f543 peer-node-id:0 conn-name:virtual1 volume:0 minor:1005 helper:before-resync-target status:0', because events: parse_events2_line: unknown events type: response
May 11 09:58:15 virtual5 drbd-reactor[2237]: DEBUG [drbd_reactor::events] process_events2: could not parse line 'call helper name:pm-72c2f543 peer-node-id:0 conn-name:virtual1 volume:0 minor:1005 helper:after-resync-target', because events: parse_events2_line: unknown events type: call
May 11 09:58:15 virtual5 drbd-reactor[2237]: DEBUG [drbd_reactor::events] process_events2: could not parse line 'response helper name:pm-72c2f543 peer-node-id:0 conn-name:virtual1 volume:0 minor:1005 helper:after-resync-target status:0', because events: parse_events2_line: unknown events type: response
May 11 09:58:15 virtual5 drbd-reactor[2237]: DEBUG [drbd_reactor::events] process_events2: could not parse line 'call helper name:pm-d5eb1b28 peer-node-id:0 conn-name:virtual1 volume:0 minor:1003 helper:before-resync-target', because events: parse_events2_line: unknown events type: call
May 11 09:58:15 virtual5 drbd-reactor[2237]: DEBUG [drbd_reactor::events] process_events2: could not parse line 'response helper name:pm-d5eb1b28 peer-node-id:0 conn-name:virtual1 volume:0 minor:1003 helper:before-resync-target status:0', because events: parse_events2_line: unknown events type: response
May 11 09:58:15 virtual5 drbd-reactor[2237]: DEBUG [drbd_reactor::events] process_events2: could not parse line 'call helper name:pm-af4db968 peer-node-id:0 conn-name:virtual1 volume:0 minor:1006 helper:before-resync-target', because events: parse_events2_line: unknown events type: call
May 11 09:58:15 virtual5 drbd-reactor[2237]: DEBUG [drbd_reactor::events] process_events2: could not parse line 'response helper name:pm-af4db968 peer-node-id:0 conn-name:virtual1 volume:0 minor:1006 helper:before-resync-target status:0', because events: parse_events2_line: unknown events type: response
May 11 09:58:15 virtual5 drbd-reactor[2237]: DEBUG [drbd_reactor::events] process_events2: could not parse line 'call helper name:pm-d5eb1b28 peer-node-id:0 conn-name:virtual1 volume:0 minor:1003 helper:after-resync-target', because events: parse_events2_line: unknown events type: call
May 11 09:58:15 virtual5 drbd-reactor[2237]: DEBUG [drbd_reactor::events] process_events2: could not parse line 'response helper name:pm-d5eb1b28 peer-node-id:0 conn-name:virtual1 volume:0 minor:1003 helper:after-resync-target status:0', because events: parse_events2_line: unknown events type: response
May 11 09:58:15 virtual5 drbd-reactor[2237]: DEBUG [drbd_reactor::events] process_events2: could not parse line 'call helper name:pm-af4db968 peer-node-id:0 conn-name:virtual1 volume:0 minor:1006 helper:after-resync-target', because events: parse_events2_line: unknown events type: call
May 11 09:58:15 virtual5 drbd-reactor[2237]: DEBUG [drbd_reactor::events] process_events2: could not parse line 'response helper name:pm-af4db968 peer-node-id:0 conn-name:virtual1 volume:0 minor:1006 helper:after-resync-target status:0', because events: parse_events2_line: unknown events type: response

Looking at the source, it’s checking that the backing device is “ready”, which AFAICT just means that the backing device node (/dev/sata_ssd/linstor_db_00000 in this case) exists and is a block device:

fn drbd_backing_device_ready(dev: &str) -> bool {
    dev == "none"
        || match fs::metadata(dev) {
            Err(_) => false,
            Ok(meta) => meta.file_type().is_block_device(),
        }
}

Then it generates the systemd drop-in units, which is fine.

But I still can’t see why it decided to start drbd-services@linstor_db.target immediately. Is it intentional to do this unconditionally for all resources at startup, rather than waiting for an indication that this node may promote the resource?

And although it then goes on to stop it, it still leaves drbd-promote@linstor_db.service in a failed state.

maybe you can try

journalctl --since "2025-05-11 09:57:00" --until "2025-05-11 09:58:00" | grep -i drbd

to see exactly what caused the promotion attempt to fail. From my experience, I would guess it’s designed to try to promote right away and only back off when something along the way stops it, which would make sense.

Here you go:

root@virtual5:~# journalctl --since "2025-05-11 09:57:00" --until "2025-05-11 09:58:00" | grep -i drbd
May 11 09:57:38 virtual5 systemd[1]: Reached target drbd-configured.target - All DRBD resources configured (all enabled drbd@.target units started).
May 11 09:57:41 virtual5 systemd[1]: Starting drbd-reactor.service - DRBD-Reactor Service...
May 11 09:57:41 virtual5 drbd-reactor[2237]: DEBUG [drbd_reactor] signal-handler: set up done
May 11 09:57:41 virtual5 drbd-reactor[2237]: DEBUG [drbd_reactor] signal-handler: waiting for signals
May 11 09:57:41 virtual5 drbd-reactor[2237]: DEBUG [drbd_reactor::events] events2_loop: starting process_events2 loop
May 11 09:57:41 virtual5 drbd-reactor[2237]: DEBUG [drbd_reactor] main: configuration: Config {
May 11 09:57:41 virtual5 drbd-reactor[2237]:     log: [
May 11 09:57:41 virtual5 drbd-reactor[2237]:         LogConfig {
May 11 09:57:41 virtual5 drbd-reactor[2237]:             level: Debug,
May 11 09:57:41 virtual5 drbd-reactor[2237]:             file: None,
May 11 09:57:41 virtual5 drbd-reactor[2237]:         },
May 11 09:57:41 virtual5 drbd-reactor[2237]:     ],
May 11 09:57:41 virtual5 drbd-reactor[2237]:     statistics_poll_interval: 60,
May 11 09:57:41 virtual5 drbd-reactor[2237]:     snippets: Some(
May 11 09:57:41 virtual5 drbd-reactor[2237]:         "/etc/drbd-reactor.d",
May 11 09:57:41 virtual5 drbd-reactor[2237]:     ),
May 11 09:57:41 virtual5 drbd-reactor[2237]:     plugins: PluginConfig {
May 11 09:57:41 virtual5 drbd-reactor[2237]:         promoter: [
May 11 09:57:41 virtual5 drbd-reactor[2237]:             PromoterConfig {
May 11 09:57:41 virtual5 drbd-reactor[2237]:                 resources: {
May 11 09:57:41 virtual5 drbd-reactor[2237]:                     "linstor_db": PromoterOptResource {
May 11 09:57:41 virtual5 drbd-reactor[2237]:                         start: [
May 11 09:57:41 virtual5 drbd-reactor[2237]:                             "var-lib-linstor.mount",
May 11 09:57:41 virtual5 drbd-reactor[2237]:                             "linstor-controller.service",
May 11 09:57:41 virtual5 drbd-reactor[2237]:                         ],
May 11 09:57:41 virtual5 drbd-reactor[2237]:                         stop: [],
May 11 09:57:41 virtual5 drbd-reactor[2237]:                         on_stop_failure: "",
May 11 09:57:41 virtual5 drbd-reactor[2237]:                         stop_services_on_exit: false,
May 11 09:57:41 virtual5 drbd-reactor[2237]:                         runner: Systemd,
May 11 09:57:41 virtual5 drbd-reactor[2237]:                         dependencies_as: Requires,
May 11 09:57:41 virtual5 drbd-reactor[2237]:                         target_as: Requires,
May 11 09:57:41 virtual5 drbd-reactor[2237]:                         on_drbd_demote_failure: None,
May 11 09:57:41 virtual5 drbd-reactor[2237]:                         sleep_before_promote_factor: 1,
May 11 09:57:41 virtual5 drbd-reactor[2237]:                         preferred_nodes: [],
May 11 09:57:41 virtual5 drbd-reactor[2237]:                         secondary_force: true,
May 11 09:57:41 virtual5 drbd-reactor[2237]:                         on_quorum_loss: Shutdown,
May 11 09:57:41 virtual5 drbd-reactor[2237]:                     },
May 11 09:57:41 virtual5 drbd-reactor[2237]:                 },
May 11 09:57:41 virtual5 drbd-reactor[2237]:                 id: None,
May 11 09:57:41 virtual5 drbd-reactor[2237]:             },
May 11 09:57:41 virtual5 drbd-reactor[2237]:         ],
May 11 09:57:41 virtual5 drbd-reactor[2237]:         debugger: [],
May 11 09:57:41 virtual5 drbd-reactor[2237]:         umh: [],
May 11 09:57:41 virtual5 drbd-reactor[2237]:         prometheus: [],
May 11 09:57:41 virtual5 drbd-reactor[2237]:         agentx: [],
May 11 09:57:41 virtual5 drbd-reactor[2237]:     },
May 11 09:57:41 virtual5 drbd-reactor[2237]: }
May 11 09:57:41 virtual5 drbd-reactor[2237]: INFO [drbd_reactor::plugin::promoter] adjust_resources: waiting for backing device '/dev/sata_ssd/linstor_db_00000' to become ready
May 11 09:57:41 virtual5 drbd-reactor[2237]: INFO [drbd_reactor::plugin::promoter] adjust_resources: backing device '/dev/sata_ssd/linstor_db_00000' now ready
May 11 09:57:41 virtual5 kernel: drbd: module verification failed: signature and/or required key missing - tainting kernel
May 11 09:57:41 virtual5 kernel: drbd: initialized. Version: 9.2.13 (api:2/proto:118-122)
May 11 09:57:41 virtual5 kernel: drbd: GIT-hash: 0457237e0448663529fe161781873b356f17b3c5 build by root@virtual5, 2025-05-10 08:10:03
May 11 09:57:41 virtual5 kernel: drbd: registered as block device major 147
May 11 09:57:41 virtual5 kernel: drbd linstor_db: Starting worker thread (node-id 1)
May 11 09:57:41 virtual5 kernel: drbd: registered transport class 'tcp' (version:9.2.13)
May 11 09:57:41 virtual5 systemd[1]: Finished drbd-graceful-shutdown.service - ensure all DRBD resources shut down gracefully at system shut down.
May 11 09:57:41 virtual5 kernel: drbd linstor_db virtual1: Starting sender thread (peer-node-id 0)
May 11 09:57:41 virtual5 kernel: drbd linstor_db/0 drbd1002: meta-data IO uses: blk-bio
May 11 09:57:41 virtual5 kernel: drbd linstor_db/0 drbd1002: disk( Diskless -> Attaching ) [attach]
May 11 09:57:41 virtual5 kernel: drbd linstor_db/0 drbd1002: Maximum number of peer devices = 7
May 11 09:57:41 virtual5 kernel: drbd linstor_db: Method to ensure write ordering: flush
May 11 09:57:41 virtual5 kernel: drbd linstor_db/0 drbd1002: drbd_bm_resize called with capacity == 417624
May 11 09:57:41 virtual5 kernel: drbd linstor_db/0 drbd1002: resync bitmap: bits=52203 words=5712 pages=12
May 11 09:57:41 virtual5 kernel: drbd1002: detected capacity change from 0 to 417624
May 11 09:57:41 virtual5 kernel: drbd linstor_db/0 drbd1002: size = 204 MB (208812 KB)
May 11 09:57:41 virtual5 kernel: drbd linstor_db/0 drbd1002: disk( Attaching -> Outdated ) [attach]
May 11 09:57:41 virtual5 kernel: drbd linstor_db/0 drbd1002: attached to current UUID: D734B802E66C305C
May 11 09:57:41 virtual5 kernel: drbd linstor_db virtual1: conn( StandAlone -> Unconnected ) [connect]
May 11 09:57:41 virtual5 drbd-reactor[2237]: INFO [drbd_reactor::plugin::promoter] Checking DRBD options for resource 'linstor_db'
May 11 09:57:41 virtual5 kernel: drbd linstor_db virtual1: Starting receiver thread (peer-node-id 0)
May 11 09:57:41 virtual5 kernel: drbd linstor_db virtual1: conn( Unconnected -> Connecting ) [connecting]
May 11 09:57:41 virtual5 drbd-reactor[2237]: INFO [drbd_reactor::plugin::promoter] systemd_write_unit: creating "/run/systemd/system/drbd-promote@linstor_db.service.d/reactor.conf"
May 11 09:57:41 virtual5 drbd-reactor[2237]: INFO [drbd_reactor::plugin::promoter] systemd_write_unit: creating "/run/systemd/system/var-lib-linstor.mount.d/reactor-50-mount.conf"
May 11 09:57:41 virtual5 drbd-reactor[2237]: INFO [drbd_reactor::plugin::promoter] systemd_write_unit: creating "/run/systemd/system/var-lib-linstor.mount.d/reactor.conf"
May 11 09:57:41 virtual5 drbd-reactor[2237]: INFO [drbd_reactor::plugin::promoter] systemd_write_unit: creating "/run/systemd/system/linstor-controller.service.d/reactor.conf"
May 11 09:57:41 virtual5 drbd-reactor[2237]: INFO [drbd_reactor::plugin::promoter] systemd_write_unit: creating "/run/systemd/system/drbd-services@linstor_db.target.d/reactor.conf"
May 11 09:57:41 virtual5 drbd-reactor[2237]: INFO [drbd_reactor::plugin::promoter] systemd_write_unit: creating "/run/systemd/system/drbd-services@linstor_db.target.d/reactor-50-before.conf"
May 11 09:57:41 virtual5 drbd-reactor[2237]: INFO [drbd_reactor::plugin] systemd_daemon_reload: reloading daemon
May 11 09:57:41 virtual5 systemd[1]: Started drbd-reactor.service - DRBD-Reactor Service.
May 11 09:57:41 virtual5 drbd-reactor[2237]: DEBUG [drbd_reactor] main: started.len()=1
May 11 09:57:41 virtual5 drbd-reactor[2237]: INFO [drbd_reactor::plugin::promoter] systemd_start: systemctl start drbd-services@linstor_db.target
May 11 09:57:41 virtual5 systemd[1]: Created slice system-drbd.slice - Slice /system/drbd.
May 11 09:57:41 virtual5 systemd[1]: Created slice system-drbd\x2dpromote.slice - Slice /system/drbd-promote.
May 11 09:57:41 virtual5 systemd[1]: Starting drbd@linstor_db.service - (Re)configure DRBD resource linstor_db...
May 11 09:57:41 virtual5 systemd[1]: Finished drbd@linstor_db.service - (Re)configure DRBD resource linstor_db.
May 11 09:57:41 virtual5 systemd[1]: Reached target drbd@linstor_db.target - DBRD linstor_db configured.
May 11 09:57:41 virtual5 systemd[1]: Starting drbd-promote@linstor_db.service - Promotion of DRBD resource linstor_db...
May 11 09:57:41 virtual5 drbd-linstor_db[2611]: linstor_db: State change failed: (-2) Need access to UpToDate data
May 11 09:57:41 virtual5 systemd[1]: drbd-promote@linstor_db.service: Main process exited, code=exited, status=17/n/a
May 11 09:57:41 virtual5 systemd[1]: drbd-promote@linstor_db.service: Failed with result 'exit-code'.
May 11 09:57:41 virtual5 systemd[1]: Failed to start drbd-promote@linstor_db.service - Promotion of DRBD resource linstor_db.
May 11 09:57:41 virtual5 systemd[1]: Dependency failed for var-lib-linstor.mount - drbd-reactor controlled var-lib-linstor.
May 11 09:57:41 virtual5 systemd[1]: Dependency failed for linstor-controller.service - drbd-reactor controlled linstor-controller.
May 11 09:57:41 virtual5 systemd[1]: Dependency failed for drbd-services@linstor_db.target - Services for DRBD resource linstor_db.
May 11 09:57:41 virtual5 systemd[1]: drbd-services@linstor_db.target: Job drbd-services@linstor_db.target/start failed with result 'dependency'.
May 11 09:57:41 virtual5 systemctl[2591]: A dependency job for drbd-services@linstor_db.target failed. See 'journalctl -xe' for details.
May 11 09:57:41 virtual5 drbd-reactor[2237]: WARN [drbd_reactor::plugin::promoter] Starting 'linstor_db' failed: Return code not status success
May 11 09:57:41 virtual5 drbd-reactor[2237]: INFO [drbd_reactor::plugin::promoter] stop_actions (could trigger failure actions (e.g., reboot)): linstor_db
May 11 09:57:41 virtual5 drbd-reactor[2237]: INFO [drbd_reactor::plugin::promoter] stop_actions: stopping 'drbd-services@linstor_db.target'
May 11 09:57:41 virtual5 drbd-reactor[2237]: INFO [drbd_reactor::plugin::promoter] systemd_stop: systemctl stop drbd-services@linstor_db.target
May 11 09:57:43 virtual5 Satellite[2239]: 2025-05-11 09:57:43.682 [Main] INFO  LINSTOR/Satellite/ff71f5 SYSTEM - Checking support for DRBD9_KERNEL: supported (9.2.13)
May 11 09:57:43 virtual5 Satellite[2239]: 2025-05-11 09:57:43.683 [Main] INFO  LINSTOR/Satellite/ff71f5 SYSTEM - Checking support for DRBD9_UTILS: supported (9.31.0)
May 11 09:57:43 virtual5 Satellite[2239]: 2025-05-11 09:57:43.687 [Main] INFO  LINSTOR/Satellite/ff71f5 SYSTEM - Checking support for DRBD_PROXY: NOT supported:
May 11 09:57:44 virtual5 Satellite[2239]: 2025-05-11 09:57:44.003 [Main] INFO  LINSTOR/Satellite/ff71f5 SYSTEM - Starting service instance 'DrbdEventService-1' of type DrbdEventService
May 11 09:57:44 virtual5 Satellite[2239]: 2025-05-11 09:57:44.003 [Main] INFO  LINSTOR/Satellite/ff71f5 SYSTEM - Starting service instance 'DrbdEventPublisher-1' of type DrbdEventPublisher

I think it’s correct that promotion fails; the other node is already primary, with /var/lib/linstor mounted and the controller running there. It would be a terrible thing to pre-empt it.

But on this secondary node, I was expecting the reactor not to attempt to promote itself to primary in the first place. Why would it, if it can see that there’s valid quorum and some other node is already primary?

That’s really what I’m trying to get to the bottom of. Have I misconfigured drbd-reactor in some way (although the config comes straight from the Linstor documentation)? Or does drbd-reactor always attempt to promote resources on startup?

By trying and failing, it causes systemd to end up in an unclean “degraded” state, and that’s ultimately what I’m trying to eliminate.
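
For what it’s worth, this is roughly the check I was expecting it to make first. A rough sketch, assuming jq is available, reading the same JSON that drbdsetup produces:

# Sketch only: print the name of any peer that is already Primary for linstor_db
drbdsetup status --json linstor_db \
  | jq -r '.[0].connections[] | select(."peer-role" == "Primary") | .name'

Run on the secondary right now, that prints virtual1, so I’d have expected the reactor to leave the resource alone.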

Aside: for some reason my initial post has been “flagged by the community” :frowning:

So I don’t think that shows what I was looking for, but I’ll share what I had in mind. I saw a similar incident on my cluster, and after going through the logs thoroughly it seems to me that it is intentional to let the node try to promote until it finds out that it shouldn’t. This made sense to me because failing along the way is quicker than checking everything up front, which is better HA-wise. As for whether it really works that way, or why that route was taken, we’d need an insider to tell us.

I looked in the source; there’s clearly an intention not to promote if another node is already Primary, although if there is any doubt it will go ahead and try anyway.

===== src/plugin/promoter.rs =====

impl super::Plugin for Promoter {
    fn run(&self, rx: super::PluginReceiver) -> Result<()> {
...
        let names = self.cfg.resources.keys().cloned().collect::<Vec<String>>();
...
        // the target might be in a half started state, for example after a "disable" failed to
        // demote.
        // Then on start we might never get a may_promote:yes and we never start the services on
        // top of the promoted Primary.
        // inserting the names into may_promote HashSet is tempting, but the first update might delete
        // them before the first ticker, so just start them as usual and be done
        for name in &names {
            if !try_initial_target_start(name) {
                continue;
            }
            if let Some(res) = cfg.resources.get(name) {
                // startup is usually always racy, but pretend nodes started at the same time and
                // calculate a position in the preferred nodes. This also helps tests.
                let sleep_s = get_preferred_nodes_sleep_s(&res.preferred_nodes);
                thread::sleep(time::Duration::from_secs(sleep_s));

                try_start_stop_actions(name, &res.start, &res.stop, &res.runner);
            }
        }
...
fn try_initial_target_start(name: &str) -> bool {
    // if we know for sure that a remote is Primary, then we don't try
    // in all other cases, even if unsure, we can try
    match get_primary(name) {
        Ok(PrimaryOn::Remote(_)) => false,
        _ => true,
    }
}

===== src/drbd.rs =====

pub fn get_primary(drbd_resource: &str) -> anyhow::Result<PrimaryOn> {
    let output = Command::new("drbdsetup")
        .arg("status")
        .arg("--json")
        .arg(drbd_resource)
        .output()?;
    if !output.status.success() {
        return Err(anyhow::anyhow!(
            "'drbdsetup show' not executed successfully"
        ));
    }
...
    let resources: Vec<Resource> = serde_json::from_slice(&output.stdout)?;
    if resources.len() != 1 {
        return Err(anyhow::anyhow!(
            "resources length from drbdsetup status not exactly 1"
        ));
    }

    // is it me?
    if resources[0].role == Role::Primary {
        return Ok(PrimaryOn::Local);
    }

    // a peer?
    for conn in &resources[0].connections {
        if conn.peer_role == Role::Primary {
            return Ok(PrimaryOn::Remote(conn.name.clone()));
        }
    }

    Ok(PrimaryOn::None)
}

I checked drbdsetup status output on the secondary node, and it looks OK:

root@virtual5:~# drbdsetup status --json linstor_db
[
{
  "name": "linstor_db",
  "node-id": 1,
  "role": "Secondary",
  "suspended": false,
  "suspended-user": false,
  "suspended-no-data": false,
  "suspended-fencing": false,
  "suspended-quorum": false,
  "force-io-failures": false,
  "write-ordering": "flush",
  "devices": [
    {
      "volume": 0,
      "minor": 1002,
      "disk-state": "UpToDate",
      "client": false,
      "open": false,
      "quorum": true,
      "size": 208812,
      "read": 0,
      "written": 20,
      "al-writes": 0,
      "bm-writes": 0,
      "upper-pending": 0,
      "lower-pending": 0
    } ],
  "connections": [
    {
      "peer-node-id": 0,
      "name": "virtual1",
      "connection-state": "Connected",
      "congested": false,
      "peer-role": "Primary",
      "tls": false,
      "ap-in-flight": 0,
      "rs-in-flight": 0,
      "paths": [
        {
          "this_host": {
            "address": "10.0.0.5",
            "port": 7002,
            "family": "ipv4"
          },
          "remote_host": {
            "address": "10.0.0.1",
            "port": 7002,
            "family": "ipv4"
          },
          "established": true
        } ],
      "peer_devices": [
        {
          "volume": 0,
          "replication-state": "Established",
          "peer-disk-state": "UpToDate",
          "peer-client": false,
          "resync-suspended": "no",
          "received": 20,
          "sent": 0,
          "out-of-sync": 0,
          "pending": 0,
          "unacked": 0,
          "has-sync-details": false,
          "has-online-verify-details": false,
          "percent-in-sync": 100.00
        } ]
    } ]
}
]

root@virtual5:~# echo $?
0

So I’ll need to check:

  • is this the status seen when drbd-reactor starts?
  • is the JSON being parsed successfully?
  • otherwise, is a drbd “may promote” event being received?

I replaced drbdsetup with a wrapper script to capture its output.
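
For reference, this is roughly what the wrapper looked like. A minimal sketch, with the real binary moved aside to /usr/sbin/drbdsetup.real and a log path of my own choosing; only “status” calls are logged so the long-running events2 stream that drbd-reactor keeps open is passed through untouched:

#!/bin/bash
# Hypothetical wrapper installed as /usr/sbin/drbdsetup.
# The real binary has been moved to /usr/sbin/drbdsetup.real.
REAL=/usr/sbin/drbdsetup.real
LOG=/var/log/drbdsetup-wrapper.log
if [ "$1" = "status" ]; then
    # Record the arguments and the output, then pass both output and exit code through.
    echo "$(date -Is) drbdsetup $*" >> "$LOG"
    "$REAL" "$@" | tee -a "$LOG"
    exit "${PIPESTATUS[0]}"
fi
# Anything else (notably the events2 stream) goes straight through.
exec "$REAL" "$@"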

At the time drbd-reactor starts during machine boot, this is the status output:

[
{
  "name": "linstor_db",
  "node-id": 1,
  "role": "Secondary",
  "suspended": false,
  "suspended-user": false,
  "suspended-no-data": false,
  "suspended-fencing": false,
  "suspended-quorum": false,
  "force-io-failures": false,
  "write-ordering": "flush",
  "devices": [
    {
      "volume": 0,
      "minor": 1002,
      "disk-state": "Outdated",
      "client": false,
      "open": false,
      "quorum": false,
      "size": 208812,
      "read": 0,
      "written": 0,
      "al-writes": 0,
      "bm-writes": 0,
      "upper-pending": 0,
      "lower-pending": 0
    } ],
  "connections": [
    {
      "peer-node-id": 0,
      "name": "virtual1",
      "connection-state": "Connecting",
      "congested": false,
      "peer-role": "Unknown",
      "tls": false,
      "ap-in-flight": 0,
      "rs-in-flight": 0,
      "paths": [
        {
          "this_host": {
            "address": "10.0.0.5",
            "port": 7002,
            "family": "ipv4"
          },
          "remote_host": {
            "address": "10.0.0.1",
            "port": 7002,
            "family": "ipv4"
          },
          "established": false
        } ],
      "peer_devices": [
        {
          "volume": 0,
          "replication-state": "Off",
          "peer-disk-state": "DUnknown",
          "peer-client": false,
          "resync-suspended": "no",
          "received": 0,
          "sent": 0,
          "out-of-sync": 0,
          "pending": 0,
          "unacked": 0,
          "has-sync-details": false,
          "has-online-verify-details": false,
          "percent-in-sync": 100.00
        } ]
    } ]
}
]

The peer-role is Unknown because the connection-state is still Connecting, so get_primary() returns PrimaryOn::None rather than PrimaryOn::Remote, and try_initial_target_start() falls through to true: the reactor goes ahead and tries to start the target. (The local disk is also still Outdated with no quorum at that point, which is why the promotion attempt itself fails with “Need access to UpToDate data”.)

I guess this could be fixed by delaying the startup of drbd-reactor slightly, to give the DRBD connection time to come up. In any case, I’m happy that I now understand what’s causing this behaviour.


Solution: systemctl edit drbd-reactor.service

[Service]
# Allow time to notice peer is primary
ExecStartPre=/bin/sleep 40

(I found that the DRBD connection is established around 35 seconds after the network comes up; this might be due to STP on the upstream switch, which I’ll check.)
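
One more note: the sleep only avoids the failed promotion on future boots. A failed unit left over from an earlier boot can be cleared without rebooting:

systemctl reset-failed drbd-promote@linstor_db.service

after which systemctl is-system-running should go back to reporting “running”, assuming nothing else on the node has failed.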