Running drbd-reactorctl status returns an error message when using LINSTOR HA
I followed the documentation:
https://linbit.com/drbd-user-guide/linstor-guide-1_0-cn/#s-linstor_ha
I am not sure if this error caused linstor-gateway to fail to create the iSCSI target.
Here is the linstor-gateway output (command: linstor-gateway iscsi create iqn.2025-01.rcr.test:ss 172.17.0.0/8 2G -r iso_res_group --loglevel debug):
DEBU[0000] {"iqn":"iqn.2025-01.rcr.test:info","resource_group":"iso_res_group","volumes":[{"number":1,"size_kib":2097152,"file_system_root_owner":{"User":"","Group":""}}],"service_ips":["172.17.0.0/8"],"status":{"state":"Unknown","service":"Stopped","primary":"","nodes":null,"volumes":null},"gross_size":false,"implementation":""}
DEBU[0000] curl -X 'POST' -d '{"iqn":"iqn.2025-01.rcr.test:info","resource_group":"iso_res_group","volumes":[{"number":1,"size_kib":2097152,"file_system_root_owner":{"User":"","Group":""}}],"service_ips":["172.17.0.0/8"],"status":{"state":"Unknown","service":"Stopped","primary":"","nodes":null,"volumes":null},"gross_size":false,"implementation":""}
' -H 'Accept: application/json' -H 'Content-Type: application/json' -H 'User-Agent: linstor-gateway/1.7.0-g6e676b4f35e3e2b90cffb32637e44e16ae3c0559' 'http://localhost:8080/api/v2/iscsi'
DEBU[0000] Status code not within 200 to 400, but 400 (Bad Request)
ERRO[0000] failed to create iscsi resource: failed to retrieve existing configs: failed to fetch file list: Get "http://localhost:3370/v1/files?content=true&limit=0&offset=0": dial tcp [::1]:3370: connect: connection refused
And the journalctl -xe log:
It looks like the linstor-controller is not actually running.
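You can confirm that on the node that should currently be active, for example (a minimal sketch, assuming the default unit name and REST port 3370):
# check the controller unit and whether anything is listening on the REST port
systemctl status linstor-controller.service
ss -tlnp | grep 3370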
The error of drbd-reactorctl is interesting. Can you show the output of:
systemctl show --property=FreezerState drbd-promote@linstor_db.service
systemctl show --property=FreezerState var-lib-linstor.mount
For the gateway error: you need to add all possible controller URLs to the /etc/linstor-gateway/linstor-gateway.toml file:
[linstor]
controllers = ["10.10.1.1", "10.10.1.2", "10.10.1.3"]
(Use the right DNS names/IP addresses for your nodes), then restart the linstor-gateway service.
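For example (a sketch; the unit name may differ on your distribution, and check-health may not exist in older linstor-gateway versions):
# after editing /etc/linstor-gateway/linstor-gateway.toml
systemctl restart linstor-gateway.service
# optionally verify the gateway can reach a controller
linstor-gateway check-health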
Still the same error, regardless of whether /etc/linstor-gateway/linstor-gateway.toml is added or not. I can see the corresponding logs in the linstor-controller, but the final result is still a failure.
Have you seen this error?
Jan 10 16:09:43 node1 ocf-rs-wrapper[5496]: Jan 10 16:09:43 INFO: Running start for /dev/drbd/by-res/ss/0 on /srv/ha/internal/ss
Jan 10 16:09:43 node1 ocf-rs-wrapper[5496]: Jan 10 16:09:43 ERROR: There is one or more mounts mounted under /srv/ha/internal/ss.
Jan 10 16:09:43 node1 ocf-rs-wrapper[5492]: ERROR [ocf_rs_wrapper] Filesystem:fs_cluster_private_ss,s-a-m,start: FAILED with exit code 6
Is there already something mounted in /srv/ha/internal/ss?
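For example, you could check with (a minimal sketch; findmnt is part of util-linux):
# list the mount point and anything mounted below it
findmnt --submounts /srv/ha/internal/ss
# or look for the path directly in the kernel's mount table
grep /srv/ha/internal /proc/mounts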
I found that there are two ExecStop lines in the [Service] section of drbd-promote@linstor_db.service. Is this the reason for the abnormal output of drbd-reactorctl?
I’m noticing this earlier shared journal output is from node1:
Jan 10 16:09:26 node1 ocf-rs-wrapper[5362]: Jan 10 16:09:26 INFO: Running start for /dev/drbd/by-res/ss/0 on /srv/ha/internal/ss
Jan 10 16:09:26 node1 ocf-rs-wrapper[5362]: Jan 10 16:09:26 ERROR: There is one or more mounts mounted under /srv/ha/internal/ss.
Jan 10 16:09:26 node1 ocf-rs-wrapper[5358]: ERROR [ocf_rs_wrapper] Filesystem:fs_cluster_private_ss,s-a-m,start: FAILED with exit code 6
Jan 10 16:09:26 node1 systemd[1]: ocf.rs@fs_cluster_private_ss.service: Main process exited, code=exited, status=6/NOTCONFIGURED
But the screenshot you shared for /srv is from node2.
It seems like promotion was attempted on node1 based on this output (which would necessarily preclude promotion on node2), but I am curious: what might be mounted under /srv/ha/internal/ss on node1?
The results of node1 and node2 are the same. Sorry, I forgot to post the results related to node1 before.
If you guys need more logs or other information, feel free to contact me anytime.
In the source code for the Filesystem resource agent which is producing that error, I see that it checks /proc/mounts and /etc/mtab to determine whether there are existing mounts under the mount point.
After reconfirming that the "There is one or more mounts mounted" error appears in the journalctl output after attempting another linstor-gateway iscsi create command (using a different unique name in your IQN and a different IP address within your chosen subnet), do you find the specified path when you cat either /proc/mounts or /etc/mtab on the node that shows the error in its system journal?
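For example, right after the failing create attempt, something like this on the node that logged the error (a sketch):
# check both tables the resource agent consults for that path
grep /srv/ha/internal /proc/mounts /etc/mtab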
New linstor-gateway cmd: linstor-gateway iscsi create iqn.2025-02.rcr.test:info1 172.17.1.0/23 2G -r iso_res_group --loglevel debug
And journalctl output:
And /proc/mounts, with timestamps at 1-second intervals:
And /etc/mtab, also with timestamps at 1-second intervals:
Additionally, I found some logs in dmesg -kT; I don’t know if these will help.
I see that you’ve opened and closed an issue in the DRBD Reactor GitHub repository regarding the FreezerState; I wanted to link it back here for future forum users:
https://github.com/LINBIT/drbd-reactor/issues/14
Regarding troubleshooting the iscsi creation error, it doesn’t seem like there’s an obvious culprit in your mounts via the methods we’ve used so far, so I would suggest a more granular review of the Filesystem resource agent logic, which is where we are seeing the fatal error in your case.
I recommend modifying the resource agent script itself to temporarily add set -x to do this. Navigate to /usr/lib/ocf/resource.d/heartbeat (or wherever your resource agent directory is) on each node, and after making a backup copy of the Filesystem file there, modify the original file to add set -x on a new line at the beginning of the script; I have had success placing it above the #Defaults line.
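For example (a sketch, assuming the default resource agent path):
# on each node
cd /usr/lib/ocf/resource.d/heartbeat
cp Filesystem Filesystem.bak      # keep a backup copy first
# then edit the original Filesystem file and add a line containing only
# 'set -x' near the top of the script, e.g. above the '#Defaults' line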
After you’ve modified those files, run the linstor-gateway iscsi create command again so the actions of the resource agent are captured in the system journal. You can then find the resource agent’s output in the system journal, which should provide more insight into which step is failing and why.
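For example (a sketch; the syslog identifier is taken from the earlier journal lines):
# follow the resource agent wrapper messages during the next create attempt
journalctl -f -t ocf-rs-wrapper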
Thanks, I’ll try it later.
Finally, after manually creating the folder, everything was ok.
mkdir -p /srv/ha/internal
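A quick way to confirm the directory now exists before retrying the create (a sketch):
# verify the parent directory for the Gateway mount point is present
ls -ld /srv/ha/internal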