Error: Failed to query free space from storage pool

Since updating the LINSTOR controller and satellite from 1.27.0 to 1.29.0, we are facing issues when creating VMs in CloudStack. The LINSTOR node servers show increased CPU load caused by queries that LINSTOR runs. We checked the logs and found the following queries being issued by LINSTOR:

root@satellite-1:/# systemctl status linstor-satellite.service
● linstor-satellite.service - LINSTOR Satellite Service
 Loaded: loaded (/lib/systemd/system/linstor-satellite.service; enabled; vendor preset: enabled)
 Active: active (running) since Tue 2024-08-20 19:44:38 IST; 45min ago
Main PID: 1914 (java)
  Tasks: 169 (limit: 115718)
 Memory: 606.8M
    CPU: 8min 53.014s
 CGroup: /system.slice/linstor-satellite.service
         ├─ 1914 java -Xms32M -classpath "/usr/share/linstor-server/lib/conf:/usr/share/linstor-server/lib/*" com.linbit.linstor.core.Satellite --logs=/var/log/linstor-satellite --config-directory=/>
         ├─ 2630 drbdsetup events2 all
         ├─13105 vgs -o vg_name,vg_extent_size,vg_size,vg_free --separator ";" --units k --noheadings --nosuffix
         ├─13117 vgs -o vg_name,vg_extent_size,vg_size,vg_free --separator ";" --units k --noheadings --nosuffix
         ├─13138 vgs -o vg_name,vg_extent_size,vg_size,vg_free --separator ";" --units k --noheadings --nosuffix
         ├─13157 vgs -o vg_name,vg_extent_size,vg_size,vg_free --separator ";" --units k --noheadings --nosuffix
         ├─13180 vgs -o vg_name,vg_extent_size,vg_size,vg_free --separator ";" --units k --noheadings --nosuffix
         ├─13221 vgs -o vg_name,vg_extent_size,vg_size,vg_free --separator ";" --units k --noheadings --nosuffix
         ├─13230 vgs -o vg_name,vg_extent_size,vg_size,vg_free --separator ";" --units k --noheadings --nosuffix
         ├─13324 vgs -o vg_name,vg_extent_size,vg_size,vg_free --separator ";" --units k --noheadings --nosuffix
         ├─13367 vgs -o vg_name,vg_extent_size,vg_size,vg_free --separator ";" --units k --noheadings --nosuffix
         └─13373 vgs -o vg_name,vg_extent_size,vg_size,vg_free --separator ";" --units k --noheadings --nosuffix

Aug 20 20:21:30 satellite-1 Satellite[1914]: 2024-08-20 20:21:30.299 [MainWorkerPool-8] ERROR LINSTOR/Satellite/0000f9 SYSTEM - Failed to query free space from storage pool [Report number 66C4A4CF-47558>
Aug 20 20:23:41 satellite-1 Satellite[1914]: 2024-08-20 20:23:41.168 [MainWorkerPool-8] ERROR LINSTOR/Satellite/0000f9 SYSTEM - Failed to query free space from storage pool [Report number 66C4A4CF-47558>
Aug 20 20:23:41 satellite-1 Satellite[1914]: 2024-08-20 20:23:41.195 [DeviceManager] INFO LINSTOR/Satellite/0b3bb5 SYSTEM - Lv copy created cs-83d26097-8980-493d-a5df-2de219df0eb4_00000/cs-934eb77a-1d2>
Aug 20 20:23:41 satellite-1 Satellite[1914]: 2024-08-20 20:23:41.195 [DeviceManager] INFO LINSTOR/Satellite/0b3bb5 SYSTEM - Volume number 0 of resource 'cs-83d26097-8980-493d-a5df-2de219df0eb4' [LVM-Th>
Aug 20 20:23:41 satellite-1 Satellite[1914]: 2024-08-20 20:23:41.413 [MainWorkerPool-9] INFO LINSTOR/Satellite/0000fb SYSTEM - SpaceInfo: DfltDisklessStorPool → 9223372036854775807/9223372036854775807
Aug 20 20:25:51 satellite-1 Satellite[1914]: 2024-08-20 20:25:51.580 [MainWorkerPool-9] ERROR LINSTOR/Satellite/0000fb SYSTEM - Failed to query free space from storage pool [Report number 66C4A4CF-47558>
Aug 20 20:28:01 satellite-1 Satellite[1914]: 2024-08-20 20:28:01.691 [MainWorkerPool-9] ERROR LINSTOR/Satellite/0000fb SYSTEM - Failed to query free space from storage pool [Report number 66C4A4CF-47558>
Aug 20 20:28:02 satellite-1 Satellite[1914]: 2024-08-20 20:28:02.123 [DeviceManager] INFO LINSTOR/Satellite/ SYSTEM - End DeviceManager cycle 35
Aug 20 20:28:02 satellite-1 Satellite[1914]: 2024-08-20 20:28:02.322 [DeviceManager] INFO LINSTOR/Satellite/e775f3 SYSTEM - Begin DeviceManager cycle 36
Aug 20 20:28:03 satellite-1 Satellite[1914]: 2024-08-20 20:28:03.303 [MainWorkerPool-9] INFO LINSTOR/Satellite/0000ff SYSTEM - SpaceInfo: DfltDisklessStorPool → 9223372036854775807/9223372036854775807
root@satellite-1:/#

Logs

root@master:/var/log/linstor-satellite# cat ErrorReport-66C4A313-E4C91-000001.log
ERROR REPORT 66C4A313-E4C91-000001

============================================================

Application: LINBIT® LINSTOR
Module: Satellite
Version: 1.29.0
Build ID: b2be7208a777f0743d4c7187062678cd5416fccf
Build time: 2024-07-31T11:02:51+00:00
Error time: 2024-08-20 20:23:46
Node: master
Thread: MainWorkerPool-15

============================================================

Reported error:
===============

Category: RuntimeException
Class name: ApiRcException
Class canonical name: com.linbit.linstor.core.apicallhandler.response.ApiRcException
Generated at: Method 'getStoragePoolSpaceInfoOrError', Source file 'StltApiCallHandlerUtils.java', Line #297
Error message: Failed to query free space from storage pool

This is interesting because the release notes describe changes to how LINSTOR queries LVM:

With this new version we bring 2 really big performance improvements:
1. All satellite read-only internal API calls now run concurrently.
    This means commands like linstor r l or v l don't need to wait for the device manager
    to finish on a satellite and should always return within a few seconds.
    Before if a satellite had a high amount of resources and it needed to adjust them all
    it could take a few minutes until the whole controller/satellite was responsive again.
2. We reduced the amount of LVM command calls by around 45%.
    This helps on systems where there are a lot of logical volumes and LVM begins to struggle.

There are also some minor fixes/tweaks which you can read below.

## [1.29.0] - 2024-07-31

### Changed

- Improved responsiveness: Allow some API calls to be executed concurrently on the satellite
- Added a cache for lvs and vgs/pvs commands, reducing the need to call them
- Changed logback log format to include logid and full timestamp

### Fixed

- Snap, create: Fixed possible deadlock
- Incorrect resource definition already exists error message
- sos-report: Controller not always dumping full log content

We will keep an eye out for this.

How often is this happening? Is it on many nodes, or just one? Have you been able to clear this up, and how?
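One way to put a number on "how often" is to count the error lines in the satellite journal per hour. A minimal sketch, assuming bash and journalctl's default short output format (month, day and HH:MM:SS as the first three fields, as in the systemctl status excerpt above); count_free_space_errors is a hypothetical helper, not a LINSTOR tool:

```shell
#!/usr/bin/env bash
# Sketch: count "Failed to query free space" errors per hour from satellite
# journal lines read on stdin. count_free_space_errors is a hypothetical
# helper. It assumes journalctl's default "short" format, where the first
# three fields are month, day and HH:MM:SS.
count_free_space_errors() {
  grep -F 'Failed to query free space from storage pool' \
    | awk '{ print $1, $2, substr($3, 1, 2) ":00" }' \
    | sort | uniq -c
}

# Usage on a node running the satellite:
#   journalctl -u linstor-satellite.service --since today | count_free_space_errors
```

Each output line is a count followed by the hour bucket, so running it on every node shows whether the errors are spread across the cluster or concentrated on one node.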

Does running the query command manually take a long time to return?

vgs -o vg_name,vg_extent_size,vg_size,vg_free --separator ";" --units k --noheadings --nosuffix
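To check this, you could time a few consecutive runs of that exact query on a node that hosts an LVM storage pool. A minimal sketch, assuming bash, GNU date (for `%N`) and root privileges; elapsed_ms is a hypothetical helper:

```shell
#!/usr/bin/env bash
# Sketch: time the vgs query that the LINSTOR satellite runs (command copied
# from the process list in this thread). elapsed_ms is a hypothetical helper,
# not part of LINSTOR or LVM. Requires GNU date (%N), as on Ubuntu.

# Run a command, discard its output, print its wall-clock duration in ms,
# and return the command's own exit status.
elapsed_ms() {
  local start end rc
  start=$(date +%s%N)
  "$@" > /dev/null 2>&1
  rc=$?
  end=$(date +%s%N)
  echo $(( (end - start) / 1000000 ))
  return "$rc"
}

# Only attempt the real query where the LVM tools are installed.
if command -v vgs > /dev/null 2>&1; then
  for run in 1 2 3; do
    if ms=$(elapsed_ms vgs -o vg_name,vg_extent_size,vg_size,vg_free \
        --separator ";" --units k --noheadings --nosuffix); then
      echo "vgs run ${run}: ${ms} ms"
    else
      echo "vgs run ${run}: ${ms} ms (vgs failed; try running as root)"
    fi
  done
fi
```

If single runs take seconds rather than milliseconds, or occasionally hang, that points at LVM itself being slow or blocked rather than at LINSTOR's polling.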

We did not face this issue with version 1.27. We have observed that the issue occurs, and the load increases, during VM creation from CloudStack. We have 3 nodes (1 combined controller/satellite and 2 satellites) and use a disaggregated architecture, with the other KVM hosts configured as diskless nodes. The issue is sometimes cleared by rebooting the node.

Is there any way to get the binaries for version 1.27 (Ubuntu)?

Could you confirm whether this command should be run on the controller or on the satellite nodes?
vgs -o vg_name,vg_extent_size,vg_size,vg_free --separator ";" --units k --noheadings --nosuffix

Do you have a proper global_filter set in your lvm.conf?

It should be something like:
global_filter = [ "r|^/dev/drbd|" ]
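For reference, the option belongs inside the devices section of /etc/lvm/lvm.conf (the usual path on Ubuntu, assumed here). A minimal sketch of that fragment: the pattern rejects any device whose path starts with /dev/drbd, and devices that match no pattern remain accepted, so the local backing disks are still scanned. If the file already has a devices section, add the line there rather than creating a second one.

```
devices {
	# Hide DRBD devices from LVM scanning so that a suspended DRBD device
	# (for example during a snapshot) cannot block other LVM commands.
	global_filter = [ "r|^/dev/drbd|" ]
}
```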

Currently it is using the default. Do we need to change this setting to global_filter = [ "r|^/dev/drbd|" ]?

# Configuration option devices/global_filter.
# Limit the block devices that are used by LVM system components.
# Because devices/filter may be overridden from the command line, it is
# not suitable for system-wide device filtering, e.g. udev.
# Use global_filter to hide devices from these LVM system components.
# The syntax is the same as devices/filter. Devices rejected by
# global_filter are not opened by LVM.
# This configuration option has an automatic default value.
# global_filter = [ "a|.*|" ]

Yes, it is highly recommended to do this on all nodes that use LVM and DRBD.
Otherwise, any DRBD suspend (for example, during snapshots) will block all other LVM operations.
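To confirm the setting on each node, a quick check is to look for an uncommented global_filter line. A minimal sketch, assuming bash and /etc/lvm/lvm.conf as the config path; active_global_filter is a hypothetical helper:

```shell
#!/usr/bin/env bash
# Sketch: show whether lvm.conf contains an active (uncommented)
# global_filter line. active_global_filter is a hypothetical helper, and
# /etc/lvm/lvm.conf is assumed to be the config path (usual on Ubuntu).
# Run it on every node that uses LVM and DRBD.
active_global_filter() {
  local conf="${1:-/etc/lvm/lvm.conf}"
  # Lines starting with optional whitespace then "global_filter =";
  # commented lines such as "# global_filter = ..." do not match.
  grep -E '^[[:space:]]*global_filter[[:space:]]*=' "$conf"
}

if filter=$(active_global_filter 2> /dev/null); then
  echo "active: ${filter}"
else
  echo "no active global_filter found (or lvm.conf is missing); LVM uses its default"
fi
```

This is a plain text match: a filter array split over several lines only shows its first line. Where available, `lvmconfig devices/global_filter` asks LVM itself for the value it parsed.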

Is there any LINSTOR documentation about this? This configuration step is not mentioned in the LINSTOR docs. Also, sometimes VMs are created normally (within 2-5 minutes); other times creation takes much longer and the CPU load increases.

It is currently being added.
We tried to handle this within LINSTOR, but often encountered situations where that wasn't enough, and the only way to fix it was to add the global_filter option.

After changing the global_filter setting, it is currently working fine. We are monitoring it and will update if any issue arises.
