Verify consistently fails after rebooting secondary node

(I also sent this to the user list but am copying here since that list is being deprecated)

We are observing the following issue with resync after reboot.

After rebooting a secondary node (in a 2 or 3 node cluster), the
secondary successfully connects to primary and reports UpToDate, but
when a verify is launched on the secondary node that was rebooted, it reports
out of sync blocks.

This was initially detected when we promoted a secondary node and it came up with
disk corruption. We traced this to the reboot occurring before the promotion.

At this point we’ve modified our system to invalidate the local disk when the node is rebooted, requiring a full resync.

We have not been able to confirm corruption when no connection is made back to
another node after the reboot, but this is harder to validate because the system may boot with corruption.

What expectations should we have for integrity on a shutdown? Reboot? Power loss?

Versions

The attached logs are from the 9.2.12 version of the driver on the 5.15.173 kernel,
but we have also observed this issue with the 9.2.4 driver on the 5.15.166 kernel.

Attachments

These can be found in the original email on the mailing list, since I can’t attach text documents to this forum.

https://lists.linbit.com/pipermail/drbd-user/2024-December/026680.html

initsyncandverify_noreboot.txt - drbd logs from the system prior to reboot, including
a verify before the reboot

verify_after_invalidate_no_reset.txt - drbd logs after reboot, showing the initial failed
verify, then an invalidate, then a successful verify

dynamic.res - drbd conf file - note use of separate metadata disk

Secondary Bring Up

Secondary nodes enable the drbd “persist” resource as follows:

"""
da up all || true
da secondary persist || true
da disconnect persist || true
da -- --discard-my-data connect persist || true
"""


Thanks for bringing this over from the mailing list.

What kind of hardware is used in the storage stack here? Are there any write caches or buffers below DRBD?

I haven’t been able to recreate this in the few hours I’ve been testing on some VMs, but I’m still trying.

In the past there have been some tight race conditions that result in false positives during online verification. To rule out a false positive, manual checks can be done on the reported blocks.

For example, from your logs the line shows where on disk DRBD found out of sync blocks:

[  558.196621] drbd persist/0 drbd0 TestVerify111-3: Out of sync: start=12589488, size=8 (sectors)

You can use dd and xxd to compare the data on disk at those positions. DRBD always reports in sectors, so use a 512-byte block size. Use the start sector from the log as the skip count, 512 as the bs (block size), and size as the count in a dd command, then pipe the output to xxd to view the data on disk at those sectors. Use the DRBD backing device as the infile (if=).

Something like this, using values from your log line above:

dd if=/dev/sdb skip=12589488 bs=512 count=8 status=none 2>&1 | xxd
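
If the verify reports many out-of-sync ranges, building each command by hand gets tedious. The following is a minimal sketch, not an official tool, that reads kernel log lines on stdin and prints one dd | xxd command per "Out of sync" line, following the mapping described above (start becomes skip, size becomes count, bs is 512). The function name oos_to_dd is made up for this example, and the backing device you pass in (here /dev/sdb) is an assumption that has to match your own setup.

```shell
#!/bin/sh
# Hypothetical helper: convert DRBD "Out of sync" log lines into dd | xxd commands.
# $1 is the DRBD backing device (assumption: replace with your own lower-level device).
# The commands are only printed, never executed, so they can be reviewed first.
oos_to_dd() {
    dev=$1
    # Extract "start size" (both in 512-byte sectors) from matching lines only.
    sed -n 's/.*Out of sync: start=\([0-9][0-9]*\), size=\([0-9][0-9]*\) (sectors).*/\1 \2/p' |
    while read -r start size; do
        printf 'dd if=%s skip=%s bs=512 count=%s status=none | xxd\n' \
            "$dev" "$start" "$size"
    done
}
```

For example, dmesg | oos_to_dd /dev/sdb prints the commands for every range the kernel logged; lines that do not match the pattern are ignored.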

You’ll be able to see the binary data as hex, which can then be compared against the same section of disk from the peer. You can “get fancy” and use ssh and diff to make it even easier to see the differences:

diff -W 140 -y \
    <(dd if=/dev/vdb skip=8000000 bs=512 count=1 status=none 2>&1 | xxd) \
    <(ssh root@192.168.222.21 'dd if=/dev/vdb skip=8000000 bs=512 count=1 status=none 2>&1 | xxd')

00000000: d76a 3c38 5474 d507 c595 9f22 d5bc ecbf  .j<8Tt....."....  |  00000000: 0010 8001 0000 0000 23f0 6acc 8e82 4e0d  ........#.j...N.
00000010: 9d26 b121 a674 3aa4 e435 a339 fb62 c79a  .&.!.t:..5.9.b..  |  00000010: 04de 7b61 1c6a 9704 c07b 2d9d 85bc a00c  ..{a.j...{-.....
00000020: 8ff9 ec94 ca0b 608a 573c 2e45 fe5f c05b  ......`.W<.E._.[  |  00000020: 78af c5f5 f948 9115 efb5 fc66 3c2e 7d0c  x....H.....f<.}.
00000030: aa75 793e 3fcd 75cb 15dc 40f8 1cbd 9636  .uy>?.u...@....6  |  00000030: 4207 0000 0000 0000 c934 6107 0000 0000  B........4a.....
00000040: 6a3c a6db bbe4 53b4 2969 0035 4bfa a96b  j<....S.)i.5K..k  |  00000040: 0010 569c 0100 0000 0b0a 1144 2f54 bc11  ..V........D/T..
00000050: 9205 dd17 d667 87ff b984 7fe1 f393 e33b  .....g.........;  |  00000050: 41a1 7c63 7de8 ac17 2814 cf5b 51eb 3d1b  A.|c}...(..[Q.=.
00000060: fdac 3fdf 0997 7068 205e 2d8d aeb1 2774  ..?...ph ^-...'t  |  00000060: 85e2 65a1 820f d710 503c ea62 3d91 bc0e  ..e.....P<.b=...
00000070: 1a36 5115 2acb 9892 8ebc 9cba 0763 d47d  .6Q.*........c.}  |  00000070: 8a47 358e b2e0 680b f1a8 01ae fbd4 590b  .G5...h.......Y.
00000080: a4b2 b26e 486a 0eef 148a cfb3 9dac 1083  ...nHj..........  |  00000080: 0010 6018 0100 0000 a3f6 99f9 cf3b 2b14  ..`..........;+.
00000090: ff55 88a5 d759 cef3 7eb8 b458 00c6 80ce  .U...Y..~..X....  |  00000090: d4be e103 ad6a 5216 da37 12c1 64cb 3f00  .....jR..7..d.?.
000000a0: b86a cef0 6dab b2d9 7497 2da4 80d6 831a  .j..m...t.-.....  |  000000a0: fb46 357c e398 cb12 df28 09cc 01b5 0404  .F5|.....(......
000000b0: 4586 c744 9e7b 7773 392a f9b8 a4e1 3bf5  E..D.{ws9*....;.  |  000000b0: 1ba5 11a5 9bb0 5818 a3b4 14a2 8a43 ba03  ......X......C..
000000c0: 2cfa cd73 6359 e9c7 4723 1c68 9c37 fd23  ,..scY..G#.h.7.#  |  000000c0: 0010 0ac4 0000 0000 d222 f453 f461 a50d  .........".S.a..
000000d0: 5a01 9a1c 2634 7f88 0939 c3f5 f81d 72da  Z...&4...9....r.  |  000000d0: 5a84 1579 4482 f817 8bb0 f5cc bdd3 d211  Z..yD...........
000000e0: 2efa 34e7 65d1 6e7d f63e 002c a3db f198  ..4.e.n}.>.,....  |  000000e0: 1136 59e1 fc73 6805 c2a6 2201 d39d e510  .6Y..sh...".....
000000f0: 80e4 3d96 0383 72b5 327d 7941 bd3d 5f5d  ..=...r.2}yA.=_]  |  000000f0: d854 c30c 2933 011a 9b6a 2c77 03a0 9305  .T..)3...j,w....
00000100: 7b81 888f e598 080b f555 5c96 ea55 40ee  {........U\..U@.  |  00000100: 0010 784b 0100 0000 aa81 8940 419a 550a  ..xK.......@A.U.
00000110: c4af 0413 8b23 206e e53d 518a 3e9b c507  .....# n.=Q.>...  |  00000110: 3530 3c67 0313 501d 0606 cd54 c24e 5b02  50<g..P....T.N[.
00000120: 65fb 3b57 aa5a 3d29 e76b 0a40 96f8 a4a3  e.;W.Z=).k.@....  |  00000120: c0a0 96c7 713f 9419 18d4 92e8 22a4 b918  ....q?......"...
00000130: 2d73 9b5d 3a8c 5de5 5c79 8058 de8d 91ff  -s.]:.].\y.X....  |  00000130: 835a 06f3 3ac0 ad17 504b 1f31 843e ad17  .Z..:...PK.1.>..
00000140: 9256 518b 44a5 1713 fef3 7cc2 eb2b aa7e  .VQ.D.....|..+.~  |  00000140: 0010 4a49 0000 0000 2d7d 5a27 aa01 a20b  ..JI....-}Z'....
00000150: 779d 7fef 8659 8258 5319 579f 095c ec8d  w....Y.XS.W..\..  |  00000150: a5cf 5406 882c 9a05 f419 f898 6642 b212  ..T..,......fB..
00000160: 390c a5f2 3786 dc25 9d83 7646 bdb9 97ef  9...7..%..vF....  |  00000160: 3e03 25c6 d0fb ce09 67a0 2597 677c 8302  >.%.....g.%.g|..
00000170: d0e7 80b7 ede2 c228 2e51 389c d9f1 57cd  .......(.Q8...W.  |  00000170: 0c34 b122 faa3 2d17 8126 50aa 26a3 3c0a  .4."..-..&P.&.<.
00000180: 4094 fc76 dd77 c7c0 e701 23ac 6f14 d4ea  @..v.w....#.o...  |  00000180: 0010 1a2c 0000 0000 9a30 d9d5 e206 3110  ...,.....0....1.
00000190: acab 7f95 62e2 0c93 6a1c 60a5 c34d d511  ....b...j.`..M..  |  00000190: 1326 6ec2 efb5 4a14 c244 44e5 8675 5609  .&n...J..DD..uV.
000001a0: 19e9 bfb3 3d67 64ea 5f67 15d9 9ac7 99ca  ....=gd._g......  |  000001a0: 9888 47ba 0e5c 351f 13f1 fc12 5e4e 070c  ..G..\5.....^N..
000001b0: f907 fa38 37fe cb72 8bc9 c7b8 a8ab d63a  ...87..r.......:  |  000001b0: 221e d649 4de0 1609 c4c3 299a 1e37 6a03  "..IM.....)..7j.
000001c0: bf0b 3c43 aaa2 4598 b635 b1c2 fe3b 77f1  ..<C..E..5...;w.  |  000001c0: 0010 7e2a 0100 0000 0f67 f041 88c6 2714  ..~*.....g.A..'.
000001d0: cdf4 9709 3463 fd82 5487 c816 b009 2d26  ....4c..T.....-&  |  000001d0: e18c b6d4 d8d7 e918 9c51 26d4 9fd0 170d  .........Q&.....
000001e0: 2b7d a4ad 1e83 4985 95a7 de75 bccb 439b  +}....I....u..C.  |  000001e0: 33ca b6d1 0010 971d 4659 1db5 2479 5712  3.......FY..$yW.
000001f0: b713 5371 2728 96b9 48f7 7087 235a d945  ..Sq'(..H.p.#Z.E  |  000001f0: 28ab 00ea 9594 221d 6515 ac67 925d 711d  (.....".e..g.]q.

Example with out of sync blocks.
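
When there are many ranges to check, comparing a digest of each range can be quicker than reading hex side by side. This is a rough sketch under the same assumptions as the dd commands above (512-byte sectors, GNU dd and sha256sum available); region_sum is a hypothetical helper name, not part of DRBD.

```shell
#!/bin/sh
# Hypothetical helper: print the SHA-256 digest of COUNT 512-byte sectors
# starting at sector START on device (or file) DEV.
region_sum() {
    dev=$1
    start=$2
    count=$3
    dd if="$dev" skip="$start" bs=512 count="$count" status=none |
        sha256sum | cut -d' ' -f1
}
```

Run it with the same start and size on both nodes, for example region_sum /dev/sdb 12589488 8, and compare the two digests. Matching digests for a range that verify flagged would point toward a false positive; differing digests mean the data on disk really does differ.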