[DRBD-user] Various DRBD issues (bugs?)

Matyas Koszik koszik at atw.hu
Fri May 21 02:43:18 CEST 2010

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.



Hello,


I'm using drbd for a while now, but unfortunately I've ran into some trouble.

I'm using 8.3.7 on 2.6.32-3-amd64 on both nodes.


The first issue is that some of my drbd sessions keep timing out, while the others work fine most of the time.
The network itself seems fine (I see every packet on both sides), I'm using stock debian kernels with standard e1000
controllers, nothing fancy here. These are the less utilised sessions, the one with high traffic doesn't exhibit this
problem so often. These are also fine when there's a resync in progress. It's odd that it takes only 15 seconds to
drop a connection due to not receiving PingAck, when ping-int(10s) + ping-timeout(10s) is 20 seconds.


Log from the second node (10.0.0.38):

13:53:44 info kernel: [1179017.992730] block drbd4: Resync done (total 11 sec; paused 0 sec; 0 K/sec)
13:53:44 info kernel: [1179017.992739] block drbd4: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
13:53:59 err kernel: [1179033.098132] block drbd4: PingAck did not arrive in time.
13:53:59 info kernel: [1179033.163921] block drbd4: peer( Secondary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
13:53:59 info kernel: [1179033.163928] block drbd4: asender terminated
13:53:59 info kernel: [1179033.163930] block drbd4: Terminating asender thread
13:53:59 info kernel: [1179033.163937] block drbd4: Creating new current UUID
13:53:59 err kernel: [1179033.163947] block drbd4: short read expecting header on sock: r=-512
13:53:59 info kernel: [1179033.242297] block drbd4: Connection closed
13:53:59 info kernel: [1179033.242301] block drbd4: conn( NetworkFailure -> Unconnected )
13:53:59 info kernel: [1179033.242305] block drbd4: receiver terminated
13:53:59 info kernel: [1179033.242307] block drbd4: Restarting receiver thread
13:53:59 info kernel: [1179033.242308] block drbd4: receiver (re)started
13:53:59 info kernel: [1179033.242312] block drbd4: conn( Unconnected -> WFConnection )
13:54:00 info kernel: [1179033.549223] block drbd4: Handshake successful: Agreed network protocol version 91
13:54:00 info kernel: [1179033.549282] block drbd4: conn( WFConnection -> WFReportParams )
13:54:00 info kernel: [1179033.549299] block drbd4: Starting asender thread (from drbd4_receiver [3978])

13:53:49.577936 IP 10.0.0.38.59203 > 10.0.0.34.7789: P 206963062:206963078(16) ack 183495174 win 24297 <nop,nop,timestamp 295258751 275521053>
13:53:49.578137 IP 10.0.0.34.7789 > 10.0.0.38.59203: . ack 206963078 win 24297 <nop,nop,timestamp 275521058 295258751>
13:53:49.583862 IP 10.0.0.34.43168 > 10.0.0.38.7789: P 182081:182113(32) ack 25 win 560 <nop,nop,timestamp 275521060 295258751>
13:53:49.583900 IP 10.0.0.34.43168 > 10.0.0.38.7789: P 182113:182129(16) ack 25 win 560 <nop,nop,timestamp 275521060 295258751>
13:53:49.583922 IP 10.0.0.38.7789 > 10.0.0.34.43168: . ack 182113 win 24297 <nop,nop,timestamp 295258753 275521060>
13:53:49.583933 IP 10.0.0.38.7789 > 10.0.0.34.43168: . ack 182129 win 24297 <nop,nop,timestamp 295258753 275521060>
13:53:51.735077 IP 10.0.0.38.59203 > 10.0.0.34.7789: P 206963078:206963086(8) ack 183495174 win 24297 <nop,nop,timestamp 295259291 275521058>
13:53:51.736179 IP 10.0.0.34.7789 > 10.0.0.38.59203: . ack 206963086 win 24297 <nop,nop,timestamp 275521598 295259291>
13:53:59.685209 IP 10.0.0.38.59203 > 10.0.0.34.7789: P 206963086:206963142(56) ack 183495174 win 24297 <nop,nop,timestamp 295261278 275521598>
13:53:59.685290 IP 10.0.0.34.7789 > 10.0.0.38.59203: . ack 206963142 win 24297 <nop,nop,timestamp 275523585 295261278>
13:53:59.726371 IP 10.0.0.38.59203 > 10.0.0.34.7789: F 206963142:206963142(0) ack 183495174 win 24297 <nop,nop,timestamp 295261289 275523585>
13:53:59.726456 IP 10.0.0.38.7789 > 10.0.0.34.43168: F 25:25(0) ack 182129 win 24297 <nop,nop,timestamp 295261289 275521060>
13:53:59.726573 IP 10.0.0.38.47010 > 10.0.0.34.7789: S 2744149259:2744149259(0) win 17920 <mss 8960,sackOK,timestamp 295261289 0,nop,wscale 5>
13:53:59.726587 IP 10.0.0.38.54732 > 10.0.0.34.7789: S 2745474681:2745474681(0) win 17920 <mss 8960,sackOK,timestamp 295261289 0,nop,wscale 5>
13:53:59.726596 IP 10.0.0.38.59106 > 10.0.0.34.7789: S 2733562364:2733562364(0) win 17920 <mss 8960,sackOK,timestamp 295261289 0,nop,wscale 5>
13:53:59.726714 IP 10.0.0.34.7789 > 10.0.0.38.47010: R 0:0(0) ack 2744149260 win 0
13:53:59.726736 IP 10.0.0.38.49802 > 10.0.0.34.7789: S 2745444990:2745444990(0) win 17920 <mss 8960,sackOK,timestamp 295261289 0,nop,wscale 5>
13:53:59.726738 IP 10.0.0.34.7789 > 10.0.0.38.59106: R 0:0(0) ack 2733562365 win 0
13:53:59.726818 IP 10.0.0.34.7789 > 10.0.0.38.49802: R 0:0(0) ack 2745444991 win 0
13:53:59.726932 IP 10.0.0.34.7789 > 10.0.0.38.54732: R 0:0(0) ack 2745474682 win 0
13:53:59.728643 IP 10.0.0.34.7789 > 10.0.0.38.59203: F 183495174:183495174(0) ack 206963143 win 24297 <nop,nop,timestamp 275523596 295261289>
13:53:59.728665 IP 10.0.0.38.59203 > 10.0.0.34.7789: . ack 183495175 win 24297 <nop,nop,timestamp 295261289 275523596>
13:53:59.728678 IP 10.0.0.34.7789 > 10.0.0.38.59203: R 183495175:183495175(0) ack 206963143 win 24297 <nop,nop,timestamp 275523596 295261289>
13:53:59.728738 IP 10.0.0.34.43168 > 10.0.0.38.7789: F 182129:182129(0) ack 26 win 560 <nop,nop,timestamp 275523596 295261289>
13:53:59.728764 IP 10.0.0.38.7789 > 10.0.0.34.43168: . ack 182130 win 24297 <nop,nop,timestamp 295261289 275523596>
13:53:59.728785 IP 10.0.0.34.7789 > 10.0.0.38.59203: R 2338233396:2338233396(0) win 0
13:53:59.736424 IP 10.0.0.34.36727 > 10.0.0.38.7789: S 2616550861:2616550861(0) win 17920 <mss 8960,sackOK,timestamp 275523597 0,nop,wscale 5>
13:53:59.736456 IP 10.0.0.34.39305 > 10.0.0.38.7789: S 2618111662:2618111662(0) win 17920 <mss 8960,sackOK,timestamp 275523597 0,nop,wscale 5>
13:53:59.736447 IP 10.0.0.38.7789 > 10.0.0.34.36727: R 0:0(0) ack 2616550862 win 0
13:53:59.736477 IP 10.0.0.34.49205 > 10.0.0.38.7789: S 2612359424:2612359424(0) win 17920 <mss 8960,sackOK,timestamp 275523597 0,nop,wscale 5>
13:53:59.736495 IP 10.0.0.38.7789 > 10.0.0.34.49205: R 0:0(0) ack 2612359425 win 0
13:53:59.736499 IP 10.0.0.34.41517 > 10.0.0.38.7789: S 2623980617:2623980617(0) win 17920 <mss 8960,sackOK,timestamp 275523597 0,nop,wscale 5>
13:53:59.736505 IP 10.0.0.38.7789 > 10.0.0.34.39305: R 0:0(0) ack 2618111663 win 0
13:53:59.736510 IP 10.0.0.38.7789 > 10.0.0.34.41517: R 0:0(0) ack 2623980618 win 0
13:53:59.826056 IP 10.0.0.38.52304 > 10.0.0.34.7789: S 2749276809:2749276809(0) win 17920 <mss 8960,sackOK,timestamp 295261314 0,nop,wscale 5>
13:53:59.826236 IP 10.0.0.34.7789 > 10.0.0.38.52304: R 0:0(0) ack 2749276810 win 0
13:53:59.836909 IP 10.0.0.34.49818 > 10.0.0.38.7789: S 2615558714:2615558714(0) win 17920 <mss 8960,sackOK,timestamp 275523623 0,nop,wscale 5>
13:53:59.836932 IP 10.0.0.38.7789 > 10.0.0.34.49818: R 0:0(0) ack 2615558715 win 0
13:53:59.926116 IP 10.0.0.38.47866 > 10.0.0.34.7789: S 2738108873:2738108873(0) win 17920 <mss 8960,sackOK,timestamp 295261339 0,nop,wscale 5>
13:53:59.926333 IP 10.0.0.34.7789 > 10.0.0.38.47866: R 0:0(0) ack 2738108874 win 0
13:53:59.936935 IP 10.0.0.34.36152 > 10.0.0.38.7789: S 2616179140:2616179140(0) win 17920 <mss 8960,sackOK,timestamp 275523648 0,nop,wscale 5>
13:53:59.936960 IP 10.0.0.38.7789 > 10.0.0.34.36152: S 2740425796:2740425796(0) ack 2616179141 win 17896 <mss 8960,sackOK,timestamp 295261341 275523648,nop,wscale 5>
13:53:59.937144 IP 10.0.0.34.36152 > 10.0.0.38.7789: . ack 1 win 560 <nop,nop,timestamp 275523648 295261341>
13:53:59.937170 IP 10.0.0.34.36152 > 10.0.0.38.7789: P 1:9(8) ack 1 win 560 <nop,nop,timestamp 275523648 295261341>
13:53:59.937179 IP 10.0.0.38.7789 > 10.0.0.34.36152: . ack 9 win 560 <nop,nop,timestamp 295261341 275523648>
13:53:59.937245 IP 10.0.0.38.44724 > 10.0.0.34.7789: S 2748622065:2748622065(0) win 17920 <mss 8960,sackOK,timestamp 295261341 0,nop,wscale 5>
13:53:59.937374 IP 10.0.0.34.7789 > 10.0.0.38.44724: S 2611311363:2611311363(0) ack 2748622066 win 17896 <mss 8960,sackOK,timestamp 275523648 295261341,nop,wscale 5>
13:53:59.937394 IP 10.0.0.38.44724 > 10.0.0.34.7789: . ack 1 win 560 <nop,nop,timestamp 295261341 275523648>
13:53:59.937453 IP 10.0.0.38.44724 > 10.0.0.34.7789: P 1:9(8) ack 1 win 560 <nop,nop,timestamp 295261341 275523648>
13:53:59.937568 IP 10.0.0.34.36152 > 10.0.0.38.7789: P 9:89(80) ack 1 win 560 <nop,nop,timestamp 275523648 295261341>
13:53:59.937582 IP 10.0.0.38.7789 > 10.0.0.34.36152: . ack 89 win 560 <nop,nop,timestamp 295261341 275523648>
13:53:59.937798 IP 10.0.0.34.7789 > 10.0.0.38.44724: . ack 9 win 560 <nop,nop,timestamp 275523648 295261341>
13:54:00.034067 IP 10.0.0.38.7789 > 10.0.0.34.36152: P 1:81(80) ack 89 win 560 <nop,nop,timestamp 295261366 275523648>


Log from the first node (10.0.0.34):

13:53:44 info kernel: [1100217.286005] block drbd4: Resync done (total 11 sec; paused 0 sec; 0 K/sec)
13:53:44 info kernel: [1100217.286013] block drbd4: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate )
13:53:44 info kernel: [1100217.286038] block drbd4: helper command: /sbin/drbdadm after-resync-target minor-4
13:53:44 info kernel: [1100217.288378] block drbd4: helper command: /sbin/drbdadm after-resync-target minor-4 exit code 0 (0x0)
13:53:59 err kernel: [1100232.386269] block drbd4: PingAck did not arrive in time.
13:53:59 info kernel: [1100232.451859] block drbd4: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
13:53:59 info kernel: [1100232.451867] block drbd4: asender terminated
13:53:59 info kernel: [1100232.451870] block drbd4: Terminating asender thread
13:53:59 err kernel: [1100232.451897] block drbd4: short read expecting header on sock: r=-512
13:53:59 info kernel: [1100232.537720] block drbd4: Connection closed
13:53:59 info kernel: [1100232.537726] block drbd4: conn( NetworkFailure -> Unconnected )
13:53:59 info kernel: [1100232.537731] block drbd4: receiver terminated
13:53:59 info kernel: [1100232.537734] block drbd4: Restarting receiver thread
13:53:59 info kernel: [1100232.537736] block drbd4: receiver (re)started
13:53:59 info kernel: [1100232.537741] block drbd4: conn( Unconnected -> WFConnection )
13:53:59 info kernel: [1100232.835022] block drbd4: Handshake successful: Agreed network protocol version 91
13:53:59 info kernel: [1100232.835028] block drbd4: conn( WFConnection -> WFReportParams )
13:53:59 info kernel: [1100232.835083] block drbd4: Starting asender thread (from drbd4_receiver [4795])
13:53:59 info kernel: [1100232.835324] block drbd4: data-integrity-alg: <not-used>

13:53:49.164490 IP 10.0.0.38.59203 > 10.0.0.34.7789: P 206963062:206963078(16) ack 183495174 win 24297 <nop,nop,timestamp 295258751 275521053>
13:53:49.164555 IP 10.0.0.34.7789 > 10.0.0.38.59203: . ack 206963078 win 24297 <nop,nop,timestamp 275521058 295258751>
13:53:49.170348 IP 10.0.0.34.43168 > 10.0.0.38.7789: P 182081:182113(32) ack 25 win 560 <nop,nop,timestamp 275521060 295258751>
13:53:49.170367 IP 10.0.0.34.43168 > 10.0.0.38.7789: P 182113:182129(16) ack 25 win 560 <nop,nop,timestamp 275521060 295258751>
13:53:49.170552 IP 10.0.0.38.7789 > 10.0.0.34.43168: . ack 182113 win 24297 <nop,nop,timestamp 295258753 275521060>
13:53:49.170560 IP 10.0.0.38.7789 > 10.0.0.34.43168: . ack 182129 win 24297 <nop,nop,timestamp 295258753 275521060>
13:53:51.321236 IP 10.0.0.38.59203 > 10.0.0.34.7789: P 206963078:206963086(8) ack 183495174 win 24297 <nop,nop,timestamp 295259291 275521058>
13:53:51.321259 IP 10.0.0.34.7789 > 10.0.0.38.59203: . ack 206963086 win 24297 <nop,nop,timestamp 275521598 295259291>
13:53:59.269893 IP 10.0.0.38.59203 > 10.0.0.34.7789: P 206963086:206963142(56) ack 183495174 win 24297 <nop,nop,timestamp 295261278 275521598>
13:53:59.269904 IP 10.0.0.34.7789 > 10.0.0.38.59203: . ack 206963142 win 24297 <nop,nop,timestamp 275523585 295261278>
13:53:59.311044 IP 10.0.0.38.59203 > 10.0.0.34.7789: F 206963142:206963142(0) ack 183495174 win 24297 <nop,nop,timestamp 295261289 275523585>
13:53:59.311134 IP 10.0.0.38.7789 > 10.0.0.34.43168: F 25:25(0) ack 182129 win 24297 <nop,nop,timestamp 295261289 275521060>
13:53:59.311251 IP 10.0.0.38.47010 > 10.0.0.34.7789: S 2744149259:2744149259(0) win 17920 <mss 8960,sackOK,timestamp 295261289 0,nop,wscale 5>
13:53:59.311268 IP 10.0.0.34.7789 > 10.0.0.38.47010: R 0:0(0) ack 2744149260 win 0
13:53:59.311332 IP 10.0.0.38.59106 > 10.0.0.34.7789: S 2733562364:2733562364(0) win 17920 <mss 8960,sackOK,timestamp 295261289 0,nop,wscale 5>
13:53:59.311347 IP 10.0.0.34.7789 > 10.0.0.38.59106: R 0:0(0) ack 2733562365 win 0
13:53:59.311410 IP 10.0.0.38.49802 > 10.0.0.34.7789: S 2745444990:2745444990(0) win 17920 <mss 8960,sackOK,timestamp 295261289 0,nop,wscale 5>
13:53:59.311422 IP 10.0.0.34.7789 > 10.0.0.38.49802: R 0:0(0) ack 2745444991 win 0
13:53:59.311502 IP 10.0.0.38.54732 > 10.0.0.34.7789: S 2745474681:2745474681(0) win 17920 <mss 8960,sackOK,timestamp 295261289 0,nop,wscale 5>
13:53:59.311539 IP 10.0.0.34.7789 > 10.0.0.38.54732: R 0:0(0) ack 2745474682 win 0
13:53:59.313212 IP 10.0.0.34.7789 > 10.0.0.38.59203: F 183495174:183495174(0) ack 206963143 win 24297 <nop,nop,timestamp 275523596 295261289>
13:53:59.313300 IP 10.0.0.34.7789 > 10.0.0.38.59203: R 183495175:183495175(0) ack 206963143 win 24297 <nop,nop,timestamp 275523596 295261289>
13:53:59.313319 IP 10.0.0.34.43168 > 10.0.0.38.7789: F 182129:182129(0) ack 26 win 560 <nop,nop,timestamp 275523596 295261289>
13:53:59.313401 IP 10.0.0.38.59203 > 10.0.0.34.7789: . ack 183495175 win 24297 <nop,nop,timestamp 295261289 275523596>
13:53:59.313410 IP 10.0.0.34.7789 > 10.0.0.38.59203: R 2338233396:2338233396(0) win 0
13:53:59.313500 IP 10.0.0.38.7789 > 10.0.0.34.43168: . ack 182130 win 24297 <nop,nop,timestamp 295261289 275523596>
13:53:59.320979 IP 10.0.0.34.39305 > 10.0.0.38.7789: S 2618111662:2618111662(0) win 17920 <mss 8960,sackOK,timestamp 275523597 0,nop,wscale 5>
13:53:59.321045 IP 10.0.0.34.36727 > 10.0.0.38.7789: S 2616550861:2616550861(0) win 17920 <mss 8960,sackOK,timestamp 275523597 0,nop,wscale 5>
13:53:59.321060 IP 10.0.0.34.49205 > 10.0.0.38.7789: S 2612359424:2612359424(0) win 17920 <mss 8960,sackOK,timestamp 275523597 0,nop,wscale 5>
13:53:59.321080 IP 10.0.0.34.41517 > 10.0.0.38.7789: S 2623980617:2623980617(0) win 17920 <mss 8960,sackOK,timestamp 275523597 0,nop,wscale 5>
13:53:59.321145 IP 10.0.0.38.7789 > 10.0.0.34.36727: R 0:0(0) ack 2616550862 win 0
13:53:59.321174 IP 10.0.0.38.7789 > 10.0.0.34.39305: R 0:0(0) ack 2618111663 win 0
13:53:59.321191 IP 10.0.0.38.7789 > 10.0.0.34.49205: R 0:0(0) ack 2612359425 win 0
13:53:59.321196 IP 10.0.0.38.7789 > 10.0.0.34.41517: R 0:0(0) ack 2623980618 win 0
13:53:59.410774 IP 10.0.0.38.52304 > 10.0.0.34.7789: S 2749276809:2749276809(0) win 17920 <mss 8960,sackOK,timestamp 295261314 0,nop,wscale 5>
13:53:59.410796 IP 10.0.0.34.7789 > 10.0.0.38.52304: R 0:0(0) ack 2749276810 win 0
13:53:59.421456 IP 10.0.0.34.49818 > 10.0.0.38.7789: S 2615558714:2615558714(0) win 17920 <mss 8960,sackOK,timestamp 275523623 0,nop,wscale 5>
13:53:59.421659 IP 10.0.0.38.7789 > 10.0.0.34.49818: R 0:0(0) ack 2615558715 win 0
13:53:59.510800 IP 10.0.0.38.47866 > 10.0.0.34.7789: S 2738108873:2738108873(0) win 17920 <mss 8960,sackOK,timestamp 295261339 0,nop,wscale 5>
13:53:59.510824 IP 10.0.0.34.7789 > 10.0.0.38.47866: R 0:0(0) ack 2738108874 win 0
13:53:59.521418 IP 10.0.0.34.36152 > 10.0.0.38.7789: S 2616179140:2616179140(0) win 17920 <mss 8960,sackOK,timestamp 275523648 0,nop,wscale 5>
13:53:59.521708 IP 10.0.0.38.7789 > 10.0.0.34.36152: S 2740425796:2740425796(0) ack 2616179141 win 17896 <mss 8960,sackOK,timestamp 295261341 275523648,nop,wscale 5>
13:53:59.521727 IP 10.0.0.34.36152 > 10.0.0.38.7789: . ack 1 win 560 <nop,nop,timestamp 275523648 295261341>
13:53:59.521742 IP 10.0.0.34.36152 > 10.0.0.38.7789: P 1:9(8) ack 1 win 560 <nop,nop,timestamp 275523648 295261341>
13:53:59.521930 IP 10.0.0.38.44724 > 10.0.0.34.7789: S 2748622065:2748622065(0) win 17920 <mss 8960,sackOK,timestamp 295261341 0,nop,wscale 5>
13:53:59.521954 IP 10.0.0.34.7789 > 10.0.0.38.44724: S 2611311363:2611311363(0) ack 2748622066 win 17896 <mss 8960,sackOK,timestamp 275523648 295261341,nop,wscale 5>
13:53:59.521976 IP 10.0.0.38.7789 > 10.0.0.34.36152: . ack 9 win 560 <nop,nop,timestamp 295261341 275523648>
13:53:59.522034 IP 10.0.0.38.44724 > 10.0.0.34.7789: . ack 1 win 560 <nop,nop,timestamp 295261341 275523648>
13:53:59.522090 IP 10.0.0.38.44724 > 10.0.0.34.7789: P 1:9(8) ack 1 win 560 <nop,nop,timestamp 295261341 275523648>
13:53:59.522101 IP 10.0.0.34.7789 > 10.0.0.38.44724: . ack 9 win 560 <nop,nop,timestamp 275523648 295261341>
13:53:59.522128 IP 10.0.0.34.36152 > 10.0.0.38.7789: P 9:89(80) ack 1 win 560 <nop,nop,timestamp 275523648 295261341>
13:53:59.522303 IP 10.0.0.38.7789 > 10.0.0.34.36152: . ack 89 win 560 <nop,nop,timestamp 295261341 275523648>
13:53:59.618708 IP 10.0.0.38.7789 > 10.0.0.34.36152: P 1:81(80) ack 89 win 560 <nop,nop,timestamp 295261366 275523648>
13:53:59.618753 IP 10.0.0.34.36152 > 10.0.0.38.7789: . ack 81 win 560 <nop,nop,timestamp 275523672 295261366>



I just saw that I get timeouts even when starting resync (but not when it's already pushing data):

May 21 01:18:17 store1 kernel: [2607128.789229] block drbd4: conn( Unconnected -> WFConnection )
May 21 01:18:17 store1 kernel: [2607129.089455] block drbd4: Handshake successful: Agreed network protocol version 91
May 21 01:18:17 store1 kernel: [2607129.089471] block drbd4: conn( WFConnection -> WFReportParams )
May 21 01:18:17 store1 kernel: [2607129.089486] block drbd4: Starting asender thread (from drbd4_receiver [13831])
May 21 01:18:17 store1 kernel: [2607129.089558] block drbd4: data-integrity-alg: <not-used>
May 21 01:18:17 store1 kernel: [2607129.089708] block drbd4: drbd_sync_handshake:
May 21 01:18:17 store1 kernel: [2607129.089711] block drbd4: self CA5AE152CD8CFEFF:F5AD3FCE8536F45F:227FB92BEF38C597:885FCB042C86DA8F bits:0 flags:0
May 21 01:18:17 store1 kernel: [2607129.089714] block drbd4: peer F5AD3FCE8536F45E:0000000000000000:227FB92BEF38C596:885FCB042C86DA8F bits:0 flags:0
May 21 01:18:17 store1 kernel: [2607129.089717] block drbd4: uuid_compare()=1 by rule 70
May 21 01:18:17 store1 kernel: [2607129.089721] block drbd4: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> UpToDate )
May 21 01:18:22 store1 kernel: [2607133.222773] block drbd4: conn( WFBitMapS -> SyncSource ) pdsk( UpToDate -> Inconsistent )
May 21 01:18:22 store1 kernel: [2607133.222781] block drbd4: Began resync as SyncSource (will sync 0 KB [0 bits set]).
May 21 01:18:32 store1 kernel: [2607143.202222] block drbd4: PingAck did not arrive in time.
--
May 21 01:18:17 store2 kernel: [89815.905398] block drbd4: conn( Unconnected -> WFConnection )
May 21 01:18:17 store2 kernel: [89816.260808] block drbd4: Handshake successful: Agreed network protocol version 91
May 21 01:18:17 store2 kernel: [89816.260814] block drbd4: conn( WFConnection -> WFReportParams )
May 21 01:18:17 store2 kernel: [89816.260825] block drbd4: Starting asender thread (from drbd4_receiver [3967])
May 21 01:18:17 store2 kernel: [89816.261038] block drbd4: data-integrity-alg: <not-used>
May 21 01:18:18 store2 kernel: [89816.261176] block drbd4: drbd_sync_handshake:
May 21 01:18:18 store2 kernel: [89816.261179] block drbd4: self F5AD3FCE8536F45E:0000000000000000:227FB92BEF38C596:885FCB042C86DA8F bits:0 flags:0
May 21 01:18:18 store2 kernel: [89816.261183] block drbd4: peer CA5AE152CD8CFEFF:F5AD3FCE8536F45F:227FB92BEF38C597:885FCB042C86DA8F bits:0 flags:0
May 21 01:18:18 store2 kernel: [89816.261185] block drbd4: uuid_compare()=-1 by rule 50
May 21 01:18:18 store2 kernel: [89816.261189] block drbd4: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
May 21 01:18:22 store2 kernel: [89820.381905] block drbd4: conn( WFBitMapT -> WFSyncUUID )
May 21 01:18:22 store2 kernel: [89820.394270] block drbd4: helper command: /sbin/drbdadm before-resync-target minor-4
May 21 01:18:22 store2 kernel: [89820.397180] block drbd4: helper command: /sbin/drbdadm before-resync-target minor-4 exit code 0 (0x0)
May 21 01:18:22 store2 kernel: [89820.397186] block drbd4: conn( WFSyncUUID -> SyncTarget ) disk( UpToDate -> Inconsistent )
May 21 01:18:22 store2 kernel: [89820.397192] block drbd4: Began resync as SyncTarget (will sync 0 KB [0 bits set]).
May 21 01:18:32 store2 kernel: [89830.376024] block drbd4: PingAck did not arrive in time.


The connection is established at 01:18:17, timeout at :32, 15 seconds later, 10 seconds after the last logged action.


And now the second issue.

I didn't have ko-count set, but I still don't see how this is possible:

May 20 13:03:04 store1 kernel: [2563102.771995] block drbd0: [drbd0_worker/4751] sock_sendmsg time expired, ko = 4294967295
May 20 13:03:10 store1 kernel: [2563108.856700] block drbd0: [drbd0_worker/4751] sock_sendmsg time expired, ko = 4294967294
May 20 13:03:17 store1 kernel: [2563114.941474] block drbd0: [drbd0_worker/4751] sock_sendmsg time expired, ko = 4294967293
May 20 13:03:23 store1 kernel: [2563121.026493] block drbd0: [drbd0_worker/4751] sock_sendmsg time expired, ko = 4294967292
May 20 13:03:29 store1 kernel: [2563127.111342] block drbd0: [drbd0_worker/4751] sock_sendmsg time expired, ko = 4294967291
May 20 13:03:35 store1 kernel: [2563133.195889] block drbd0: [drbd0_worker/4751] sock_sendmsg time expired, ko = 4294967290
May 20 13:03:41 store1 kernel: [2563139.284666] block drbd0: [drbd0_worker/4751] sock_sendmsg time expired, ko = 4294967289
May 20 13:03:47 store1 kernel: [2563145.369499] block drbd0: [drbd0_worker/4751] sock_sendmsg time expired, ko = 4294967288
May 20 13:03:53 store1 kernel: [2563151.454239] block drbd0: [drbd0_worker/4751] sock_sendmsg time expired, ko = 4294967287
May 20 13:03:59 store1 kernel: [2563157.539055] block drbd0: [drbd0_worker/4751] sock_sendmsg time expired, ko = 4294967286
May 20 13:04:05 store1 kernel: [2563163.623914] block drbd0: [drbd0_worker/4751] sock_sendmsg time expired, ko = 4294967285
May 20 13:04:11 store1 kernel: [2563169.708710] block drbd0: [drbd0_worker/4751] sock_sendmsg time expired, ko = 4294967284
May 20 13:04:18 store1 kernel: [2563175.793430] block drbd0: [drbd0_worker/4751] sock_sendmsg time expired, ko = 4294967283
May 20 13:04:24 store1 kernel: [2563181.878285] block drbd0: [drbd0_worker/4751] sock_sendmsg time expired, ko = 4294967282
May 20 13:04:30 store1 kernel: [2563187.963354] block drbd0: [drbd0_worker/4751] sock_sendmsg time expired, ko = 4294967281
May 20 13:04:36 store1 kernel: [2563194.047901] block drbd0: [drbd0_worker/4751] sock_sendmsg time expired, ko = 4294967280
May 20 13:04:42 store1 kernel: [2563200.132639] block drbd0: [drbd0_worker/4751] sock_sendmsg time expired, ko = 4294967279
May 20 13:04:48 store1 kernel: [2563206.217530] block drbd0: [drbd0_worker/4751] sock_sendmsg time expired, ko = 4294967278
May 20 13:04:54 store1 kernel: [2563212.302302] block drbd0: [drbd0_worker/4751] sock_sendmsg time expired, ko = 4294967277
May 20 13:05:00 store1 kernel: [2563218.387028] block drbd0: [drbd0_worker/4751] sock_sendmsg time expired, ko = 4294967276
May 20 13:05:06 store1 kernel: [2563224.469328] block drbd0: [drbd0_worker/4751] sock_sendmsg time expired, ko = 4294967275
May 20 13:05:12 store1 kernel: [2563230.551079] block drbd0: [drbd0_worker/4751] sock_sendmsg time expired, ko = 4294967274
May 20 13:05:18 store1 kernel: [2563236.632952] block drbd0: [drbd0_worker/4751] sock_sendmsg time expired, ko = 4294967273
May 20 13:05:25 store1 kernel: [2563242.714601] block drbd0: [drbd0_worker/4751] sock_sendmsg time expired, ko = 4294967272
May 20 13:05:31 store1 kernel: [2563248.796294] block drbd0: [drbd0_worker/4751] sock_sendmsg time expired, ko = 4294967271
May 20 13:05:37 store1 kernel: [2563254.878074] block drbd0: [drbd0_worker/4751] sock_sendmsg time expired, ko = 4294967270
May 20 13:05:43 store1 kernel: [2563260.959823] block drbd0: [drbd0_worker/4751] sock_sendmsg time expired, ko = 4294967269
May 20 13:05:49 store1 kernel: [2563267.041604] block drbd0: [drbd0_worker/4751] sock_sendmsg time expired, ko = 4294967268
May 20 13:05:55 store1 kernel: [2563273.123391] block drbd0: [drbd0_worker/4751] sock_sendmsg time expired, ko = 4294967267
May 20 13:06:01 store1 kernel: [2563279.205064] block drbd0: [drbd0_worker/4751] sock_sendmsg time expired, ko = 4294967266
May 20 13:06:07 store1 kernel: [2563285.286897] block drbd0: [drbd0_worker/4751] sock_sendmsg time expired, ko = 4294967265
May 20 13:06:13 store1 kernel: [2563291.368702] block drbd0: [drbd0_worker/4751] sock_sendmsg time expired, ko = 4294967264
May 20 13:06:19 store1 kernel: [2563297.450412] block drbd0: [drbd0_worker/4751] sock_sendmsg time expired, ko = 4294967263
May 20 13:06:26 store1 kernel: [2563303.532186] block drbd0: [drbd0_worker/4751] sock_sendmsg time expired, ko = 4294967262
May 20 13:06:32 store1 kernel: [2563309.613911] block drbd0: [drbd0_worker/4751] sock_sendmsg time expired, ko = 4294967261
May 20 13:06:33 store1 kernel: [2563311.130613] block drbd0: sock_recvmsg returned -104
May 20 13:06:33 store1 kernel: [2563311.190992] block drbd0: peer( Secondary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
May 20 13:06:33 store1 kernel: [2563311.190998] block drbd0: asender terminated
May 20 13:06:33 store1 kernel: [2563311.191001] block drbd0: Terminating asender thread
May 20 13:06:33 store1 kernel: [2563311.191061] block drbd0: short read expecting header on sock: r=-512
May 20 13:06:35 store1 kernel: [2563312.970566] block drbd0: sock_sendmsg returned -104
May 20 13:06:35 store1 kernel: [2563313.372496] block drbd0: Creating new current UUID
May 20 13:06:35 store1 kernel: [2563313.437068] block drbd0: Connection closed
May 20 13:06:35 store1 kernel: [2563313.437081] block drbd0: conn( NetworkFailure -> Unconnected )
May 20 13:06:35 store1 kernel: [2563313.437096] block drbd0: receiver terminated
May 20 13:06:35 store1 kernel: [2563313.437099] block drbd0: Restarting receiver thread
May 20 13:06:35 store1 kernel: [2563313.437101] block drbd0: receiver (re)started
May 20 13:06:35 store1 kernel: [2563313.437117] block drbd0: conn( Unconnected -> WFConnection )
May 20 13:06:35 store1 kernel: [2563313.437907] block drbd0: connect failed, err = -104
May 20 13:06:36 store1 kernel: [2563313.795987] block drbd0: Handshake successful: Agreed network protocol version 91
--
May 20 13:03:15 store2 kernel: [45799.822766] block drbd0: PingAck did not arrive in time.
May 20 13:03:15 store2 kernel: [45799.886303] block drbd0: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
May 20 13:03:15 store2 kernel: [45799.886310] block drbd0: asender terminated
May 20 13:03:15 store2 kernel: [45799.886314] block drbd0: Terminating asender thread
May 20 13:03:15 store2 kernel: [45799.886334] block drbd0: short read receiving data: read 704 expected 4096
May 20 13:03:15 store2 kernel: [45799.886337] block drbd0: error receiving Data, l: 4120!
May 20 13:03:17 store2 kernel: [45801.699394] block drbd0: Connection closed
May 20 13:03:17 store2 kernel: [45801.699398] block drbd0: conn( NetworkFailure -> Unconnected )
May 20 13:03:17 store2 kernel: [45801.699401] block drbd0: receiver terminated
May 20 13:03:17 store2 kernel: [45801.699403] block drbd0: Restarting receiver thread
May 20 13:03:17 store2 kernel: [45801.699404] block drbd0: receiver (re)started
May 20 13:03:17 store2 kernel: [45801.699409] block drbd0: conn( Unconnected -> WFConnection )
May 20 13:06:36 store2 kernel: [46000.415054] block drbd0: Error receiving initial packet
May 20 13:06:36 store2 kernel: [46000.773161] block drbd0: Handshake successful: Agreed network protocol version 91


This is a rather serious fault, as a connection hung like this makes writing impossible to the device (I'm using proto C).
So I decided to set ko-count to 4, and adjusted the config of drbd0. It failed saying something about the local disk. I
don't have the console output anymore, only the syslog:


May 20 13:06:36 store1 kernel: [2563313.795987] block drbd0: Handshake successful: Agreed network protocol version 91
May 20 13:06:36 store1 kernel: [2563313.795999] block drbd0: conn( WFConnection -> WFReportParams )
May 20 13:06:36 store1 kernel: [2563313.834861] block drbd0: Starting asender thread (from drbd0_receiver [4791])
May 20 13:06:36 store1 kernel: [2563313.834915] block drbd0: data-integrity-alg: <not-used>
May 20 13:06:36 store1 kernel: [2563314.109040] block drbd0: drbd_sync_handshake:
May 20 13:06:36 store1 kernel: [2563314.109047] block drbd0: self 715899BC76796885:10F03B886C658C7B:2D8B933D7AA91E8B:0661832F339ADD0D bits:1758 flags:0
May 20 13:06:36 store1 kernel: [2563314.109052] block drbd0: peer 10F03B886C658C7A:0000000000000000:2D8B933D7AA91E8A:0661832F339ADD0D bits:0 flags:0
May 20 13:06:36 store1 kernel: [2563314.109056] block drbd0: uuid_compare()=1 by rule 70
May 20 13:06:36 store1 kernel: [2563314.109105] block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> UpToDate )
May 20 13:06:37 store1 kernel: [2563315.057964] block drbd0: conn( WFBitMapS -> Connected ) disk( UpToDate -> Diskless )
May 20 13:06:37 store1 kernel: [2563315.065299] block drbd0: unexpected cstate (Connected) in receive_bitmap
May 20 13:06:38 store1 kernel: [2563316.287950] block drbd0: disk( Diskless -> Attaching )
May 20 13:06:38 store1 kernel: [2563316.295535] block drbd0: Found 6 transactions (276 active extents) in activity log.
May 20 13:06:38 store1 kernel: [2563316.295538] block drbd0: Method to ensure write ordering: barrier
May 20 13:06:38 store1 kernel: [2563316.295543] block drbd0: max_segment_size ( = BIO size ) = 4096
May 20 13:06:38 store1 kernel: [2563316.295546] block drbd0: drbd_bm_resize called with capacity == 5469791136
May 20 13:06:39 store1 kernel: [2563317.001653] block drbd0: resync bitmap: bits=683723892 words=10683186
May 20 13:06:39 store1 kernel: [2563317.001657] block drbd0: size = 2608 GB (2734895568 KB)
May 20 13:06:39 store1 kernel: [2563317.001677] block drbd0: Writing the whole bitmap, size changed
May 20 13:06:39 store1 kernel: [2563317.259797] block drbd0: meta data flush failed with status -95, disabling md-flushes
May 20 13:06:39 store1 kernel: [2563317.355539] block drbd0: 1058 GB (277286209 bits) marked out-of-sync by on disk bit-map.
May 20 13:06:39 store1 kernel: [2563317.517842] block drbd0: recounting of set bits took additional 13 jiffies
May 20 13:06:39 store1 kernel: [2563317.517846] block drbd0: 1058 GB (277286209 bits) marked out-of-sync by on disk bit-map.
May 20 13:06:39 store1 kernel: [2563317.518001] block drbd0: Marked additional 0 KB as out-of-sync based on AL.
May 20 13:06:39 store1 kernel: [2563317.518015] block drbd0: disk( Attaching -> Negotiating )
May 20 13:06:40 store1 kernel: [2563317.540541] block drbd0: Connection lost while negotiating, no data!
May 20 13:06:41 store1 kernel: [2563319.270538] block drbd0: drbd_sync_handshake:
May 20 13:06:41 store1 kernel: [2563319.270543] block drbd0: self 715899BC76796885:10F03B886C658C7B:2D8B933D7AA91E8B:0661832F339ADD0D bits:277286209 flags:0
May 20 13:06:41 store1 kernel: [2563319.270547] block drbd0: peer BF8DB9AF5095E612:10F03B886C658C7A:2D8B933D7AA91E8A:0661832F339ADD0D bits:277287618 flags:0
May 20 13:06:41 store1 kernel: [2563319.270550] block drbd0: uuid_compare()=100 by rule 90
May 20 13:06:41 store1 kernel: [2563319.270552] block drbd0: Split-Brain detected, dropping connection!


For some reason, this node (the primary) became 'Diskless', and it's not possible to attach the disk anymore.

The secondary node logged this:

May 20 13:06:36 store2 kernel: [46000.773170] block drbd0: conn( WFConnection -> WFReportParams )
May 20 13:06:36 store2 kernel: [46000.773191] block drbd0: Starting asender thread (from drbd0_receiver [3964])
May 20 13:06:36 store2 kernel: [46000.842394] block drbd0: data-integrity-alg: <not-used>
May 20 13:06:36 store2 kernel: [46000.842406] block drbd0: Considerable difference in lower level device sizes: 7031032704s vs. 5469791136s
May 20 13:06:36 store2 kernel: [46000.842416] block drbd0: drbd_sync_handshake:
May 20 13:06:36 store2 kernel: [46000.842420] block drbd0: self 10F03B886C658C7A:0000000000000000:2D8B933D7AA91E8A:0661832F339ADD0D bits:0 flags:0
May 20 13:06:36 store2 kernel: [46000.842424] block drbd0: peer 715899BC76796885:10F03B886C658C7B:2D8B933D7AA91E8B:0661832F339ADD0D bits:1758 flags:0
May 20 13:06:36 store2 kernel: [46000.842428] block drbd0: uuid_compare()=-1 by rule 50
May 20 13:06:36 store2 kernel: [46000.842434] block drbd0: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
May 20 13:06:38 store2 kernel: [46002.033671] block drbd0: conn( WFBitMapT -> WFSyncUUID )
May 20 13:06:38 store2 kernel: [46002.033684] block drbd0: Resync aborted.
May 20 13:06:38 store2 kernel: [46002.033686] block drbd0: conn( WFSyncUUID -> Connected ) pdsk( UpToDate -> Diskless )
May 20 13:06:38 store2 kernel: [46002.033783] block drbd0: Creating new current UUID
May 20 13:06:40 store2 kernel: [46004.495638] block drbd0: Considerable difference in lower level device sizes: 7031032704s vs. 5469791136s
May 20 13:06:40 store2 kernel: [46004.495643] block drbd0: Peer sets u_size to 0 sectors
May 20 13:06:40 store2 kernel: [46004.495648] block drbd0: drbd_bm_resize called with capacity == 5469791136
May 20 13:06:40 store2 kernel: [46004.502587] block drbd0: resync bitmap: bits=683723892 words=10683186
May 20 13:06:40 store2 kernel: [46004.502592] block drbd0: size = 2608 GB (2734895568 KB)
May 20 13:06:40 store2 kernel: [46004.502609] block drbd0: Writing the whole bitmap, size changed
May 20 13:06:41 store2 kernel: [46005.074682] block drbd0: 1058 GB (277287618 bits) marked out-of-sync by on disk bit-map.
May 20 13:06:41 store2 kernel: [46005.075234] block drbd0: real peer disk state = Consistent
May 20 13:06:41 store2 kernel: [46005.075236] block drbd0: drbd_sync_handshake:
May 20 13:06:41 store2 kernel: [46005.075239] block drbd0: self BF8DB9AF5095E612:10F03B886C658C7A:2D8B933D7AA91E8A:0661832F339ADD0D bits:1763 flags:0
May 20 13:06:41 store2 kernel: [46005.075243] block drbd0: peer 715899BC76796885:10F03B886C658C7B:2D8B933D7AA91E8B:0661832F339ADD0D bits:277286209 flags:2
May 20 13:06:41 store2 kernel: [46005.075246] block drbd0: uuid_compare()=100 by rule 90
May 20 13:06:41 store2 kernel: [46005.075248] block drbd0: Split-Brain detected, dropping connection!


Summary of the above:

1a, Idle/new sessions time out with no apparent reason, before ping-int+ping-timeout seconds of inactivity (this happens all the time, easy to reproduce)
1b, When trying to reconnect, drbd opens a lot of sockets - this had no negative side effects so far, but looks strange
2, After the secondary closes the connection, the primary still tries to communicate with it - the timeout mechanism fails again (observed only a few times)
3, It's possible to lose the primary disk and some transactions on it after a reconnect (happened only once)





Regards,

Matyas Koszik




More information about the drbd-user mailing list