Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
/ 2004-05-19 20:49:37 +0400
\ Eugene Crosser:
> On Wed, 2004-05-19 at 19:13, Lars Ellenberg wrote:
>
> > please try this patch, I'd like to see if that has some impact
> > before I do a check in.
>
> Things got broken. On connection, it synched fine and both nodes said
> "Resync done" at the same time. But switchover did not work. Secondary
> stayed in
>
>  0: cs:Connected st:Secondary/Primary ld:Consistent
>     ns:0 nr:1610024 dw:1610024 dr:810004 al:0 bm:1768 lo:0 pe:0 ua:0 ap:0
>
> I edited the log slightly, nfsa1 and nfsa2 interleaving, nfsa2 marked.
>
> 40:49 nfsa1 kernel: drbd0: size = 214165504 KB
> 40:50 nfsa1 kernel: drbd0: 0 KB marked out-of-sync by on disk bit-map.
> 40:50 nfsa1 kernel: drbd0: Found 6 transactions (324 active extents) in activity log.
> 40:50 nfsa1 kernel: drbd0: drbdsetup [143]: cstate Unconfigured(0) --> StandAlone(1)
> 40:50 nfsa1 kernel: drbd0: drbdsetup [145]: cstate StandAlone(1) --> Unconnected(2)
> 40:50 nfsa1 kernel: drbd0: drbd0_receiver [146]: cstate Unconnected(2) --> WFConnection(6)
> 40:50 nfsa1 kernel: drbd0: drbd0_receiver [146]: cstate WFConnection(6) --> WFReportParams(7)
> | 40:50 nfsa2 kernel: drbd0: drbd0_receiver [146]: cstate WFConnection(6) --> WFReportParams(7)
> | 40:50 nfsa2 kernel: drbd0: Connection established.
> 40:50 nfsa1 kernel: drbd0: Connection established.
> | 40:50 nfsa2 kernel: drbd0: I am(P): 1:00000001:00000001:0000729c:00000030:10
> 40:50 nfsa1 kernel: drbd0: I am(S): 1:00000001:00000001:0000729b:0000002e:01
> | 40:50 nfsa2 kernel: drbd0: Peer(S): 1:00000001:00000001:0000729b:0000002e:01
> 40:50 nfsa1 kernel: drbd0: Peer(P): 1:00000001:00000001:0000729c:00000030:10
> 40:50 nfsa1 kernel: drbd0: drbd0_receiver [146]: cstate WFReportParams(7) --> WFBitMapT(12)
> | 40:50 nfsa2 kernel: drbd0: drbd0_receiver [146]: cstate WFReportParams(7) --> WFBitMapS(11)
> 40:50 nfsa1 kernel: drbd0: drbd0_receiver [146]: cstate WFBitMapT(12) --> SyncTarget(14)
> 40:50 nfsa1 kernel: drbd0: Resync started as target (need to sync 1422984 KB).
> | 40:50 nfsa2 kernel: drbd0: drbd0_receiver [146]: cstate WFBitMapS(11) --> SyncSource(13)
> | 40:50 nfsa2 kernel: drbd0: Resync started as source (need to sync 1422984 KB).
> 40:51 nfsa1 kernel: process `snmpd' is using obsolete setsockopt SO_BSDCOMPAT
> 41:32 nfsa1 kernel: drbd0: Resync done (total 42 sec; 33880 K/sec)
> | 41:32 nfsa2 kernel: drbd0: Resync done (total 42 sec; 33880 K/sec)
> | 41:32 nfsa2 kernel: drbd0: drbd0_worker [144]: cstate SyncSource(13) --> Connected(8)
> 41:32 nfsa1 kernel: drbd0: drbd0_worker [144]: cstate SyncTarget(14) --> Connected(8)

OK. Failover:

> 43:35 nfsa1 heartbeat: ERROR: Could not locate obtain hardware address for eth0
> 43:35 nfsa1 heartbeat: ERROR: Return code 1 from /etc/ha.d/resource.d/IPaddr

I am missing the heartbeat ERROR from the drbddisk script here.
Is it possible that it was not called?
Or that heartbeat thinks it did "succeed" even though it failed?

> 43:36 nfsa1 heartbeat: ERROR: Return code 1 from /etc/ha.d/resource.d/filesys

I think the problem is that heartbeat recognizes "immediately" that the
other node is down, whereas DRBD needs 10 to 20 seconds (more exactly:
worst case one to two timeout intervals plus one ping interval) to
finally drop the TCP connection. During this time there should be no
access on the Secondary yet, since the typical first access is "mount",
and that depends on "drbddisk start", which is "drbdadm primary", which
will not succeed while DRBD still thinks it is connected and the other
node is Primary.
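To illustrate that window, here is a minimal sketch of a drbddisk-style
"start" action that simply retries "drbdadm primary" until DRBD has
noticed the dead peer and dropped the stale connection. This is not the
actual heartbeat drbddisk script; the resource name "r0", the retry
count and the one-second sleep are made-up values:

    #!/bin/sh
    # Hypothetical retry wrapper around "drbdadm primary" for a
    # drbddisk-style "start" action.  All numbers are illustrative.
    RES=r0
    TRIES=30            # ~30 s; should cover 2*timeout + ping-int
    i=0
    while [ $i -lt $TRIES ]; do
        if drbdadm primary $RES; then
            exit 0      # we are Primary now, heartbeat may mount
        fi
        # "drbdadm primary" is refused while the cstate is still
        # Connected and the peer is still seen as Primary; wait for
        # DRBD to drop the dead connection, then try again.
        sleep 1
        i=$((i + 1))
    done
    exit 1              # tell heartbeat the resource did not start

With, for example, timeout=60 (in units of 0.1 s, i.e. 6 s) and
ping-int=10 in the net section, that worst case comes out at roughly
2*6 + 10 = 22 seconds, so any such retry budget has to be at least that
long.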
So either one of the heartbeat resource scripts did not return the
correct answer / exit code, or was not called at all (there is no
heartbeat message about drbddisk above), or the heartbeat+DRBD
interaction logic needs to be improved here. This is not a DRBD
internal problem, but an interaction problem with heartbeat.

The asserts below are IO requests on a Secondary device. At the moment,
READ access will still get through (though noisy, and you won't be able
to mount it); WRITE will receive IO errors.

> 43:37 nfsa1 kernel: drbd0: ASSERT( mdev->state == Primary ) in drivers/block/drbd/drbd_req-2.4.c:154
> 43:37 nfsa1 last message repeated 15 times
> 43:37 nfsa1 kernel: drbd0: ASSERT( mdev->state == Primary ) in drivers/block/drbd/drbd_main.c:1145
> 43:37 nfsa1 kernel: drbd0: ASSERT( mdev->state == Primary ) in drivers/block/drbd/drbd_req-2.4.c:154
> 43:37 nfsa1 last message repeated 14 times
> 43:37 nfsa1 kernel: drbd0: ASSERT( mdev->state == Primary ) in drivers/block/drbd/drbd_main.c:1145
> 43:37 nfsa1 kernel: drbd0: ASSERT( mdev->state == Primary ) in drivers/block/drbd/drbd_req-2.4.c:154
> 43:37 nfsa1 last message repeated 31 times
> 43:37 nfsa1 kernel: drbd0: ASSERT( mdev->state == Primary ) in drivers/block/drbd/drbd_main.c:1145
> 43:37 nfsa1 kernel: drbd0: ASSERT( mdev->state == Primary ) in drivers/block/drbd/drbd_req-2.4.c:154
> 43:37 nfsa1 last message repeated 31 times
> 43:37 nfsa1 kernel: drbd0: ASSERT( mdev->state == Primary ) in drivers/block/drbd/drbd_main.c:1145
> 43:37 nfsa1 kernel: drbd0: ASSERT( mdev->state == Primary ) in drivers/block/drbd/drbd_req-2.4.c:154
> 43:37 nfsa1 last message repeated 31 times
> 43:37 nfsa1 kernel: in drivers/block/drbd/drbd_req-2.4.c:154
> 43:37 nfsa1 kernel: drbd0: ASSERT( mdev->state == Primary ) in drivers/block/drbd/drbd_req-2.4.c:154
> 43:37 nfsa1 kernel: drbd0: ASSERT( mdev->state == Primary ) in drivers/block/drbd/drbd_main.c:1145
> 43:37 nfsa1 kernel: drbd0: ASSERT( mdev->state == Primary ) in drivers/block/drbd/drbd_req-2.4.c:154
> 43:37 nfsa1 last message repeated 6 times

... loads of these.

	Lars Ellenberg