Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Just some analysis from myself: Michael Paesold wrote: > Logs from primary/sync source: (xen03) > ------------------------------------------------------- > 14:53:18 drbd0: resync bitmap: bits=47492149 words=1484130 > 14:53:18 drbd0: size = 181 GB (189968593 KB) > 14:53:20 drbd0: 189968596 KB now marked out-of-sync by on disk bit-map. > 14:53:20 drbd0: Assuming that all blocks are out of sync (aka FullSync) > 14:53:22 drbd0: 189968596 KB now marked out-of-sync by on disk bit-map. > 14:53:22 drbd0: Couldn't start thread (-4) This can only come from drbd/drbd_main.c. It is a failure of pid = kernel_thread(drbd_thread_setup, (void *) thi, CLONE_FS); in drbd_thread_start. I ain't no kernel hacker, so I don't really know what to think of this. Perhaps it's here where this all started? A problem with the kernel? > 14:53:22 drbd0: drbdsetup [3461]: cstate Unconfigured --> StandAlone > 14:54:02 drbd0: drbdsetup [3479]: cstate StandAlone --> Unconnected > 14:54:02 drbd0: drbdsetup [3479]: cstate Unconnected --> StandAlone > 14:54:02 drbd0: drbdsetup [3479]: cstate StandAlone --> Unconfigured > 14:54:02 drbd0: _drbd_thread_stop: (thi->task == NULL) in > /home/mip/rpm/BUILD/drbd-0.7.16/drbd/drbd_main.c:581 This is a failed D_ASSERT in _drbd_thread_stop, well... > 14:54:03 drbd0: resync bitmap: bits=47492149 words=1484130 > 14:54:03 drbd0: size = 181 GB (189968593 KB) > 14:54:05 drbd0: 181 GB marked out-of-sync by on disk bit-map. > 14:54:05 drbd0: No usable activity log found. > 14:54:05 drbd0: drbdsetup [3494]: cstate Unconfigured --> StandAlone > 14:54:05 drbd1: Creating state block > 14:54:05 drbd1: resync bitmap: bits=47492149 words=1484130 > 14:54:05 drbd1: size = 181 GB (189968593 KB) > 14:54:07 drbd1: 189968596 KB now marked out-of-sync by on disk bit-map. > 14:54:07 drbd1: Assuming that all blocks are out of sync (aka FullSync) > 14:54:09 drbd1: 189968596 KB now marked out-of-sync by on disk bit-map. > 14:54:09 drbd1: drbdsetup [3495]: cstate Unconfigured --> StandAlone > 14:54:09 drbd0: drbdsetup [3499]: cstate StandAlone --> Unconnected > 14:54:09 drbd0: drbd0_receiver [3500]: cstate Unconnected --> WFConnection > 14:54:09 drbd1: drbdsetup [3501]: cstate StandAlone --> Unconnected > 14:54:09 drbd1: drbd1_receiver [3502]: cstate Unconnected --> WFConnection > 14:54:34 drbd0: drbd0_receiver [3500]: cstate WFConnection --> > WFReportParams > 14:54:34 drbd0: Handshake successful: DRBD Network Protocol version 74 > 14:54:34 drbd0: Connection established. > 14:54:34 drbd0: I am(S): 0:00000001:00000001:00000001:00000001:00 > 14:54:34 drbd0: Peer(S): 0:00000001:00000001:00000001:00000001:00 > 14:54:34 drbd0: drbd0_receiver [3500]: cstate WFReportParams --> Connected > 14:54:34 drbd0: I am inconsistent, but there is no sync? BOTH nodes > inconsistent! > 14:54:34 drbd0: Secondary/Unknown --> Secondary/Secondary > 14:54:34 drbd1: drbd1_receiver [3502]: cstate WFConnection --> > WFReportParams > 14:54:34 drbd1: Handshake successful: DRBD Network Protocol version 74 > 14:54:34 drbd1: Connection established. > 14:54:34 drbd1: I am(S): 0:00000001:00000001:00000001:00000001:00 > 14:54:34 drbd1: Peer(S): 0:00000001:00000001:00000001:00000001:00 > 14:54:34 drbd1: drbd1_receiver [3502]: cstate WFReportParams --> Connected > 14:54:34 drbd1: I am inconsistent, but there is no sync? BOTH nodes > inconsistent! > 14:54:34 drbd1: Secondary/Unknown --> Secondary/Secondary > 14:55:05 drbd0: Secondary/Secondary --> Primary/Secondary > 14:55:05 drbd0: I am(P): 1:00000002:00000001:00000001:00000001:11 > 14:55:05 drbd0: Peer(S): 0:00000001:00000001:00000001:00000001:01 > 14:55:05 drbd0: drbd0_receiver [3500]: cstate Connected --> WFBitMapS > 14:55:05 drbd1: Secondary/Secondary --> Primary/Secondary > 14:55:05 drbd1: I am(P): 1:00000002:00000001:00000001:00000001:11 > 14:55:05 drbd1: Peer(S): 0:00000001:00000001:00000001:00000001:01 > 14:55:05 drbd1: drbd1_receiver [3502]: cstate Connected --> WFBitMapS > 14:55:05 drbd1: drbd1_receiver [3502]: cstate WFBitMapS --> SyncSource > 14:55:05 drbd1: Resync started as SyncSource (need to sync 189968596 KB > [47492149 bits set]). > 14:55:06 drbd0: drbd0_receiver [3500]: cstate WFBitMapS --> SyncSource > 14:55:06 drbd0: Resync started as SyncSource (need to sync 189968596 KB > [47492149 bits set]). > 14:59:24 drbd0: drbdsetup [3587]: cstate SyncSource --> Unconnected > 14:59:24 drbd0: drbd0_receiver [3500]: cstate Unconnected --> BrokenPipe > 14:59:24 drbd0: short read expecting header on sock: r=-512 > 14:59:24 drbd0: _drbd_thread_stop: (thi->task == NULL) in > /home/mip/rpm/BUILD/drbd-0.7.16/drbd/drbd_main.c:581 > 14:59:24 drbd0: asender terminated Here again, something wrong with the kernel thread... > 14:59:24 drbd0: _drbd_thread_stop: (thi->task == NULL) in > /home/mip/rpm/BUILD/drbd-0.7.16/drbd/drbd_main.c:581 > 14:59:24 drbd0: ASSERT( mdev->ee_in_use == 0 ) in > /home/mip/rpm/BUILD/drbd-0.7.16/drbd/drbd_receiver.c:1881 > 14:59:24 drbd0: unacked_cnt = 1536 > 14:59:24 drbd0: drbd0_receiver [3500]: cstate BrokenPipe --> StandAlone > 14:59:24 drbd0: Connection lost. > 14:59:24 drbd0: receiver terminated > 14:59:24 drbd0: drbdsetup [3587]: cstate StandAlone --> StandAlone > 14:59:35 drbd0: drbdsetup [3602]: cstate StandAlone --> Unconnected > 14:59:35 drbd0: drbd0_receiver [3603]: cstate Unconnected --> WFConnection > 14:59:35 drbd0: drbd0_receiver [3603]: cstate WFConnection --> > WFReportParams > 14:59:35 drbd0: Handshake successful: DRBD Network Protocol version 74 > 14:59:35 drbd0: Connection established. > 14:59:35 drbd0: I am(P): 1:00000002:00000001:00000001:00000001:10 > 14:59:35 drbd0: Peer(S): 0:00000002:00000001:00000001:00000001:01 > 14:59:35 drbd0: drbd0_receiver [3603]: cstate WFReportParams --> WFBitMapS > 14:59:35 drbd0: Primary/Unknown --> Primary/Secondary > 14:59:35 drbd0: drbd0_receiver [3603]: cstate WFBitMapS --> SyncSource > 14:59:35 drbd0: Resync started as SyncSource (need to sync 189968596 KB > [47492149 bits set]). > 14:59:35 drbd0: resync_work.cb == w_resume_next_sg, should be > w_resync_inactive This is from drbd_start_resync, just below a // FIXME: this was a PARANOIA_BUG_ON, but it triggered! ?? My comments are probably not much of a help, but I tried. I will look if I can consistently reproduce the problem. Best Regards, Michael Paesold