[DRBD-user] Problem with new setup/0.7.16: stalled sync

Michael Paesold mpaesold at gmx.at
Tue Feb 21 09:28:14 CET 2006

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


Just some analysis from myself:

Michael Paesold wrote:
> Logs from primary/sync source: (xen03)
> -------------------------------------------------------
> 14:53:18 drbd0: resync bitmap: bits=47492149 words=1484130
> 14:53:18 drbd0: size = 181 GB (189968593 KB)
> 14:53:20 drbd0: 189968596 KB now marked out-of-sync by on disk bit-map.
> 14:53:20 drbd0: Assuming that all blocks are out of sync (aka FullSync)
> 14:53:22 drbd0: 189968596 KB now marked out-of-sync by on disk bit-map.
> 14:53:22 drbd0: Couldn't start thread (-4)

This can only come from drbd/drbd_main.c. It is a failure of
pid = kernel_thread(drbd_thread_setup, (void *) thi, CLONE_FS); in 
drbd_thread_start.

I ain't no kernel hacker, so I don't really know what to think of this. 
Perhaps it's here where this all started? A problem with the kernel?


> 14:53:22 drbd0: drbdsetup [3461]: cstate Unconfigured --> StandAlone
> 14:54:02 drbd0: drbdsetup [3479]: cstate StandAlone --> Unconnected
> 14:54:02 drbd0: drbdsetup [3479]: cstate Unconnected --> StandAlone
> 14:54:02 drbd0: drbdsetup [3479]: cstate StandAlone --> Unconfigured
> 14:54:02 drbd0: _drbd_thread_stop: (thi->task == NULL) in 
> /home/mip/rpm/BUILD/drbd-0.7.16/drbd/drbd_main.c:581

This is a failed D_ASSERT in _drbd_thread_stop, well...

> 14:54:03 drbd0: resync bitmap: bits=47492149 words=1484130
> 14:54:03 drbd0: size = 181 GB (189968593 KB)
> 14:54:05 drbd0: 181 GB marked out-of-sync by on disk bit-map.
> 14:54:05 drbd0: No usable activity log found.
> 14:54:05 drbd0: drbdsetup [3494]: cstate Unconfigured --> StandAlone
> 14:54:05 drbd1: Creating state block
> 14:54:05 drbd1: resync bitmap: bits=47492149 words=1484130
> 14:54:05 drbd1: size = 181 GB (189968593 KB)
> 14:54:07 drbd1: 189968596 KB now marked out-of-sync by on disk bit-map.
> 14:54:07 drbd1: Assuming that all blocks are out of sync (aka FullSync)
> 14:54:09 drbd1: 189968596 KB now marked out-of-sync by on disk bit-map.
> 14:54:09 drbd1: drbdsetup [3495]: cstate Unconfigured --> StandAlone
> 14:54:09 drbd0: drbdsetup [3499]: cstate StandAlone --> Unconnected
> 14:54:09 drbd0: drbd0_receiver [3500]: cstate Unconnected --> WFConnection
> 14:54:09 drbd1: drbdsetup [3501]: cstate StandAlone --> Unconnected
> 14:54:09 drbd1: drbd1_receiver [3502]: cstate Unconnected --> WFConnection
> 14:54:34 drbd0: drbd0_receiver [3500]: cstate WFConnection --> 
> WFReportParams
> 14:54:34 drbd0: Handshake successful: DRBD Network Protocol version 74
> 14:54:34 drbd0: Connection established.
> 14:54:34 drbd0: I am(S): 0:00000001:00000001:00000001:00000001:00
> 14:54:34 drbd0: Peer(S): 0:00000001:00000001:00000001:00000001:00
> 14:54:34 drbd0: drbd0_receiver [3500]: cstate WFReportParams --> Connected
> 14:54:34 drbd0: I am inconsistent, but there is no sync? BOTH nodes 
> inconsistent!
> 14:54:34 drbd0: Secondary/Unknown --> Secondary/Secondary
> 14:54:34 drbd1: drbd1_receiver [3502]: cstate WFConnection --> 
> WFReportParams
> 14:54:34 drbd1: Handshake successful: DRBD Network Protocol version 74
> 14:54:34 drbd1: Connection established.
> 14:54:34 drbd1: I am(S): 0:00000001:00000001:00000001:00000001:00
> 14:54:34 drbd1: Peer(S): 0:00000001:00000001:00000001:00000001:00
> 14:54:34 drbd1: drbd1_receiver [3502]: cstate WFReportParams --> Connected
> 14:54:34 drbd1: I am inconsistent, but there is no sync? BOTH nodes 
> inconsistent!
> 14:54:34 drbd1: Secondary/Unknown --> Secondary/Secondary
> 14:55:05 drbd0: Secondary/Secondary --> Primary/Secondary
> 14:55:05 drbd0: I am(P): 1:00000002:00000001:00000001:00000001:11
> 14:55:05 drbd0: Peer(S): 0:00000001:00000001:00000001:00000001:01
> 14:55:05 drbd0: drbd0_receiver [3500]: cstate Connected --> WFBitMapS
> 14:55:05 drbd1: Secondary/Secondary --> Primary/Secondary
> 14:55:05 drbd1: I am(P): 1:00000002:00000001:00000001:00000001:11
> 14:55:05 drbd1: Peer(S): 0:00000001:00000001:00000001:00000001:01
> 14:55:05 drbd1: drbd1_receiver [3502]: cstate Connected --> WFBitMapS
> 14:55:05 drbd1: drbd1_receiver [3502]: cstate WFBitMapS --> SyncSource
> 14:55:05 drbd1: Resync started as SyncSource (need to sync 189968596 KB 
> [47492149 bits set]).
> 14:55:06 drbd0: drbd0_receiver [3500]: cstate WFBitMapS --> SyncSource
> 14:55:06 drbd0: Resync started as SyncSource (need to sync 189968596 KB 
> [47492149 bits set]).
> 14:59:24 drbd0: drbdsetup [3587]: cstate SyncSource --> Unconnected
> 14:59:24 drbd0: drbd0_receiver [3500]: cstate Unconnected --> BrokenPipe
> 14:59:24 drbd0: short read expecting header on sock: r=-512
> 14:59:24 drbd0: _drbd_thread_stop: (thi->task == NULL) in 
> /home/mip/rpm/BUILD/drbd-0.7.16/drbd/drbd_main.c:581
> 14:59:24 drbd0: asender terminated

Here again, something wrong with the kernel thread...

> 14:59:24 drbd0: _drbd_thread_stop: (thi->task == NULL) in 
> /home/mip/rpm/BUILD/drbd-0.7.16/drbd/drbd_main.c:581
> 14:59:24 drbd0: ASSERT( mdev->ee_in_use == 0 ) in 
> /home/mip/rpm/BUILD/drbd-0.7.16/drbd/drbd_receiver.c:1881
> 14:59:24 drbd0: unacked_cnt = 1536
> 14:59:24 drbd0: drbd0_receiver [3500]: cstate BrokenPipe --> StandAlone
> 14:59:24 drbd0: Connection lost.
> 14:59:24 drbd0: receiver terminated
> 14:59:24 drbd0: drbdsetup [3587]: cstate StandAlone --> StandAlone
> 14:59:35 drbd0: drbdsetup [3602]: cstate StandAlone --> Unconnected
> 14:59:35 drbd0: drbd0_receiver [3603]: cstate Unconnected --> WFConnection
> 14:59:35 drbd0: drbd0_receiver [3603]: cstate WFConnection --> 
> WFReportParams
> 14:59:35 drbd0: Handshake successful: DRBD Network Protocol version 74
> 14:59:35 drbd0: Connection established.
> 14:59:35 drbd0: I am(P): 1:00000002:00000001:00000001:00000001:10
> 14:59:35 drbd0: Peer(S): 0:00000002:00000001:00000001:00000001:01
> 14:59:35 drbd0: drbd0_receiver [3603]: cstate WFReportParams --> WFBitMapS
> 14:59:35 drbd0: Primary/Unknown --> Primary/Secondary
> 14:59:35 drbd0: drbd0_receiver [3603]: cstate WFBitMapS --> SyncSource
> 14:59:35 drbd0: Resync started as SyncSource (need to sync 189968596 KB 
> [47492149 bits set]).
> 14:59:35 drbd0: resync_work.cb == w_resume_next_sg, should be 
> w_resync_inactive

This is from drbd_start_resync, just below a
// FIXME: this was a PARANOIA_BUG_ON, but it triggered! ??


My comments are probably not much of a help, but I tried. I will look if I 
can consistently reproduce the problem.

Best Regards,
Michael Paesold 





More information about the drbd-user mailing list