[DRBD-user] Question: how to avoid full resync

Eli Dorfman (Voltaire) dorfman.eli at gmail.com
Mon May 18 18:03:03 CEST 2009


Lars Ellenberg wrote:
> On Mon, May 18, 2009 at 05:18:41PM +0300, Eli Dorfman (Voltaire) wrote:
>> Lars Ellenberg wrote:
>>> On Sun, May 17, 2009 at 05:12:50PM +0300, Eli Dorfman (Voltaire) wrote:
>>>> Lars Ellenberg wrote:
>>>>> On Thu, May 14, 2009 at 04:45:44PM +0300, Eli Dorfman (Voltaire) wrote:
>>>>>> Lars Ellenberg wrote:
>>>>>>> On Wed, May 13, 2009 at 06:35:43PM +0300, Eli Dorfman (Voltaire) wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Assuming a setup of 2 nodes primary A and secondary B.
>>>>>>>> After primary node A reboots and B becomes primary, is there a way to avoid a full resync
>>>>>>>> from B to A?
>>>>>>> Uhm, well, yes: just do nothing and let DRBD do its job ;)
>>>>>>> Bitmap based resync is the "normal" resync operation.
>>>>>>>
>>>>>> A was rebooted and B had almost no changes,
>>>>>> yet when A comes up again it seems that B (drbd) performs a full resync.
>>>>> what makes you think so?
>>>> That's what I see in /proc/drbd.
>>>> We have a 20GB partition, and judging from the link speed and the time until the process completes,
>>>> it seems that drbd performs a full resync.
>>>>
>>>>>> Is this correct?
>>>>> it would not be expected.
>>>> So what could be the reason for this full resync?
>>> I still doubt the full sync, though.
>>> care to show logs?
>>>
>> After rebooting node A these are the messages on B:
>>
>> May 18 19:12:23 optimus2 kernel: drbd0: PingAck did not arrive in time.
>> May 18 19:12:23 optimus2 kernel: drbd0: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
>> May 18 19:12:23 optimus2 kernel: drbd0: asender terminated
>> May 18 19:12:23 optimus2 kernel: drbd0: Terminating asender thread
>> May 18 19:12:23 optimus2 kernel: drbd0: short read expecting header on sock: r=-512
>> May 18 19:12:23 optimus2 kernel: drbd0: Writing meta data super block now.
>> May 18 19:12:23 optimus2 kernel: drbd0: tl_clear()
>> May 18 19:12:23 optimus2 kernel: drbd0: Connection closed
>> May 18 19:12:23 optimus2 kernel: drbd0: conn( NetworkFailure -> Unconnected )
>> May 18 19:12:23 optimus2 kernel: drbd0: receiver terminated
>> May 18 19:12:23 optimus2 kernel: drbd0: receiver (re)started
>> May 18 19:12:23 optimus2 kernel: drbd0: conn( Unconnected -> WFConnection )
>> May 18 19:13:16 optimus2 ResourceManager[30674]: [30685]: info: Acquiring resource group: optimus2 172.30.3.99/24/eth0/172.30.3.255 drbddisk::ufmdb Filesystem::/dev/drbd0::/opt/ufm/files/::ext3 mysqld ufmd
>> May 18 19:13:16 optimus2 kernel: drbd0: role( Secondary -> Primary )
>> May 18 19:13:16 optimus2 kernel: drbd0: Writing meta data super block now.
>> May 18 19:13:16 optimus2 kernel: drbd0: Creating new current UUID
>> May 18 19:13:16 optimus2 kernel: drbd0: Writing meta data super block now.
>> May 18 19:13:16 optimus2 ResourceManager[30674]: [30965]: info: Running /etc/ha.d/resource.d/Filesystem /dev/drbd0 /opt/ufm/files/ ext3 start
>> May 18 19:13:16 optimus2 Filesystem[30978]: [31008]: INFO: Running start for /dev/drbd0 on /opt/ufm/files
>> May 18 19:13:16 optimus2 kernel: EXT3 FS on drbd0, internal journal
>> May 18 19:13:56 optimus2 kernel: drbd0: Handshake successful: Agreed network protocol version 88
>> May 18 19:13:56 optimus2 kernel: drbd0: conn( WFConnection -> WFReportParams )
>> May 18 19:13:56 optimus2 kernel: drbd0: Starting asender thread (from drbd0_receiver [22112])
>> May 18 19:13:56 optimus2 kernel: drbd0: data-integrity-alg: <not-used>
>> May 18 19:13:56 optimus2 kernel: drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> UpToDate )
>> May 18 19:13:56 optimus2 kernel: drbd0: Writing meta data super block now.
>> May 18 19:13:56 optimus2 kernel: drbd0: conn( WFBitMapS -> SyncSource ) pdsk( UpToDate -> Inconsistent )
>> May 18 19:13:56 optimus2 kernel: drbd0: Began resync as SyncSource (will sync 20482176 KB [5120544 bits set]).
>> May 18 19:13:56 optimus2 kernel: drbd0: Writing meta data super block now.
>>
>>
>> [root@optimus2 ~]# cat /proc/drbd
>> version: 8.2.6 (api:88/proto:86-88)
>> GIT-hash: 3e69822d3bb4920a8c1bfdf7d647169eba7d2eb4 build by buildsvn at c5-x8664-build, 2008-06-21 08:48:13
>>  0: cs:SyncSource st:Primary/Secondary ds:UpToDate/Inconsistent C r---
>>     ns:2244236 nr:482168 dw:483008 dr:2256073 al:18 bm:136 lo:3 pe:5 ua:253 ap:0 oos:18238236
>>         [=>..................] sync'ed: 11.0% (17810/20002)M
>>         finish: 0:24:30 speed: 12,320 (11,504) K/sec
>>
>>
>> It seems as if drbd performs a full resync of node A. Why?
>> The link is 100 Mbit/s, so drbd is actually using the maximum bandwidth.
> 
> for some reason (which only you can determine, as you know what happened),
> "Began resync as SyncSource (will sync 20482176 KB [5120544 bits set])"
> that many bits are set, so that much will be synced.
> 
> either you modified that much within 40 seconds (highly unlikely),
> or you tried something "interesting", and it did not work out.
> 
> have you been trying to outsmart DRBD?
> 
> what exactly is the sequence of events to reproduce this?
> something "interesting" in the logs on the other node?

We only rebooted node A (reboot -f).
These are the messages from node A:

May 18 17:14:17 shay2 kernel: drbd0: disk( Diskless -> Attaching )
May 18 17:14:17 shay2 kernel: drbd0: Starting worker thread (from cqueue/3 [155])
May 18 17:14:17 shay2 kernel: drbd0: Found 4 transactions (34 active extents) in activity log.
May 18 17:14:17 shay2 kernel: drbd0: max_segment_size ( = BIO size ) = 32768
May 18 17:14:17 shay2 kernel: drbd0: drbd_bm_resize called with capacity == 40964352
May 18 17:14:17 shay2 kernel: drbd0: resync bitmap: bits=5120544 words=80009
May 18 17:14:17 shay2 kernel: drbd0: size = 20 GB (20482176 KB)
May 18 17:14:17 shay2 kernel: drbd0: reading of bitmap took 15 jiffies
May 18 17:14:17 shay2 kernel: drbd0: recounting of set bits took additional 2 jiffies
May 18 17:14:17 shay2 kernel: drbd0: 20 GB (5120544 bits) marked out-of-sync by on disk bit-map.
May 18 17:14:17 shay2 kernel: drbd0: Marked additional 0 KB as out-of-sync based on AL.
May 18 17:14:17 shay2 kernel: drbd0: disk( Attaching -> UpToDate )
May 18 17:14:17 shay2 kernel: drbd0: Writing meta data super block now.
May 18 17:14:17 shay2 kernel: drbd0: conn( StandAlone -> Unconnected )
May 18 17:14:17 shay2 kernel: drbd0: Starting receiver thread (from drbd0_worker [5891])
May 18 17:14:17 shay2 kernel: drbd0: receiver (re)started
May 18 17:14:17 shay2 kernel: drbd0: conn( Unconnected -> WFConnection )
May 18 17:14:18 shay2 kernel: drbd0: Handshake successful: Agreed network protocol version 88
May 18 17:14:18 shay2 kernel: drbd0: conn( WFConnection -> WFReportParams )
May 18 17:14:18 shay2 kernel: drbd0: Starting asender thread (from drbd0_receiver [5900])
May 18 17:14:18 shay2 kernel: drbd0: data-integrity-alg: <not-used>
May 18 17:14:18 shay2 kernel: drbd0: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
May 18 17:14:18 shay2 kernel: drbd0: Writing meta data super block now.
May 18 17:14:18 shay2 kernel: drbd0: conn( WFBitMapT -> WFSyncUUID )
May 18 17:14:18 shay2 kernel: drbd0: helper command: /sbin/drbdadm before-resync-target
May 18 17:14:18 shay2 kernel: drbd0: conn( WFSyncUUID -> SyncTarget ) disk( UpToDate -> Inconsistent )
May 18 17:14:18 shay2 kernel: drbd0: Began resync as SyncTarget (will sync 20482176 KB [5120544 bits set]).
May 18 17:14:18 shay2 kernel: drbd0: Writing meta data super block now.
May 18 17:21:19 shay2 kernel: drbd0: peer( Primary -> Secondary )
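
As a sanity check on the numbers in these logs, here is a minimal Python sketch (not part of DRBD; the function names are made up for illustration). It assumes one bitmap bit covers 4 KiB, which is consistent with "5120544 bits" and "20482176 KB" above. It shows that the bits set cover the whole device, and node A already reports "20 GB (5120544 bits) marked out-of-sync by on disk bit-map" at attach time, before it connects to B.

```python
import re

# Assumption: one bitmap bit covers 4 KiB, which matches the figures in the logs.
KB_PER_BIT = 4

# Matches the tail of the "Began resync as ..." kernel message.
RESYNC_RE = re.compile(r"will sync (\d+) KB \[(\d+) bits set\]")


def parse_resync_line(line):
    """Return (kb_to_sync, bits_set) from a 'Began resync' kernel message."""
    match = RESYNC_RE.search(line)
    if match is None:
        raise ValueError("not a 'Began resync' line")
    return int(match.group(1)), int(match.group(2))


def out_of_sync_fraction(bits_set, device_kb):
    """Fraction of the device that the set bits mark as out of sync."""
    return bits_set * KB_PER_BIT / device_kb


line = ("May 18 17:14:18 shay2 kernel: drbd0: Began resync as SyncTarget "
        "(will sync 20482176 KB [5120544 bits set]).")
device_kb = 20482176  # from "size = 20 GB (20482176 KB)" in the log above

kb_to_sync, bits_set = parse_resync_line(line)
fraction = out_of_sync_fraction(bits_set, device_kb)
print(kb_to_sync, bits_set, fraction)
```

A fraction of 1.0 means every bit in the bitmap is set, so the resync covers the entire 20 GB rather than only the blocks changed while A was down.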

Thanks,
Eli

