[DRBD-user] Question: how to avoid full resync

Eli Dorfman (Voltaire) dorfman.eli at gmail.com
Tue May 19 13:39:24 CEST 2009

Lars Ellenberg wrote:
> On Mon, May 18, 2009 at 07:03:03PM +0300, Eli Dorfman (Voltaire) wrote:
>>>>>>>>>> Assuming a setup of 2 nodes primary A and secondary B.
>>>>>>>>>> After primary node A reboot and B became primary, is there a way to avoid full resync
>>>>>>>>>> from B to A?
>>>>>>>>> Uhm, well, yes: just do nothing and let DRBD do its job ;)
>>>>>>>>> Bitmap based resync is the "normal" resync operation.
>>>>>>>>>
>>>>>>>> After A was rebooted and B had almost no changes - 
>>>>>>>> yet when A goes up again it seems that B (drbd) performs full resync.  
>>>>>>> what makes you think so?
>>>>>> That's what I see in /proc/drbd.
>>>>>> we have a partition of 20GB and from the link speed and the time till process is completed,
>>>>>> it seems that drbd performs full resync.
>>>>>>
>>>>>>>> Is this correct?
>>>>>>> it would not be expected.
>>>>>> So what could be the reason for this full resync?
>>>>> I still doubt the full sync, though.
>>>>> care to show logs?
>>>>>
>>>> After rebooting node A these are the messages on B:
>>>>
>>>> May 18 19:12:23 optimus2 kernel: drbd0: PingAck did not arrive in time.
>>>> May 18 19:12:23 optimus2 kernel: drbd0: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
>>>> May 18 19:12:23 optimus2 kernel: drbd0: asender terminated
>>>> May 18 19:12:23 optimus2 kernel: drbd0: Terminating asender thread
>>>> May 18 19:12:23 optimus2 kernel: drbd0: short read expecting header on sock: r=-512
>>>> May 18 19:12:23 optimus2 kernel: drbd0: Writing meta data super block now.
>>>> May 18 19:12:23 optimus2 kernel: drbd0: tl_clear()
>>>> May 18 19:12:23 optimus2 kernel: drbd0: Connection closed
>>>> May 18 19:12:23 optimus2 kernel: drbd0: conn( NetworkFailure -> Unconnected )
>>>> May 18 19:12:23 optimus2 kernel: drbd0: receiver terminated
>>>> May 18 19:12:23 optimus2 kernel: drbd0: receiver (re)started
>>>> May 18 19:12:23 optimus2 kernel: drbd0: conn( Unconnected -> WFConnection )
>>>> May 18 19:13:16 optimus2 ResourceManager[30674]: [30685]: info: Acquiring resource group: optimus2 172.30.3.99/24/eth0/172.30.3.255 drbddisk::ufmdb Filesystem::/dev/drbd0::/opt/ufm/files/::ext3 mysqld ufmd
>>>> May 18 19:13:16 optimus2 kernel: drbd0: role( Secondary -> Primary )
>>>> May 18 19:13:16 optimus2 kernel: drbd0: Writing meta data super block now.
>>>> May 18 19:13:16 optimus2 kernel: drbd0: Creating new current UUID
>>>> May 18 19:13:16 optimus2 kernel: drbd0: Writing meta data super block now.
>>>> May 18 19:13:16 optimus2 ResourceManager[30674]: [30965]: info: Running /etc/ha.d/resource.d/Filesystem /dev/drbd0 /opt/ufm/files/ ext3 start
>>>> May 18 19:13:16 optimus2 Filesystem[30978]: [31008]: INFO: Running start for /dev/drbd0 on /opt/ufm/files
>>>> May 18 19:13:16 optimus2 kernel: EXT3 FS on drbd0, internal journal
>>>> May 18 19:13:56 optimus2 kernel: drbd0: Handshake successful: Agreed network protocol version 88
>>>> May 18 19:13:56 optimus2 kernel: drbd0: conn( WFConnection -> WFReportParams )
>>>> May 18 19:13:56 optimus2 kernel: drbd0: Starting asender thread (from drbd0_receiver [22112])
>>>> May 18 19:13:56 optimus2 kernel: drbd0: data-integrity-alg: <not-used>
>>>> May 18 19:13:56 optimus2 kernel: drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> UpToDate )
>>>> May 18 19:13:56 optimus2 kernel: drbd0: Writing meta data super block now.
>>>> May 18 19:13:56 optimus2 kernel: drbd0: conn( WFBitMapS -> SyncSource ) pdsk( UpToDate -> Inconsistent )
>>>> May 18 19:13:56 optimus2 kernel: drbd0: Began resync as SyncSource (will sync 20482176 KB [5120544 bits set]).
>>>> May 18 19:13:56 optimus2 kernel: drbd0: Writing meta data super block now.
>>>>
>>>>
>>>> [root@optimus2 ~]# cat /proc/drbd
>>>> version: 8.2.6 (api:88/proto:86-88)
>>>> GIT-hash: 3e69822d3bb4920a8c1bfdf7d647169eba7d2eb4 build by buildsvn@c5-x8664-build, 2008-06-21 08:48:13
>>>>  0: cs:SyncSource st:Primary/Secondary ds:UpToDate/Inconsistent C r---
>>>>     ns:2244236 nr:482168 dw:483008 dr:2256073 al:18 bm:136 lo:3 pe:5 ua:253 ap:0 oos:18238236
>>>>         [=>..................] sync'ed: 11.0% (17810/20002)M
>>>>         finish: 0:24:30 speed: 12,320 (11,504) K/sec
>>>>
>>>>
>>>> It seems as if drbd performs full resync of node A - why?
>>>> The link is 100 Mb so drbd is actually using the maximal bandwidth.
>>> for some reason (which only you can determine, as you know what happened),
>>> "Began resync as SyncSource (will sync 20482176 KB [5120544 bits set])"
>>> that much bits are set, so that much will be synced.
>>>
>>> either you modified that much within 40 seconds (highly unlikely),
>>> or you tried something "interessting", and it did not work out.
>>>
>>> have you been trying to outsmart DRBD?
>>>
>>> what exactly is the sequence of events to reproduce this?
>>> something "interessting" in the logs on the other node?
>> We only rebooted node A (reboot -f)
>> These are the messages from node A
>>
>> May 18 17:14:17 shay2 kernel: drbd0: disk( Diskless -> Attaching )
>> May 18 17:14:17 shay2 kernel: drbd0: Starting worker thread (from cqueue/3 [155])
>> May 18 17:14:17 shay2 kernel: drbd0: Found 4 transactions (34 active extents) in activity log.
>> May 18 17:14:17 shay2 kernel: drbd0: max_segment_size ( = BIO size ) = 32768
>> May 18 17:14:17 shay2 kernel: drbd0: drbd_bm_resize called with capacity == 40964352
>> May 18 17:14:17 shay2 kernel: drbd0: resync bitmap: bits=5120544 words=80009
>> May 18 17:14:17 shay2 kernel: drbd0: size = 20 GB (20482176 KB)
>> May 18 17:14:17 shay2 kernel: drbd0: reading of bitmap took 15 jiffies
>> May 18 17:14:17 shay2 kernel: drbd0: recounting of set bits took additional 2 jiffies
>> May 18 17:14:17 shay2 kernel: drbd0: 20 GB (5120544 bits) marked out-of-sync by on disk bit-map.
> 
> there.  20 GB marked out of sync by on disk bitmap.
> this node's bitmap tells it to sync it all.
> 
> what do you (well, or your system and setup) do after reboot?
> what did you do before reboot?
> what did you do since first creation of the drbd,
> what is the history of this device?
> which "procedure" did you follow to "set it all up"?
> 
> I cannot tell you yet _why_ the bitmap was "all dirty".
> but it was.
> 

It turns out that after DRBD is installed, we run a procedure that skips the
initial sync between nodes A and B, since we know that both partitions are empty.
I assume this is the problem.
After that, we create the file system on the master and install the database files.
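
As a sanity check on the numbers in the logs above, here is a small Python
sketch. It assumes 512-byte sectors and 4 KiB of data per bitmap bit; neither
value is stated in this thread, but both match the logged figures. The constant
and function names are only illustrative.

```python
# Figures copied from the kernel log lines quoted in this thread.
BITS_TOTAL = 5120544          # "resync bitmap: bits=5120544"
BITS_SET = 5120544            # "will sync 20482176 KB [5120544 bits set]"
CAPACITY_SECTORS = 40964352   # "drbd_bm_resize called with capacity == 40964352"
DEVICE_KB = 20482176          # "size = 20 GB (20482176 KB)"

# Assumptions, not stated in the thread: sector size and bitmap granularity.
SECTOR_BYTES = 512
KB_PER_BIT = 4


def kb_from_sectors(sectors: int) -> int:
    """Convert a capacity in sectors to KB."""
    return sectors * SECTOR_BYTES // 1024


def kb_from_bits(bits: int) -> int:
    """Amount of data (in KB) covered by the given number of bitmap bits."""
    return bits * KB_PER_BIT


def dirty_fraction(bits_set: int, bits_total: int) -> float:
    """Fraction of the device that the bitmap marks as out of sync."""
    return bits_set / bits_total


if __name__ == "__main__":
    print("device size from capacity:", kb_from_sectors(CAPACITY_SECTORS), "KB")
    print("data marked out of sync:  ", kb_from_bits(BITS_SET), "KB")
    print("fraction of device dirty: ", dirty_fraction(BITS_SET, BITS_TOTAL))
```

Under those assumptions, the set bits cover exactly the logged device size, so
the on-disk bitmap on node A marks the entire device as out of sync.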

Is there a safe way to perform a fast initial sync when both partitions are empty?

Thanks,
Eli



