[DRBD-user] DRBD sync messages every ten seconds

William Seligman seligman at nevis.columbia.edu
Wed Mar 9 16:49:08 CET 2011


On 3/4/11 4:58 PM, William Seligman wrote:
> On 3/4/11 12:38 PM, William Seligman wrote:
>> I've RTFM'ed and google'd on this problem. Now I ask the experts.
> 
> Now that I've joined this list, I looked at the archives directly. I see that
> Cory Coager reported the same problem:
> 
> http://lists.linbit.com/pipermail/drbd-user/2011-March/015735.html
> 
> Lars Ellenberg suggested that the problem was due to a bad NIC. Maybe... but
> what are the odds that two different systems have a bad NIC?

I just did a little archeology. I haven't always experienced these regular DRBD
sync error messages. They began when I made two changes to my configuration file:

- I switched from "Protocol C" to "Protocol A".
- I added "net { ping-timeout 100; }"

Is either of these changes likely to cause problems?

My next step would normally be to reverse those changes, but these are
production systems and it's hard for me to perform tests.
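
A side note on units, going by the drbd.conf man page as I read it: ping-timeout
is given in tenths of a second, while ping-int is given in whole seconds and
defaults to 10. Below is a minimal Python sketch of that arithmetic. The function
and constant names are made up for illustration, and whether any of this is
related to the ten-second pattern is only a guess on my part.

```python
# Sketch only: unit arithmetic for the DRBD "net" timing options.
# Assumption: units and defaults are as described in the drbd.conf man page
# (ping-timeout in tenths of a second, ping-int in seconds with default 10).

DEFAULT_PING_INT_S = 10     # default ping-int, in seconds
DEFAULT_PING_TIMEOUT = 5    # default ping-timeout, in tenths of a second


def ping_timeout_seconds(tenths):
    """Convert a ping-timeout setting (tenths of a second) to seconds."""
    return tenths / 10


configured = ping_timeout_seconds(100)                  # value from my config
default = ping_timeout_seconds(DEFAULT_PING_TIMEOUT)    # stock setting

print(f"ping-timeout 100 -> {configured} s (default {default} s)")
print(f"default ping-int -> {DEFAULT_PING_INT_S} s")
```

So "ping-timeout 100" means ten seconds rather than the default half second,
which happens to equal both the default ping-int and the spacing of the log
messages.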

>> Setup: Two systems; hypatia is primary, orestes is secondary. OS is Scientific
>> Linux 5.5: kernel 2.6.18-194.26.1.el5xen; DRBD version drbd-8.3.8.1-30.el5.
>>
>> Each has two partitions that are used for separate DRBD devices: /dev/md0
>> (software RAID1) and /dev/sdd2. On both systems:
>>
>> partition /dev/md0 => device drbd1
>> partition /dev/sdd2 => device drbd2
>>
>> The DRBD traffic goes over a single Ethernet cable that connects the two systems.
>>
>> For drbd1, the control hierarchy is Corosync->DRBD->LVM->Xen.
>> For drbd2, the control is Corosync->DRBD->Just mount the thing.
>>
>> The complicated one is drbd1, but it seems to work just fine. The problem
>> appears to be with drbd2, which doesn't do much of anything; it's a work/backup
>> directory which I use to take infrequent (~two months) snapshots of the virtual
>> machines on drbd1.
>>
>> Every ten seconds, the error messages at the end of this post appear in the log
>> of the primary system and there are similar lines on the secondary system. It
>> seems that drbd2 is losing its connection, re-establishing, and doing a re-sync.
>>
>> Everything works, most of the time. But once every few weeks there's enough of a
>> delay that Corosync takes notice and STONITHs one of the systems, which is a big
>> pain.
>>
>> I've tried:
>> - switching from Protocol C to Protocol A
>> - setting "net {ping-timeout 100;}"
>> - throttling the connection by "syncer {rate 10M;}" (used to be 100M)
>>
>> Any ideas?
>>
>> Mar  4 12:26:25 hypatia kernel: block drbd2: meta connection shut down by peer.
>> Mar  4 12:26:25 hypatia kernel: block drbd2: peer( Secondary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
>> Mar  4 12:26:25 hypatia kernel: block drbd2: asender terminated
>> Mar  4 12:26:25 hypatia kernel: block drbd2: Terminating asender thread
>> Mar  4 12:26:25 hypatia kernel: block drbd2: sock was shut down by peer
>> Mar  4 12:26:25 hypatia kernel: block drbd2: short read expecting header on sock: r=0
>> Mar  4 12:26:25 hypatia kernel: block drbd2: Creating new current UUID
>> Mar  4 12:26:25 hypatia kernel: block drbd2: Connection closed
>> Mar  4 12:26:25 hypatia kernel: block drbd2: conn( NetworkFailure -> Unconnected )
>> Mar  4 12:26:25 hypatia kernel: block drbd2: receiver terminated
>> Mar  4 12:26:25 hypatia kernel: block drbd2: Restarting receiver thread
>> Mar  4 12:26:25 hypatia kernel: block drbd2: receiver (re)started
>> Mar  4 12:26:25 hypatia kernel: block drbd2: conn( Unconnected -> WFConnection )
>> Mar  4 12:26:25 hypatia kernel: block drbd2: Handshake successful: Agreed network protocol version 94
>> Mar  4 12:26:25 hypatia kernel: block drbd2: conn( WFConnection -> WFReportParams )
>> Mar  4 12:26:25 hypatia kernel: block drbd2: Starting asender thread (from drbd2_receiver [7920])
>> Mar  4 12:26:25 hypatia kernel: block drbd2: data-integrity-alg: <not-used>
>> Mar  4 12:26:25 hypatia kernel: block drbd2: drbd_sync_handshake:
>> Mar  4 12:26:25 hypatia kernel: block drbd2: self C4884637D2C418DF:922772A0478F5E1F:2DE51139CD7C3DF7:EB27F748FC21DC65 bits:0 flags:0
>> Mar  4 12:26:25 hypatia kernel: block drbd2: peer 922772A0478F5E1E:0000000000000000:2DE51139CD7C3DF6:EB27F748FC21DC65 bits:0 flags:0
>> Mar  4 12:26:25 hypatia kernel: block drbd2: uuid_compare()=1 by rule 70
>> Mar  4 12:26:25 hypatia kernel: block drbd2: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> UpToDate )
>> Mar  4 12:26:25 hypatia kernel: block drbd2: conn( WFBitMapS -> SyncSource ) pdsk( UpToDate -> Inconsistent )
>> Mar  4 12:26:25 hypatia kernel: block drbd2: Began resync as SyncSource (will sync 0 KB [0 bits set]).
>> Mar  4 12:26:25 hypatia kernel: block drbd2: Resync done (total 1 sec; paused 0 sec; 0 K/sec)
>> Mar  4 12:26:25 hypatia kernel: block drbd2: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
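
To check that the disconnects really are ten seconds apart rather than
eyeballing the log, something like the following Python sketch could be used.
It assumes the syslog line layout shown above. The function names and the year
default are my own choices; syslog timestamps carry no year, so one has to be
supplied.

```python
import re
from datetime import datetime

# Matches lines such as:
#   Mar  4 12:26:25 hypatia kernel: block drbd2: meta connection shut down by peer.
LINE_RE = re.compile(
    r"^(\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:\s+block\s+(drbd\d+):\s+(.*)$"
)

# Message that opens each disconnect/reconnect cycle in the log above.
DISCONNECT_MSG = "meta connection shut down by peer"


def disconnect_times(lines, device="drbd2", year=2011):
    """Return the timestamps at which `device` logged a disconnect."""
    times = []
    for line in lines:
        # Drop mail-quoting markers ("> ", ">> ") and trailing whitespace.
        match = LINE_RE.match(line.lstrip("> ").rstrip())
        if not match:
            continue  # wrapped continuation line or unrelated message
        stamp, dev, message = match.groups()
        if dev == device and DISCONNECT_MSG in message:
            # Collapse the padding in "Mar  4" before parsing.
            stamp = " ".join(stamp.split())
            times.append(datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S"))
    return times


def disconnect_intervals(lines, device="drbd2", year=2011):
    """Return the gaps, in seconds, between consecutive disconnects."""
    times = disconnect_times(lines, device, year)
    return [(later - earlier).total_seconds()
            for earlier, later in zip(times, times[1:])]
```

Fed the lines of the primary's log (on these systems probably
/var/log/messages, though that path is an assumption),
disconnect_intervals(lines) should return values close to 10.0 if the pattern
is as regular as it looks.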

-- 
Bill Seligman             | Phone: (914) 591-2823
Nevis Labs, Columbia Univ | mailto://seligman@nevis.columbia.edu
PO Box 137                |
Irvington NY 10533 USA    | http://www.nevis.columbia.edu/~seligman/
