[DRBD-user] DRBD tweaks

Thu May 27 21:40:19 CEST 2004

/ 2004-05-27 13:45:30 -0500
\ Roy Bixler:
> We have been using DRBD for close to a year now and it has been working well 
> for us.  However, recently a couple of messages have appeared in the log and 
> I'm not sure of their severity or of the best way of remedying them.  We use 
> Linux 2.4.26 kernel, DRBD 0.6.12 and the ext3 filesystem on a Debian Woody 
> system.
> 
> First, intermittently (a week or so apart), I saw the following message:
> 
> drbd0: [kjournald/982] sock_sendmsg time expired, ko = 4294967295

this happens if for the duration of one "timeout" (typically 6 seconds)
drbd cannot get one data block (typically 4K) through on its data
socket, but the drbd ping packets on the meta socket are still answered
in time.

this means that sometimes your network or nodes (maybe memory?)
are really busy.

the ko count is reset with each new data block,
and decremented with each drbd ping packet.
if the ko count hits zero, the connection is dropped (and NOT reestablished!);
to reconnect, you have to "/etc/init.d/drbd reconnect <resource_name>".
the default ko count is zero, so at the first ping it wraps around
to (1<<32) - 1 == 4294967295, which effectively disables it;
you still see the warnings about it in the kernel log, so you know that
there probably is something wrong.
to enable the connection drop -> StandAlone, set ko-count in the net
section to some not too small positive value.
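
the wrap-around described above can be sketched in a few lines of python.
this is only a model of the description in this mail, not the actual drbd
kernel code; the function name ko_after_pings and the assumption that the
counter behaves like an unsigned 32-bit integer are mine, for illustration.

```python
MASK32 = 0xFFFFFFFF  # assumption: the counter behaves like an unsigned 32-bit int


def ko_after_pings(configured_ko, pings):
    """Model one stuck data block while `pings` drbd pings are still answered.

    Returns (ko, dropped): the counter value and whether the connection
    would have been dropped. Hypothetical helper, not a drbd API.
    """
    ko = configured_ko & MASK32      # reset when a new data block is sent
    for _ in range(pings):
        ko = (ko - 1) & MASK32       # decremented with each ping, wrapping at 0
        if ko == 0:
            return ko, True          # ko count hit zero: connection dropped
    return ko, False


# default ko count of zero: the first ping wraps it instead of hitting zero
print(ko_after_pings(0, 1))   # (4294967295, False)
# a small positive ko count: dropped after that many pings
print(ko_after_pings(3, 3))   # (0, True)
```

with the default of zero the counter never reaches zero in any realistic
time, which is why the log shows ko = 4294967295 but the connection stays up.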

> Today, I saw this message:
> 
> drbd0: transferlog too small!!
> 
> I believe this involves setting the "tl-size" parameter.  How serious is this?  

you risk violating strict write ordering.
a violation is not very likely, but the risk is there.

> What would be the best way to change this on both nodes?  For example, would 
> a procedure like the following work?  (Would it be overkill?)
> 
> a) change "tl-size" on secondary
> b) do /etc/init.d/drbd reconnect
> c) fail over to secondary
> d) change "tl-size" on former primary
> e) repeat step b
> f) fail back to original primary

you don't need to fail over.
just edit the config file(s) [1] for increased tl-size,
and afterwards do
 /etc/init.d/drbd reconnect
on both nodes. you will see a couple of very fast SyncQuicks.

> Another thing is that I am running protocol B and I understand that protocol C 
> is preferred.  I assume the only way to safely switch protocols is to take 
> down the entire cluster, make the change and bring it back up again.

you can
 a) edit the conf files [1] for proto C  # this will not have any effect yet
 b) stop all drbds which are currently in Secondary state
 c) do /etc/init.d/drbd reconnect for those in Primary state
 d) start all drbds stopped in b)

if all your primaries are on the same node,
b-d simplify to
 on Secondary: drbd stop
 on Primary:   drbd reconnect
 on Secondary: drbd start
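
the ordering of steps b-d can also be written down as a small python
sketch. it only computes the order of actions and runs nothing; the function
name plan_protocol_switch, the node names alpha and beta, and the layout of
the roles mapping are hypothetical, chosen for illustration.

```python
def plan_protocol_switch(roles):
    """Order the stop / reconnect / start actions for steps b) to d).

    `roles` maps node name -> {resource name: "Primary" or "Secondary"}.
    Returns a list of (node, action, resource) tuples in execution order.
    Assumes the conf files were already edited on all nodes (step a).
    """
    secondaries = [(node, res)
                   for node, by_res in roles.items()
                   for res, state in by_res.items() if state == "Secondary"]
    primaries = [(node, res)
                 for node, by_res in roles.items()
                 for res, state in by_res.items() if state == "Primary"]

    plan = [(node, "stop", res) for node, res in secondaries]        # step b
    plan += [(node, "reconnect", res) for node, res in primaries]    # step c
    plan += [(node, "start", res) for node, res in secondaries]      # step d
    return plan


# all primaries on one node: reduces to the three-line version above
roles = {"alpha": {"drbd0": "Primary"}, "beta": {"drbd0": "Secondary"}}
for step in plan_protocol_switch(roles):
    print(step)
```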

	Lars Ellenberg

[1]
  you can edit a copy first,
  and run drbd --config drbd.conf.new checkconfig
  to be sure you did not do something stupid...