[DRBD-user] FAQ: Reconnecting after a temporary primary node failure

Mauricio Tavares raubvogel at gmail.com
Tue Mar 8 14:34:31 CET 2011

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


On Tue, Mar 8, 2011 at 8:03 AM, Felix Frank <ff at mpexnet.de> wrote:
>>       Which logs should I check besides the kernel ones I included in
>> the first post (for both db1 and db2)? The db1 one has not changed,
>> but since I've done drbdadm connect r0 in db1, db2's kernel log
>> (included below on its entirety from the moment I told db1 to
>> reconnect) is complaining about a split-brain
>>
>
> <snip>
>
>>
>>  even though it never became a primary node,
>>
>
> That's a bold claim facing a split-brain ;-)
>
      I hear you, which is why I am here hoping someone can see what I
am not seeing. =)

> grep for any occurences of the string "Primary" on your secondary node's
> log.
>
      To make sure I am not missing something, I decided to start
over.  First I manually solved split brain:

root at db2:~# drbdadm -- --discard-my-data connect r0
root at db2:~#

root at db2:~# tail /var/log/kern.log
Mar  8 08:08:39 db2 kernel: [83182.031454] block drbd0: peer( Unknown
-> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown ->
UpToDate )
Mar  8 08:08:39 db2 kernel: [83182.045544] block drbd0: conn(
WFBitMapT -> WFSyncUUID )
Mar  8 08:08:39 db2 kernel: [83182.046317] block drbd0: helper
command: /sbin/drbdadm before-resync-target minor-0
Mar  8 08:08:39 db2 kernel: [83182.051880] block drbd0: helper
command: /sbin/drbdadm before-resync-target minor-0 exit code 0 (0x0)
Mar  8 08:08:39 db2 kernel: [83182.051898] block drbd0: conn(
WFSyncUUID -> SyncTarget ) disk( UpToDate -> Inconsistent )
Mar  8 08:08:39 db2 kernel: [83182.051913] block drbd0: Began resync
as SyncTarget (will sync 18596 KB [4649 bits set]).
Mar  8 08:08:40 db2 kernel: [83182.963831] block drbd0: Resync done
(total 1 sec; paused 0 sec; 18596 K/sec)
Mar  8 08:08:40 db2 kernel: [83182.963839] block drbd0: conn(
SyncTarget -> Connected ) disk( Inconsistent -> UpToDate )
Mar  8 08:08:40 db2 kernel: [83182.963845] block drbd0: helper
command: /sbin/drbdadm after-resync-target minor-0
Mar  8 08:08:40 db2 kernel: [83182.966269] block drbd0: helper
command: /sbin/drbdadm after-resync-target minor-0 exit code 0 (0x0)
root at db2:~#

root at db1:~# tail /var/log/kern.log
Mar  8 08:08:39 db1 kernel: [50204.314933] block drbd0: Split-Brain
detected, manually solved. Sync from this node
Mar  8 08:08:39 db1 kernel: [50204.314944] block drbd0: peer( Unknown
-> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown ->
UpToDate )
Mar  8 08:08:39 db1 kernel: [50204.329532] block drbd0: conn(
WFBitMapS -> SyncSource ) pdsk( UpToDate -> Inconsistent )
Mar  8 08:08:39 db1 kernel: [50204.329553] block drbd0: Began resync
as SyncSource (will sync 18596 KB [4649 bits set]).
Mar  8 08:08:40 db1 kernel: [50205.248874] block drbd0: Resync done
(total 1 sec; paused 0 sec; 18596 K/sec)
Mar  8 08:08:40 db1 kernel: [50205.248894] block drbd0: conn(
SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
root at db1:~#

Then I disconnected db1 again. I am putting the time here just before
I disconnected ethernet on db1 so it should be easier to track down on
the logs:

root at db1:~# date
Tue Mar  8 08:11:23 EST 2011
root at db1:~#

So, in db2 (messages and syslog show the same output as kern.log),
Mar  8 08:11:40 db2 kernel: [83363.432218] block drbd0: PingAck did
not arrive in time.
Mar  8 08:11:40 db2 kernel: [83363.521913] block drbd0: peer( Primary
-> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate ->
DUnknown )
   ^^^ I guess this means it lost the primary, right?

Mar  8 08:11:40 db2 kernel: [83363.521931] block drbd0: asender terminated
Mar  8 08:11:40 db2 kernel: [83363.521933] block drbd0: Terminating
asender thread
Mar  8 08:11:40 db2 kernel: [83363.522013] block drbd0: short read
expecting header on sock: r=-512
Mar  8 08:11:41 db2 kernel: [83363.602519] block drbd0: Connection closed
Mar  8 08:11:41 db2 kernel: [83363.602526] block drbd0: conn(
NetworkFailure -> Unconnected )
Mar  8 08:11:41 db2 kernel: [83363.602531] block drbd0: receiver terminated
Mar  8 08:11:41 db2 kernel: [83363.602532] block drbd0: Restarting
receiver thread
Mar  8 08:11:41 db2 kernel: [83363.602534] block drbd0: receiver (re)started
Mar  8 08:11:41 db2 kernel: [83363.602537] block drbd0: conn(
Unconnected -> WFConnection )

I reconnected db1 after 8:15. db1 reports when link goes down and back up:

Mar  8 08:11:33 db1 kernel: [50378.324798] e1000: eth0 NIC Link is Down
Mar  8 08:11:40 db1 kernel: [50385.700785] block drbd0: PingAck did
not arrive in time.
Mar  8 08:11:40 db1 kernel: [50385.700804] block drbd0: peer(
Secondary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk(
UpToDate -> DUnknown )
Mar  8 08:11:40 db1 kernel: [50385.700886] block drbd0: asender terminated
Mar  8 08:11:40 db1 kernel: [50385.700900] block drbd0: Terminating
asender thread
Mar  8 08:11:40 db1 kernel: [50385.700925] block drbd0: short read
expecting header on sock: r=-512
Mar  8 08:11:40 db1 kernel: [50385.700949] block drbd0: Creating new
current UUID
Mar  8 08:11:40 db1 kernel: [50385.701155] block drbd0: Connection closed
Mar  8 08:11:40 db1 kernel: [50385.701157] block drbd0: conn(
NetworkFailure -> Unconnected )
Mar  8 08:11:40 db1 kernel: [50385.701163] block drbd0: receiver terminated
Mar  8 08:11:40 db1 kernel: [50385.701165] block drbd0: Restarting
receiver thread
Mar  8 08:11:40 db1 kernel: [50385.701167] block drbd0: receiver (re)started
Mar  8 08:11:40 db1 kernel: [50385.701169] block drbd0: conn(
Unconnected -> WFConnection )
Mar  8 08:11:40 db1 kernel: [50385.701176] block drbd0: bind before
connect failed, err = -99
Mar  8 08:11:40 db1 kernel: [50385.701177] block drbd0: conn(
WFConnection -> Disconnecting )
Mar  8 08:11:40 db1 kernel: [50385.701184] block drbd0: Discarding
network configuration.
Mar  8 08:11:40 db1 kernel: [50385.701189] block drbd0: Connection closed
Mar  8 08:11:40 db1 kernel: [50385.701190] block drbd0: conn(
Disconnecting -> StandAlone )
Mar  8 08:11:40 db1 kernel: [50385.703062] block drbd0: receiver terminated
Mar  8 08:11:40 db1 kernel: [50385.703067] block drbd0: Terminating
receiver thread
Mar  8 08:15:25 db1 kernel: [50610.433167] e1000: eth0 NIC Link is Up
1000 Mbps Full Duplex, Flow Control: RX
root at db1:~#

And, that is about it at this point in time (8:24). The db2
kern.log/syslog/messages have not been changed/updated since I
disconnected db1. The /proc/drdb files for both machines look like
this right now:

root at db1:~# cat /proc/drbd
version: 8.3.7 (api:88/proto:86-91)
GIT-hash: ea9e28dbff98e331a62bcbcc63a6135808fe2917 build by root at db1,
2011-03-07 15:01:39
 0: cs:StandAlone ro:Primary/Unknown ds:UpToDate/DUnknown   r----
    ns:18596 nr:0 dw:241020 dr:56357 al:36 bm:31 lo:0 pe:0 ua:0 ap:0
ep:1 wo:b oos:0
root at db1:~#

root at db2:~# cat /proc/drbd
version: 8.3.7 (api:88/proto:86-91)
GIT-hash: ea9e28dbff98e331a62bcbcc63a6135808fe2917 build by root at db2,
2011-03-07 10:22:02
 0: cs:WFConnection ro:Secondary/Unknown ds:UpToDate/DUnknown C r----
    ns:0 nr:18596 dw:383134 dr:271064 al:8 bm:65 lo:0 pe:0 ua:0 ap:0
ep:1 wo:b oos:0
root at db2:~#

Grepping for Primary in the log files in db2 only show what is in it's
kern.log above:

/var/log/syslog:Mar  8 08:11:40 db2 kernel: [83363.521913] block
drbd0: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure )
pdsk( Up
ToDate -> DUnknown )
/var/log/kern.log:Mar  8 08:11:40 db2 kernel: [83363.521913] block
drbd0: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure )
pdsk(
UpToDate -> DUnknown )
/var/log/messages:Mar  8 08:11:40 db2 kernel: [83363.521913] block
drbd0: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure )
pdsk(
UpToDate -> DUnknown )

I feel like I must be missing something obvious here but can't figure
out what...



More information about the drbd-user mailing list