Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
On Tue, Mar 8, 2011 at 8:03 AM, Felix Frank <ff at mpexnet.de> wrote: >> Which logs should I check besides the kernel ones I included in >> the first post (for both db1 and db2)? The db1 one has not changed, >> but since I've done drbdadm connect r0 in db1, db2's kernel log >> (included below on its entirety from the moment I told db1 to >> reconnect) is complaining about a split-brain >> > > <snip> > >> >> even though it never became a primary node, >> > > That's a bold claim facing a split-brain ;-) > I hear you, which is why I am here hoping someone can see what I am not seeing. =) > grep for any occurences of the string "Primary" on your secondary node's > log. > To make sure I am not missing something, I decided to start over. First I manually solved split brain: root at db2:~# drbdadm -- --discard-my-data connect r0 root at db2:~# root at db2:~# tail /var/log/kern.log Mar 8 08:08:39 db2 kernel: [83182.031454] block drbd0: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate ) Mar 8 08:08:39 db2 kernel: [83182.045544] block drbd0: conn( WFBitMapT -> WFSyncUUID ) Mar 8 08:08:39 db2 kernel: [83182.046317] block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0 Mar 8 08:08:39 db2 kernel: [83182.051880] block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0 exit code 0 (0x0) Mar 8 08:08:39 db2 kernel: [83182.051898] block drbd0: conn( WFSyncUUID -> SyncTarget ) disk( UpToDate -> Inconsistent ) Mar 8 08:08:39 db2 kernel: [83182.051913] block drbd0: Began resync as SyncTarget (will sync 18596 KB [4649 bits set]). Mar 8 08:08:40 db2 kernel: [83182.963831] block drbd0: Resync done (total 1 sec; paused 0 sec; 18596 K/sec) Mar 8 08:08:40 db2 kernel: [83182.963839] block drbd0: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate ) Mar 8 08:08:40 db2 kernel: [83182.963845] block drbd0: helper command: /sbin/drbdadm after-resync-target minor-0 Mar 8 08:08:40 db2 kernel: [83182.966269] block drbd0: helper command: /sbin/drbdadm after-resync-target minor-0 exit code 0 (0x0) root at db2:~# root at db1:~# tail /var/log/kern.log Mar 8 08:08:39 db1 kernel: [50204.314933] block drbd0: Split-Brain detected, manually solved. Sync from this node Mar 8 08:08:39 db1 kernel: [50204.314944] block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> UpToDate ) Mar 8 08:08:39 db1 kernel: [50204.329532] block drbd0: conn( WFBitMapS -> SyncSource ) pdsk( UpToDate -> Inconsistent ) Mar 8 08:08:39 db1 kernel: [50204.329553] block drbd0: Began resync as SyncSource (will sync 18596 KB [4649 bits set]). Mar 8 08:08:40 db1 kernel: [50205.248874] block drbd0: Resync done (total 1 sec; paused 0 sec; 18596 K/sec) Mar 8 08:08:40 db1 kernel: [50205.248894] block drbd0: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate ) root at db1:~# Then I disconnected db1 again. I am putting the time here just before I disconnected ethernet on db1 so it should be easier to track down on the logs: root at db1:~# date Tue Mar 8 08:11:23 EST 2011 root at db1:~# So, in db2 (messages and syslog show the same output as kern.log), Mar 8 08:11:40 db2 kernel: [83363.432218] block drbd0: PingAck did not arrive in time. Mar 8 08:11:40 db2 kernel: [83363.521913] block drbd0: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown ) ^^^ I guess this means it lost the primary, right? Mar 8 08:11:40 db2 kernel: [83363.521931] block drbd0: asender terminated Mar 8 08:11:40 db2 kernel: [83363.521933] block drbd0: Terminating asender thread Mar 8 08:11:40 db2 kernel: [83363.522013] block drbd0: short read expecting header on sock: r=-512 Mar 8 08:11:41 db2 kernel: [83363.602519] block drbd0: Connection closed Mar 8 08:11:41 db2 kernel: [83363.602526] block drbd0: conn( NetworkFailure -> Unconnected ) Mar 8 08:11:41 db2 kernel: [83363.602531] block drbd0: receiver terminated Mar 8 08:11:41 db2 kernel: [83363.602532] block drbd0: Restarting receiver thread Mar 8 08:11:41 db2 kernel: [83363.602534] block drbd0: receiver (re)started Mar 8 08:11:41 db2 kernel: [83363.602537] block drbd0: conn( Unconnected -> WFConnection ) I reconnected db1 after 8:15. db1 reports when link goes down and back up: Mar 8 08:11:33 db1 kernel: [50378.324798] e1000: eth0 NIC Link is Down Mar 8 08:11:40 db1 kernel: [50385.700785] block drbd0: PingAck did not arrive in time. Mar 8 08:11:40 db1 kernel: [50385.700804] block drbd0: peer( Secondary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown ) Mar 8 08:11:40 db1 kernel: [50385.700886] block drbd0: asender terminated Mar 8 08:11:40 db1 kernel: [50385.700900] block drbd0: Terminating asender thread Mar 8 08:11:40 db1 kernel: [50385.700925] block drbd0: short read expecting header on sock: r=-512 Mar 8 08:11:40 db1 kernel: [50385.700949] block drbd0: Creating new current UUID Mar 8 08:11:40 db1 kernel: [50385.701155] block drbd0: Connection closed Mar 8 08:11:40 db1 kernel: [50385.701157] block drbd0: conn( NetworkFailure -> Unconnected ) Mar 8 08:11:40 db1 kernel: [50385.701163] block drbd0: receiver terminated Mar 8 08:11:40 db1 kernel: [50385.701165] block drbd0: Restarting receiver thread Mar 8 08:11:40 db1 kernel: [50385.701167] block drbd0: receiver (re)started Mar 8 08:11:40 db1 kernel: [50385.701169] block drbd0: conn( Unconnected -> WFConnection ) Mar 8 08:11:40 db1 kernel: [50385.701176] block drbd0: bind before connect failed, err = -99 Mar 8 08:11:40 db1 kernel: [50385.701177] block drbd0: conn( WFConnection -> Disconnecting ) Mar 8 08:11:40 db1 kernel: [50385.701184] block drbd0: Discarding network configuration. Mar 8 08:11:40 db1 kernel: [50385.701189] block drbd0: Connection closed Mar 8 08:11:40 db1 kernel: [50385.701190] block drbd0: conn( Disconnecting -> StandAlone ) Mar 8 08:11:40 db1 kernel: [50385.703062] block drbd0: receiver terminated Mar 8 08:11:40 db1 kernel: [50385.703067] block drbd0: Terminating receiver thread Mar 8 08:15:25 db1 kernel: [50610.433167] e1000: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX root at db1:~# And, that is about it at this point in time (8:24). The db2 kern.log/syslog/messages have not been changed/updated since I disconnected db1. The /proc/drdb files for both machines look like this right now: root at db1:~# cat /proc/drbd version: 8.3.7 (api:88/proto:86-91) GIT-hash: ea9e28dbff98e331a62bcbcc63a6135808fe2917 build by root at db1, 2011-03-07 15:01:39 0: cs:StandAlone ro:Primary/Unknown ds:UpToDate/DUnknown r---- ns:18596 nr:0 dw:241020 dr:56357 al:36 bm:31 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0 root at db1:~# root at db2:~# cat /proc/drbd version: 8.3.7 (api:88/proto:86-91) GIT-hash: ea9e28dbff98e331a62bcbcc63a6135808fe2917 build by root at db2, 2011-03-07 10:22:02 0: cs:WFConnection ro:Secondary/Unknown ds:UpToDate/DUnknown C r---- ns:0 nr:18596 dw:383134 dr:271064 al:8 bm:65 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0 root at db2:~# Grepping for Primary in the log files in db2 only show what is in it's kern.log above: /var/log/syslog:Mar 8 08:11:40 db2 kernel: [83363.521913] block drbd0: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( Up ToDate -> DUnknown ) /var/log/kern.log:Mar 8 08:11:40 db2 kernel: [83363.521913] block drbd0: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown ) /var/log/messages:Mar 8 08:11:40 db2 kernel: [83363.521913] block drbd0: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown ) I feel like I must be missing something obvious here but can't figure out what...