Hello,

I'm definitely doing something wrong: when I have two nodes running in
primary-primary mode and reboot one of the machines, it comes back with:

root@oscar ~> cat /proc/drbd
version: 8.3.11 (api:88/proto:86-96)
srcversion: EBD919353D1D1CCDD0DFBD3
 0: cs:StandAlone ro:Primary/Unknown ds:UpToDate/DUnknown r-----
    ns:0 nr:0 dw:789 dr:263507 al:43 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:172036

Then I have to disconnect it, set it to secondary, reconnect while
discarding its own data, and then set it as primary again.

This is how it looks in dmesg of the machine that wasn't rebooted:

[41706.085879] block drbd0: PingAck did not arrive in time.
[41706.085888] block drbd0: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
[41706.086007] block drbd0: new current UUID 62770026DDB5FC9D:1AD40906305F01A9:E24FA72FCFB3A8FD:E24EA72FCFB3A8FD
[41706.136868] block drbd0: asender terminated
[41706.136874] block drbd0: Terminating drbd0_asender
[41706.145263] block drbd0: Connection closed
[41706.145270] block drbd0: conn( NetworkFailure -> Unconnected )
[41706.145275] block drbd0: receiver terminated
[41706.145278] block drbd0: Restarting drbd0_receiver
[41706.145281] block drbd0: receiver (re)started
[41706.145285] block drbd0: conn( Unconnected -> WFConnection )
[41790.980795] block drbd0: Handshake successful: Agreed network protocol version 96
[41790.980807] block drbd0: conn( WFConnection -> WFReportParams )
[41790.980927] block drbd0: Starting asender thread (from drbd0_receiver [32089])
[41790.981155] block drbd0: data-integrity-alg: <not-used>
[41790.981170] block drbd0: drbd_sync_handshake:
[41790.981178] block drbd0: self 62770026DDB5FC9D:1AD40906305F01A9:E24FA72FCFB3A8FD:E24EA72FCFB3A8FD bits:87 flags:0
[41790.981183] block drbd0: peer F2FB2C2B205B3041:1AD40906305F01A9:E24FA72FCFB3A8FC:E24EA72FCFB3A8FD bits:43008 flags:2
[41790.981187] block drbd0: uuid_compare()=100 by rule 90
[41790.981191] block drbd0: helper command: /sbin/drbdadm initial-split-brain minor-0
[41790.983528] block drbd0: helper command: /sbin/drbdadm initial-split-brain minor-0 exit code 0 (0x0)
[41790.983531] block drbd0: Split-Brain detected but unresolved, dropping connection!
[41790.983534] block drbd0: helper command: /sbin/drbdadm split-brain minor-0
[41790.985235] block drbd0: meta connection shut down by peer.
[41790.985238] block drbd0: conn( WFReportParams -> NetworkFailure )
[41790.985242] block drbd0: asender terminated
[41790.985243] block drbd0: Terminating drbd0_asender
[41790.985274] block drbd0: helper command: /sbin/drbdadm split-brain minor-0 exit code 0 (0x0)
[41790.985277] block drbd0: conn( NetworkFailure -> Disconnecting )
[41790.985279] block drbd0: error receiving ReportState, l: 4!
[41790.985310] block drbd0: Connection closed
[41790.985315] block drbd0: conn( Disconnecting -> StandAlone )
[41790.985325] block drbd0: receiver terminated
[41790.985327] block drbd0: Terminating drbd0_receiver

And this is how it looks in dmesg of the rebooted machine:

[ 7.709704] drbd: initialized. Version: 8.3.11 (api:88/proto:86-96)
[ 7.709707] drbd: srcversion: EBD919353D1D1CCDD0DFBD3
[ 7.709708] drbd: registered as block device major 147
[ 7.709709] drbd: minor_table @ 0xffff8805fab93000
[ 7.747400] md2: unknown partition table
[ 7.747804] block drbd0: Starting worker thread (from drbdsetup-83 [1792])
[ 7.747919] block drbd0: disk( Diskless -> Attaching )
[ 7.816449] block drbd0: Found 4 transactions (49 active extents) in activity log.
[ 7.816453] block drbd0: Method to ensure write ordering: flush
[ 7.816457] block drbd0: max BIO size = 131072
[ 7.816461] block drbd0: drbd_bm_resize called with capacity == 838832808
[ 7.818314] block drbd0: resync bitmap: bits=104854101 words=1638346 pages=3200
[ 7.818318] block drbd0: size = 400 GB (419416404 KB)
[ 8.394005] block drbd0: bitmap READ of 3200 pages took 144 jiffies
[ 8.395579] block drbd0: recounting of set bits took additional 1 jiffies
[ 8.395582] block drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
[ 8.395593] block drbd0: Marked additional 168 MB as out-of-sync based on AL.
[ 8.395622] block drbd0: bitmap WRITE of 0 pages took 0 jiffies
[ 8.425658] block drbd0: 168 MB (43008 bits) marked out-of-sync by on disk bit-map.
[ 8.425667] block drbd0: disk( Attaching -> UpToDate )
[ 8.425671] block drbd0: attached to UUIDs F2FB2C2B205B3041:1AD40906305F01A9:E24FA72FCFB3A8FC:E24EA72FCFB3A8FD
[ 8.458239] block drbd0: conn( StandAlone -> Unconnected )
[ 8.458246] block drbd0: Starting receiver thread (from drbd0_worker [1793])
[ 8.458301] block drbd0: receiver (re)started
[ 8.458306] block drbd0: conn( Unconnected -> WFConnection )
[ 8.635026] block drbd0: role( Secondary -> Primary )
[ 8.701089] OCFS2 Node Manager 1.5.0
[ 8.707547] OCFS2 DLM 1.5.0
[ 8.713672] OCFS2 DLMFS 1.5.0
[ 8.713730] OCFS2 User DLM kernel interface loaded
[ 9.185831] e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx
[ 13.436595] block drbd0: Handshake successful: Agreed network protocol version 96
[ 13.436605] block drbd0: conn( WFConnection -> WFReportParams )
[ 13.436733] block drbd0: Starting asender thread (from drbd0_receiver [1799])
[ 13.437262] block drbd0: data-integrity-alg: <not-used>
[ 13.437460] block drbd0: drbd_sync_handshake:
[ 13.437466] block drbd0: self F2FB2C2B205B3041:1AD40906305F01A9:E24FA72FCFB3A8FC:E24EA72FCFB3A8FD bits:43008 flags:0
[ 13.437471] block drbd0: peer 62770026DDB5FC9D:1AD40906305F01A9:E24FA72FCFB3A8FD:E24EA72FCFB3A8FD bits:87 flags:0
[ 13.437475] block drbd0: uuid_compare()=100 by rule 90
[ 13.437480] block drbd0: helper command: /sbin/drbdadm initial-split-brain minor-0
[ 13.439492] block drbd0: helper command: /sbin/drbdadm initial-split-brain minor-0 exit code 0 (0x0)
[ 13.439495] block drbd0: Split-Brain detected but unresolved, dropping connection!
[ 13.439498] block drbd0: helper command: /sbin/drbdadm split-brain minor-0
[ 13.441096] block drbd0: helper command: /sbin/drbdadm split-brain minor-0 exit code 0 (0x0)
[ 13.441099] block drbd0: conn( WFReportParams -> Disconnecting )
[ 13.441103] block drbd0: error receiving ReportState, l: 4!
[ 13.441111] block drbd0: asender terminated
[ 13.441115] block drbd0: Terminating drbd0_asender
[ 13.441133] block drbd0: Connection closed
[ 13.441137] block drbd0: conn( Disconnecting -> StandAlone )
[ 13.441151] block drbd0: receiver terminated
[ 13.441153] block drbd0: Terminating drbd0_receiver

So somehow it notices that there has been a split-brain situation and
turns the receiver off.

I'm running DRBD on two identical machines, both with kernel 3.4.6 and
the linux-vserver patch vs2.3.3.6. DRBD is set up on top of the software
RAID1 device /dev/md2.

I don't know whether it matters, but when I start DRBD manually, I get:

Starting DRBD resources:
DRBD module version: 8.3.11
   userland version: 8.4.1
preferably kernel and userland versions should match.

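The manual recovery steps mentioned above (disconnect, demote to secondary, reconnect discarding local data, promote again) could be sketched roughly as follows for the rebooted node. This is only an illustration under a few assumptions: the resource name `home` is taken from the config in this post, while the `DRBDADM` and `RES` variables and the function name are made up for the example. The `-- --discard-my-data connect` form is the 8.3-style invocation; with other userland versions the option placement may differ, so drbdadm(8) should be checked first. By default the script is a dry run that only prints the commands.

```shell
#!/bin/sh
# Sketch only: manual split-brain recovery on the node whose changes are discarded.
# DRBDADM and RES are illustrative variables, not part of DRBD itself.
# Default is a dry run: the commands are printed, not executed.
DRBDADM="${DRBDADM:-echo drbdadm}"
RES="${RES:-home}"    # resource name from the config in this post

recover_split_brain_victim() {
    # Leave the current connection state.
    $DRBDADM disconnect "$RES" || return 1
    # Demote this node so its data can be overwritten.
    $DRBDADM secondary "$RES" || return 1
    # Reconnect and give up local modifications (8.3-style syntax, assumed).
    $DRBDADM -- --discard-my-data connect "$RES" || return 1
    # Promote again for dual-primary operation.
    $DRBDADM primary "$RES" || return 1
}

recover_split_brain_victim
```

To actually execute the commands instead of printing them, one would run the script with `DRBDADM=drbdadm` as root. If the other node is also in StandAlone, as the dmesg above suggests, it would presumably need a plain `drbdadm connect home` as well before the two can resynchronize.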
Configs are identical:

resource home {
    protocol C;
    meta-disk internal;
    device /dev/drbd0;
    disk /dev/md2;
    net {
        allow-two-primaries;
        after-sb-0pri discard-zero-changes;
        after-sb-1pri discard-secondary;
        after-sb-2pri disconnect;
    }
    startup {
        become-primary-on both;
    }
    on oscar {
        address 1.2.3.4:7789;
    }
    on papa {
        address 1.2.3.5:7789;
    }
}

DRBD main configuration:

global {
    usage-count yes;
}
common {
    handlers {
        pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
        pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
        local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
    }
    disk { on-io-error detach; } # Drop the disk on io error
    syncer {
        rate 50M; # Limit sync speed to 10 MByte/s for FastEthernet
    }
    net {
    }
}

-- 
Jacek Osiecki josiecki at silvercube.pl
Silvercube s.c.
ul. Makuszynskiego 4
31-752 Kraków
+48 (12) 684 21 00