Hi all,

I am in a very weird situation. I have four DRBD devices set up in a primary/primary configuration, and a custom upper layer (similar to Pacemaker) that takes care of failover.

In my test environment, the first time after the OS and the DRBD binaries are installed, I start some I/O from one of my servers and reboot that machine after 4 or 5 minutes while the I/O is still running. While the machine reboots, all I/O fails over to the partner server without any problem. When the rebooted machine comes back, split brain occurs, but only on the disk that had I/O in progress.

If I repeat the same experiment after the split brain has been resolved manually, DRBD somehow figures out the sync direction correctly according to my configuration.

Can someone help me interpret the syslog and give me a hint as to why split brain happens only the first time, but not after it has been resolved manually?

Here is my drbd.conf file:

common {
    protocol C;
}

#=#= 1
resource drbd1 {
    on f33 {
        device /dev/drbd1;
        #meta sd /dev/sdb
        disk /dev/disk/by-id/scsi-360030480003ae2e0151e54b20c1f82e0;
        address 192.168.250.1:7790;
        meta-disk internal;
    }
    on f34 {
        device /dev/drbd1;
        #meta sd /dev/sdb
        disk /dev/disk/by-id/scsi-360030480003ae32015095d1f11bf902b;
        address 192.168.250.2:7790;
        meta-disk internal;
    }
    net {
        allow-two-primaries;
        after-sb-0pri discard-least-changes;
        after-sb-1pri consensus;
        after-sb-2pri violently-as0p;
        rr-conflict violently;
        max-buffers 8000;
        max-epoch-size 8000;
        unplug-watermark 16;
        sndbuf-size 0;
    }
    syncer {
        rate 300M;
        verify-alg crc32c;
        al-extents 3800;
    }
    startup {
        become-primary-on both;
    }
    handlers {
        before-resync-target "/sbin/before_resync_target.sh";
        after-resync-target "/sbin/after_resync_target.sh";
    }
}
# some other drbd resource

Here is the related log from syslog:

Apr 11 16:16:07 f33 kernel: block drbd1: Starting worker thread (from cqueue/9 [356])
Apr 11 16:16:07 f33 kernel: block drbd1: disk( Diskless -> Attaching )
Apr 11 16:16:07 f33 kernel: block drbd1: Found 39 transactions (39 active extents) in activity log.
Apr 11 16:16:07 f33 kernel: block drbd1: Method to ensure write ordering: barrier
Apr 11 16:16:07 f33 kernel: block drbd1: max_segment_size ( = BIO size ) = 65536
Apr 11 16:16:07 f33 kernel: block drbd1: drbd_bm_resize called with capacity == 25164984
Apr 11 16:16:07 f33 kernel: block drbd1: resync bitmap: bits=3145623 words=49151
Apr 11 16:16:07 f33 kernel: block drbd1: size = 12 GB (12582492 KB)
Apr 11 16:16:07 f33 kernel: block drbd1: recounting of set bits took additional 0 jiffies
Apr 11 16:16:07 f33 kernel: block drbd1: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
Apr 11 16:16:07 f33 kernel: block drbd1: Marked additional 156 MB as out-of-sync based on AL.
Apr 11 16:16:07 f33 kernel: block drbd1: disk( Attaching -> UpToDate )
Apr 11 16:16:08 f33 kernel: block drbd1: conn( StandAlone -> Unconnected )
Apr 11 16:16:08 f33 kernel: block drbd1: Starting receiver thread (from drbd1_worker [4269])
Apr 11 16:16:08 f33 kernel: block drbd1: receiver (re)started
Apr 11 16:16:08 f33 kernel: block drbd1: conn( Unconnected -> WFConnection )
Apr 11 16:16:08 f33 kernel: block drbd1: Handshake successful: Agreed network protocol version 95
Apr 11 16:16:08 f33 kernel: block drbd1: conn( WFConnection -> WFReportParams )
Apr 11 16:16:08 f33 kernel: block drbd1: Starting asender thread (from drbd1_receiver [4359])
Apr 11 16:16:08 f33 kernel: block drbd1: data-integrity-alg: <not-used>
Apr 11 16:16:08 f33 kernel: block drbd1: max_segment_size ( = BIO size ) = 65536
Apr 11 16:16:08 f33 kernel: block drbd1: drbd_sync_handshake:
Apr 11 16:16:08 f33 kernel: block drbd1: self 264D92621CE57A74:CBC54463A29032C9:72FCD7718269F032:0000000000000004 bits:39936 flags:0
Apr 11 16:16:08 f33 kernel: block drbd1: peer 1CB5F82C62D88D81:CBC54463A29032C9:72FCD7718269F033:0000000000000004 bits:1 flags:0
Apr 11 16:16:08 f33 kernel: block drbd1: uuid_compare()=100 by rule 90
Apr 11 16:16:08 f33 kernel: block drbd1: helper command: /sbin/drbdadm initial-split-brain minor-1
Apr 11 16:16:08 f33 kernel: block drbd1: helper command: /sbin/drbdadm initial-split-brain minor-1 exit code 0 (0x0)
Apr 11 16:16:08 f33 kernel: block drbd1: Split-Brain detected but unresolved, dropping connection!
Apr 11 16:16:08 f33 kernel: block drbd1: helper command: /sbin/drbdadm split-brain minor-1
Apr 11 16:16:08 f33 kernel: block drbd1: meta connection shut down by peer.
Apr 11 16:16:08 f33 kernel: block drbd1: conn( WFReportParams -> NetworkFailure )
Apr 11 16:16:08 f33 kernel: block drbd1: asender terminated
Apr 11 16:16:08 f33 kernel: block drbd1: Terminating asender thread
Apr 11 16:16:08 f33 kernel: block drbd1: helper command: /sbin/drbdadm split-brain minor-1 exit code 0 (0x0)
Apr 11 16:16:08 f33 kernel: block drbd1: conn( NetworkFailure -> Disconnecting )
Apr 11 16:16:08 f33 kernel: block drbd1: error receiving ReportState, l: 4!
Apr 11 16:16:08 f33 kernel: block drbd1: Connection closed
Apr 11 16:16:08 f33 kernel: block drbd1: conn( Disconnecting -> StandAlone )
Apr 11 16:16:08 f33 kernel: block drbd1: receiver terminated
Apr 11 16:16:08 f33 kernel: block drbd1: Terminating receiver thread

Commit yourself to constant self-improvement
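
As a possible aid for reading the drbd_sync_handshake lines in the syslog excerpt, here is a minimal Python sketch that pulls out the "self" and "peer" UUID sets and compares them field by field. The field labels (current, bitmap, history1, history2) and the masking of the lowest bit before comparing are assumptions made for this sketch, not something the log states. The 4 KB-per-bit figure is inferred from the log itself (bits=3145623 corresponds to 12582492 KB).

```python
import re

# The two handshake lines, copied verbatim from the syslog excerpt.
LOG = """\
Apr 11 16:16:08 f33 kernel: block drbd1: self 264D92621CE57A74:CBC54463A29032C9:72FCD7718269F032:0000000000000004 bits:39936 flags:0
Apr 11 16:16:08 f33 kernel: block drbd1: peer 1CB5F82C62D88D81:CBC54463A29032C9:72FCD7718269F033:0000000000000004 bits:1 flags:0
"""

# Assumed meaning of the four colon-separated values (labels are this sketch's own).
FIELDS = ("current", "bitmap", "history1", "history2")

UUID_LINE = re.compile(
    r"\b(self|peer) ([0-9A-F]{16}(?::[0-9A-F]{16}){3}) bits:(\d+) flags:(\d+)"
)

KB_PER_BIT = 4  # inferred from the log: 3145623 bits <-> 12582492 KB


def parse_handshake(log):
    """Return {'self': {...}, 'peer': {...}} with parsed UUIDs and bit counts."""
    nodes = {}
    for match in UUID_LINE.finditer(log):
        who, uuids, bits, _flags = match.groups()
        values = [int(u, 16) for u in uuids.split(":")]
        nodes[who] = {"uuids": dict(zip(FIELDS, values)), "bits": int(bits)}
    return nodes


def same_uuid(a, b):
    # Assumption: the lowest bit is a flag rather than part of the identity,
    # so it is ignored when comparing.
    return (a & ~1) == (b & ~1)


def compare(nodes):
    """Report, per field, whether self and peer carry the same UUID."""
    mine, theirs = nodes["self"]["uuids"], nodes["peer"]["uuids"]
    return {name: same_uuid(mine[name], theirs[name]) for name in FIELDS}


def out_of_sync_mb(nodes, who):
    """Convert the out-of-sync bit count of one node to whole megabytes."""
    return nodes[who]["bits"] * KB_PER_BIT // 1024


if __name__ == "__main__":
    nodes = parse_handshake(LOG)
    for name, equal in compare(nodes).items():
        print(f"{name:9s} {'same' if equal else 'DIFFERENT'}")
    print("self out-of-sync MB:", out_of_sync_mb(nodes, "self"))
```

On these two lines, only the first field differs between the nodes, while the second field is identical on both sides. If the assumed labels are right, that would mean each node generated its own new current UUID from the same starting point, which seems consistent with the "uuid_compare()=100 by rule 90" result and the split-brain message that follow in the log. The 39936 bits on "self" work out to 156 MB, matching the "Marked additional 156 MB as out-of-sync based on AL" line.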