Hi,

I've been trying to solve a strange issue with DRBD, but I can't figure it out. I have 6 servers (3 pairs) in the same datacenter, but one pair refuses to work properly. All of them have the same configuration, Linux distribution and kernel. drbd-utils come from the Debian lenny package (same version on all nodes). The DRBD partitions sit on top of LVM. The two non-working nodes have the same network cards as the working nodes:

Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 01)
Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 03)

drbd.conf:

skip {
}

global {
    usage-count no;
}

common {
    syncer {
        rate 33M;
        # verify-alg crc32c;
    }
}

resource r0 {
    protocol C;

    handlers {
        # pri-on-incon-degr "echo o > /proc/sysrq-trigger ; ifconfig eth0 down";
        # pri-lost-after-sb "echo o > /proc/sysrq-trigger ; ifconfig eth0 down";
        local-io-error "/sbin/drbd-io-error";
        # pri-lost "echo pri-lost. Have a look at the log files. | mail -s 'DRBD Alert' root at almiralabs.com";
        # split-brain "echo split-brain. drbdadm -- --discard-my-data connect $DRBD_RESOURCE | mail -s 'DRBD Alert' root at almiralabs.com";
    }

    startup {
        wfc-timeout 20;
        degr-wfc-timeout 10;
    }

    disk {
        on-io-error pass_on;
        # fencing resource-and-stonith;
    }

    net {
        after-sb-0pri discard-younger-primary;
        after-sb-1pri violently-as0p;
        after-sb-2pri violently-as0p;
        # rr-conflict disconnect;
        # sndbuf-size 512k;
        # max-buffers 2048;
        # max-epoch-size 2048;
        # ko-count 4;
        cram-hmac-alg "sha1";
        shared-secret "xxxxx";
    }

    syncer {
        rate 33M;
        al-extents 257;
        # verify-alg crc32c;
    }

    on s5 {
        device /dev/drbd0;
        # disk /dev/md4;
        disk /dev/vg404/readhatAS5-hd2;
        address 123.123.123.123:7788;
        flexible-meta-disk internal;
    }

    on s6 {
        device /dev/drbd0;
        # disk /dev/md4;
        disk /dev/vg404/readhatAS5-hd2;
        address 312.312.312.321:7788;
        flexible-meta-disk internal;
    }
}

/proc/drbd:

version: 8.3.3rc2 (api:88/proto:86-91)
GIT-hash: 04b2f175d7076ef2e0dd7d5ba6f6843357a041ed build by root at ns360164.ovh.net, 2009-09-29 12:02:15
 0: cs:SyncSource ro:Secondary/Secondary ds:Inconsistent/Inconsistent C r----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:52427164
        [>....................] sync'ed: 0.1% (51196/51196)M
        finish: 546:06:58 speed: 0 (0) K/sec

s6:~# cat /proc/drbd
version: 8.3.3rc2 (api:88/proto:86-91)
GIT-hash: 04b2f175d7076ef2e0dd7d5ba6f6843357a041ed build by root at ns360164.ovh.net, 2009-09-29 12:02:15
 0: cs:SyncSource ro:Secondary/Secondary ds:Inconsistent/Inconsistent C r----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:52427164
        [>....................] sync'ed: 0.1% (51196/51196)M
        stalled

Kernel log on s6:

May 21 18:19:04 s6 kernel: block drbd0: Starting receiver thread (from drbd0_worker [17029])
May 21 18:19:04 s6 kernel: block drbd0: receiver (re)started
May 21 18:19:04 s6 kernel: block drbd0: conn( Unconnected -> WFConnection )
May 21 18:19:04 s6 kernel: block drbd0: Handshake successful: Agreed network protocol version 91
May 21 18:19:04 s6 kernel: block drbd0: Peer authenticated using 20 bytes of 'sha1' HMAC
May 21 18:19:04 s6 kernel: block drbd0: conn( WFConnection -> WFReportParams )
May 21 18:19:04 s6 kernel: block drbd0: Starting asender thread (from drbd0_receiver [17095])
May 21 18:19:04 s6 kernel: block drbd0: data-integrity-alg: <not-used>
May 21 18:19:04 s6 kernel: block drbd0: drbd_sync_handshake:
May 21 18:19:04 s6 kernel: block drbd0: self 99DAB77B149D94CC:0000000000000000:0000000000000000:0000000000000000 bits:13106791 flags:0
May 21 18:19:04 s6 kernel: block drbd0: peer DC7E66E5C2D873B7:99DAB77B149D94CD:0000000000000004:0000000000000000 bits:13106791 flags:0
May 21 18:19:04 s6 kernel: block drbd0: uuid_compare()=-1 by rule 50
May 21 18:19:04 s6 kernel: block drbd0: Becoming sync target due to disk states.
May 21 18:19:04 s6 kernel: block drbd0: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
May 21 18:19:05 s6 kernel: block drbd0: conn( WFBitMapT -> WFSyncUUID )
May 21 18:19:05 s6 kernel: block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0
May 21 18:19:05 s6 kernel: block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0 exit code 3 (0x300)
May 21 18:19:05 s6 kernel: block drbd0: before-resync-target handler returned 3, dropping connection.
May 21 18:19:05 s6 kernel: block drbd0: peer( Primary -> Unknown ) conn( WFSyncUUID -> Disconnecting ) pdsk( UpToDate -> DUnknown )
May 21 18:19:05 s6 kernel: block drbd0: asender terminated
May 21 18:19:05 s6 kernel: block drbd0: Terminating asender thread
May 21 18:19:05 s6 kernel: block drbd0: Connection closed
May 21 18:19:05 s6 kernel: block drbd0: conn( Disconnecting -> StandAlone )
May 21 18:19:05 s6 kernel: block drbd0: receiver terminated
May 21 18:19:05 s6 kernel: block drbd0: Terminating receiver thread
May 21 18:19:13 s6 kernel: block drbd0: disk( Inconsistent -> Diskless )
May 21 18:19:13 s6 kernel: block drbd0: drbd_bm_resize called with capacity == 0
May 21 18:19:13 s6 kernel: block drbd0: worker terminated
May 21 18:19:13 s6 kernel: block drbd0: Terminating worker thread

Other times I don't even get to see any sync progress, because it fails with this output:

May 21 17:44:22 s6 kernel: block drbd0: Starting worker thread (from cqueue [229])
May 21 17:44:22 s6 kernel: block drbd0: disk( Diskless -> Attaching )
May 21 17:44:22 s6 kernel: block drbd0: No usable activity log found.
May 21 17:44:22 s6 kernel: block drbd0: Method to ensure write ordering: barrier
May 21 17:44:22 s6 kernel: block drbd0: Backing device's merge_bvec_fn() = ffffffff8083cb50
May 21 17:44:22 s6 kernel: block drbd0: max_segment_size ( = BIO size ) = 4096
May 21 17:44:22 s6 kernel: block drbd0: drbd_bm_resize called with capacity == 104854328
May 21 17:44:22 s6 kernel: block drbd0: resync bitmap: bits=13106791 words=204794
May 21 17:44:22 s6 kernel: block drbd0: size = 50 GB (52427164 KB)
May 21 17:44:22 s6 kernel: block drbd0: Writing the whole bitmap, size changed
May 21 17:44:22 s6 kernel: block drbd0: 50 GB (13106791 bits) marked out-of-sync by on disk bit-map.
May 21 17:44:22 s6 kernel: block drbd0: recounting of set bits took additional 0 jiffies
May 21 17:44:22 s6 kernel: block drbd0: 50 GB (13106791 bits) marked out-of-sync by on disk bit-map.
May 21 17:44:22 s6 kernel: block drbd0: disk( Attaching -> Inconsistent )
May 21 17:45:30 s6 kernel: block drbd0: conn( StandAlone -> Unconnected )
May 21 17:45:30 s6 kernel: block drbd0: Starting receiver thread (from drbd0_worker [11346])
May 21 17:45:30 s6 kernel: block drbd0: receiver (re)started
May 21 17:45:30 s6 kernel: block drbd0: conn( Unconnected -> WFConnection )
May 21 17:45:31 s6 kernel: block drbd0: Handshake successful: Agreed network protocol version 91
May 21 17:45:31 s6 kernel: block drbd0: Peer authenticated using 20 bytes of 'sha1' HMAC
May 21 17:45:31 s6 kernel: block drbd0: conn( WFConnection -> WFReportParams )
May 21 17:45:31 s6 kernel: block drbd0: Starting asender thread (from drbd0_receiver [11772])
May 21 17:45:31 s6 kernel: block drbd0: data-integrity-alg: <not-used>
May 21 17:45:31 s6 kernel: block drbd0: drbd_sync_handshake:
May 21 17:45:31 s6 kernel: block drbd0: self 0000000000000004:0000000000000000:0000000000000000:0000000000000000 bits:13106791 flags:0
May 21 17:45:31 s6 kernel: block drbd0: peer EEA95651B1BD8559:0000000000000004:0000000000000000:0000000000000000 bits:13106791 flags:0
May 21 17:45:31 s6 kernel: block drbd0: uuid_compare()=-2 by rule 20
May 21 17:45:31 s6 kernel: block drbd0: Becoming sync target due to disk states.
May 21 17:45:31 s6 kernel: block drbd0: Writing the whole bitmap, full sync required after drbd_sync_handshake.
May 21 17:45:31 s6 kernel: block drbd0: 50 GB (13106791 bits) marked out-of-sync by on disk bit-map.
May 21 17:45:31 s6 kernel: block drbd0: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
May 21 17:45:31 s6 kernel: block drbd0: conn( WFBitMapT -> WFSyncUUID )
May 21 17:45:31 s6 kernel: block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0
May 21 17:45:31 s6 kernel: block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0 exit code 3 (0x300)
May 21 17:45:31 s6 kernel: block drbd0: before-resync-target handler returned 3, dropping connection.
May 21 17:45:31 s6 kernel: block drbd0: peer( Primary -> Unknown ) conn( WFSyncUUID -> Disconnecting ) pdsk( UpToDate -> DUnknown )

I'm a bit puzzled by this, because the other nodes work fine, so I don't know whether it could be a network card failure or a disk failure.

Thanks!

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.linbit.com/pipermail/drbd-user/attachments/20100521/61952778/attachment.htm>
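In both s6 logs the resync is aborted at the same step: the kernel runs `/sbin/drbdadm before-resync-target minor-0`, that helper exits with code 3, and DRBD drops the connection. The Python sketch below scans kernel log lines for helper commands that exited non-zero, which makes it easy to check the other pairs' logs for the same pattern. It is not part of DRBD; the names `HELPER_EXIT_RE` and `helper_failures` are hypothetical, and the regex assumes the exact 8.3.3 log wording shown in this post. It also assumes the hex value in parentheses is a wait(2)-style status with the exit code in bits 8-15 (0x300 >> 8 = 3, matching the decimal code). It only locates the failing helper; it does not explain why drbdadm returned 3.

```python
import re
from typing import Iterable, Iterator, Tuple

# Hypothetical pattern for DRBD 8.3-style kernel lines such as:
#   "helper command: /sbin/drbdadm before-resync-target minor-0 exit code 3 (0x300)"
HELPER_EXIT_RE = re.compile(
    r"helper command: (?P<cmd>\S+) (?P<handler>[\w-]+) (?P<minor>minor-\d+) "
    r"exit code (?P<code>\d+) \((?P<raw>0x[0-9a-fA-F]+)\)"
)


def helper_failures(lines: Iterable[str]) -> Iterator[Tuple[str, str, int, int]]:
    """Yield (handler, minor, exit_code, decoded_raw) for each helper that exited non-zero.

    Assumes the hex value is a wait(2)-style status, so the exit code sits in
    bits 8-15; decoded_raw should then equal exit_code.
    """
    for line in lines:
        m = HELPER_EXIT_RE.search(line)
        if m is None:
            continue  # not a helper exit line (e.g. the plain "helper command:" start line)
        code = int(m.group("code"))
        decoded_raw = (int(m.group("raw"), 16) >> 8) & 0xFF
        if code != 0:
            yield m.group("handler"), m.group("minor"), code, decoded_raw


# Usage with two lines copied from the s6 log above:
sample = [
    "May 21 17:45:31 s6 kernel: block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0",
    "May 21 17:45:31 s6 kernel: block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0 exit code 3 (0x300)",
]
for failure in helper_failures(sample):
    print(failure)  # prints ('before-resync-target', 'minor-0', 3, 3)
```

Running the same function over the kernel logs of a working pair should yield nothing if their resyncs never hit a failing helper.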