Hi,<br><br>I've been trying to solve a strange issue with DRBD, but I can't work out what is wrong.<br><br>I have 6 servers (3 pairs) in the same datacenter, but one pair does not work properly.<br>All of them have the same config, Linux distribution and kernel. drbd-utils comes from the Debian lenny package (same version on all nodes).<br>
The DRBD devices are configured on top of LVM.<br><br>The two non-working nodes have the same network cards as the working nodes:<br><div style="margin-left: 40px;">Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 01)<br>
</div><div style="margin-left: 40px;">Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 03)<br></div><br>drbd.conf<br><br>skip {<br>}<br>global {<br> usage-count no;<br>
}<br>common {<br> syncer { <br> rate 33M; <br># verify-alg crc32c;<br> }<br>}<br>resource r0 {<br> protocol C;<br><br> handlers {<br># pri-on-incon-degr "echo o > /proc/sysrq-trigger ; ifconfig eth0 down";<br>
# pri-lost-after-sb "echo o > /proc/sysrq-trigger ; ifconfig eth0 down";<br> local-io-error "/sbin/drbd-io-error";<br># pri-lost "echo pri-lost. Have a look at the log files. | mail -s 'DRBD Alert' <a href="mailto:root@almiralabs.com">root@almiralabs.com</a>";<br>
# split-brain "echo split-brain. drbdadm -- --discard-my-data connect $DRBD_RESOURCE ? | mail -s 'DRBD Alert' <a href="mailto:root@almiralabs.com">root@almiralabs.com</a>";<br>}<br><br> startup {<br>
wfc-timeout 20;<br> degr-wfc-timeout 10;<br> }<br> disk {<br> on-io-error pass_on;<br># fencing resource-and-stonith;<br> }<br><br> net {<br> after-sb-0pri discard-younger-primary;<br> after-sb-1pri violently-as0p;<br>
after-sb-2pri violently-as0p;<br># rr-conflict disconnect;<br><br># sndbuf-size 512k;<br># max-buffers 2048;<br># max-epoch-size 2048;<br># ko-count 4;<br> cram-hmac-alg "sha1";<br>
shared-secret "xxxxx";<br> }<br> syncer {<br> rate 33M;<br> al-extents 257;<br># verify-alg crc32c;<br> }<br><br> on s5 {<br> device /dev/drbd0;<br># disk /dev/md4;<br> disk /dev/vg404/readhatAS5-hd2;<br>
 address 123.123.123.123:7788;<br> flexible-meta-disk internal;<br><br> }<br><br> on s6 {<br> device /dev/drbd0;<br># disk /dev/md4;<br> disk /dev/vg404/readhatAS5-hd2;<br>
<br> address 312.312.312.321:7788;<br> flexible-meta-disk internal;<br> }<br>}<br><br><br>/proc/drbd <br><br>version: 8.3.3rc2 (api:88/proto:86-91)<br>GIT-hash: 04b2f175d7076ef2e0dd7d5ba6f6843357a041ed build by <a href="mailto:root@ns360164.ovh.net">root@ns360164.ovh.net</a>, 2009-09-29 12:02:15<br>
0: cs:SyncSource ro:Secondary/Secondary ds:Inconsistent/Inconsistent C r----<br> ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:52427164<br> [>....................] sync'ed: 0.1% (51196/51196)M<br>
finish: 546:06:58 speed: 0 (0) K/sec<br>s6:~# cat /proc/drbd <br>version: 8.3.3rc2 (api:88/proto:86-91)<br>GIT-hash: 04b2f175d7076ef2e0dd7d5ba6f6843357a041ed build by <a href="mailto:root@ns360164.ovh.net">root@ns360164.ovh.net</a>, 2009-09-29 12:02:15<br>
0: cs:SyncSource ro:Secondary/Secondary ds:Inconsistent/Inconsistent C r----<br> ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:52427164<br> [>....................] sync'ed: 0.1% (51196/51196)M<br>
stalled<br><br><br>May 21 18:19:04 s6 kernel: block drbd0: Starting receiver thread (from drbd0_worker [17029])<br>May 21 18:19:04 s6 kernel: block drbd0: receiver (re)started<br>May 21 18:19:04 s6 kernel: block drbd0: conn( Unconnected -> WFConnection ) <br>
May 21 18:19:04 s6 kernel: block drbd0: Handshake successful: Agreed network protocol version 91<br>May 21 18:19:04 s6 kernel: block drbd0: Peer authenticated using 20 bytes of 'sha1' HMAC<br>May 21 18:19:04 s6 kernel: block drbd0: conn( WFConnection -> WFReportParams ) <br>
May 21 18:19:04 s6 kernel: block drbd0: Starting asender thread (from drbd0_receiver [17095])<br>May 21 18:19:04 s6 kernel: block drbd0: data-integrity-alg: <not-used><br>May 21 18:19:04 s6 kernel: block drbd0: drbd_sync_handshake:<br>
May 21 18:19:04 s6 kernel: block drbd0: self 99DAB77B149D94CC:0000000000000000:0000000000000000:0000000000000000 bits:13106791 flags:0<br>May 21 18:19:04 s6 kernel: block drbd0: peer DC7E66E5C2D873B7:99DAB77B149D94CD:0000000000000004:0000000000000000 bits:13106791 flags:0<br>
May 21 18:19:04 s6 kernel: block drbd0: uuid_compare()=-1 by rule 50<br>May 21 18:19:04 s6 kernel: block drbd0: Becoming sync target due to disk states.<br>May 21 18:19:04 s6 kernel: block drbd0: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )<br>
<br>May 21 18:19:05 s6 kernel: block drbd0: conn( WFBitMapT -> WFSyncUUID ) <br>May 21 18:19:05 s6 kernel: block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0<br>May 21 18:19:05 s6 kernel: block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0 exit code 3 (0x300)<br>
May 21 18:19:05 s6 kernel: block drbd0: before-resync-target handler returned 3, dropping connection.<br>May 21 18:19:05 s6 kernel: block drbd0: peer( Primary -> Unknown ) conn( WFSyncUUID -> Disconnecting ) pdsk( UpToDate -> DUnknown )<br>
<br>May 21 18:19:05 s6 kernel: block drbd0: asender terminated<br>May 21 18:19:05 s6 kernel: block drbd0: Terminating asender thread<br>May 21 18:19:05 s6 kernel: block drbd0: Connection closed<br>May 21 18:19:05 s6 kernel: block drbd0: conn( Disconnecting -> StandAlone ) <br>
May 21 18:19:05 s6 kernel: block drbd0: receiver terminated<br>May 21 18:19:05 s6 kernel: block drbd0: Terminating receiver thread<br>May 21 18:19:13 s6 kernel: block drbd0: disk( Inconsistent -> Diskless ) <br>May 21 18:19:13 s6 kernel: block drbd0: drbd_bm_resize called with capacity == 0<br>
May 21 18:19:13 s6 kernel: block drbd0: worker terminated<br>May 21 18:19:13 s6 kernel: block drbd0: Terminating worker thread<br><br><br>At other times I see no sync progress at all, because the connection fails with this output:<br><br><br>
May 21 17:44:22 s6 kernel: block drbd0: Starting worker thread (from cqueue [229])<br>
May 21 17:44:22 s6 kernel: block drbd0: disk( Diskless -> Attaching ) <br>
May 21 17:44:22 s6 kernel: block drbd0: No usable activity log found.<br>
May 21 17:44:22 s6 kernel: block drbd0: Method to ensure write ordering: barrier<br>
May 21 17:44:22 s6 kernel: block drbd0: Backing device's merge_bvec_fn() = ffffffff8083cb50<br>
May 21 17:44:22 s6 kernel: block drbd0: max_segment_size ( = BIO size ) = 4096<br>
May 21 17:44:22 s6 kernel: block drbd0: drbd_bm_resize called with capacity == 104854328<br>
May 21 17:44:22 s6 kernel: block drbd0: resync bitmap: bits=13106791 words=204794<br>
May 21 17:44:22 s6 kernel: block drbd0: size = 50 GB (52427164 KB)<br>
May 21 17:44:22 s6 kernel: block drbd0: Writing the whole bitmap, size changed<br>
May 21 17:44:22 s6 kernel: block drbd0: 50 GB (13106791 bits) marked out-of-sync by on disk bit-map.<br>
May 21 17:44:22 s6 kernel: block drbd0: recounting of set bits took additional 0 jiffies<br>
May 21 17:44:22 s6 kernel: block drbd0: 50 GB (13106791 bits) marked out-of-sync by on disk bit-map.<br>
May 21 17:44:22 s6 kernel: block drbd0: disk( Attaching -> Inconsistent ) <br>
May 21 17:45:30 s6 kernel: block drbd0: conn( StandAlone -> Unconnected ) <br>
May 21 17:45:30 s6 kernel: block drbd0: Starting receiver thread (from drbd0_worker [11346])<br>
May 21 17:45:30 s6 kernel: block drbd0: receiver (re)started<br>
May 21 17:45:30 s6 kernel: block drbd0: conn( Unconnected -> WFConnection ) <br>
May 21 17:45:31 s6 kernel: block drbd0: Handshake successful: Agreed network protocol version 91<br>
May 21 17:45:31 s6 kernel: block drbd0: Peer authenticated using 20 bytes of 'sha1' HMAC<br>
May 21 17:45:31 s6 kernel: block drbd0: conn( WFConnection -> WFReportParams ) <br>
May 21 17:45:31 s6 kernel: block drbd0: Starting asender thread (from drbd0_receiver [11772])<br>
May 21 17:45:31 s6 kernel: block drbd0: data-integrity-alg: <not-used><br>
May 21 17:45:31 s6 kernel: block drbd0: drbd_sync_handshake:<br>
May 21 17:45:31 s6 kernel: block drbd0: self
0000000000000004:0000000000000000:0000000000000000:0000000000000000
bits:13106791 flags:0<br>
May 21 17:45:31 s6 kernel: block drbd0: peer
EEA95651B1BD8559:0000000000000004:0000000000000000:0000000000000000
bits:13106791 flags:0<br>
May 21 17:45:31 s6 kernel: block drbd0: uuid_compare()=-2 by rule 20<br>
May 21 17:45:31 s6 kernel: block drbd0: Becoming sync target due to disk states.<br>
May 21 17:45:31 s6 kernel: block drbd0: Writing the whole bitmap, full sync required after drbd_sync_handshake.<br>
May 21 17:45:31 s6 kernel: block drbd0: 50 GB (13106791 bits) marked out-of-sync by on disk bit-map.<br>
May 21 17:45:31 s6 kernel: block drbd0: peer( Unknown -> Primary )
conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )<br>
May 21 17:45:31 s6 kernel: block drbd0: conn( WFBitMapT -> WFSyncUUID ) <br>
May 21 17:45:31 s6 kernel: block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0<br>
May 21 17:45:31 s6 kernel: block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0 exit code 3 (0x300)<br>
May 21 17:45:31 s6 kernel: block drbd0: before-resync-target handler returned 3, dropping connection.<br>
May 21 17:45:31 s6 kernel: block drbd0: peer( Primary -> Unknown )
conn( WFSyncUUID -> Disconnecting ) pdsk( UpToDate -> DUnknown )<br>
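<br>Both logs above end the same way: a helper command exits non-zero and the connection is dropped. To compare this pair against the working pairs, the failed helper invocations can be pulled out of the kernel log with a small filter. This is only a sketch: the function name <code>failed_drbd_helpers</code> is made up for illustration, it assumes the log lines look exactly like the ones pasted above, and the log path in the usage comment is an assumption that depends on the syslog setup.<br><br>

```shell
#!/bin/sh
# Hypothetical helper (not part of drbd-utils): reads syslog text on stdin
# and prints one line "<device> <handler> <exit code>" for every distinct
# DRBD helper command that the kernel reported with a non-zero exit code.
failed_drbd_helpers() {
    # Only the "... exit code N (0x...)" result lines match; the plain
    # "helper command: ..." invocation lines are skipped by sed -n.
    sed -n 's/.*block \(drbd[0-9]*\): helper command: [^ ]* \([^ ]*\) [^ ]* exit code \([0-9]*\) .*/\1 \2 \3/p' |
        awk '$3 != 0' |   # drop helpers that exited cleanly
        sort -u           # collapse repeated failures into one line each
}

# Usage (assumed log location, adjust to the local syslog configuration):
#   failed_drbd_helpers < /var/log/kern.log
```

Fed the log lines shown above, the filter reduces them to the device, the handler name and its exit code, which is easier to diff between a broken node and a working one than the full log.<br>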
<br><br>I'm a bit puzzled by this, because the other nodes work well, so I don't know whether it could be a failing network card or disk.<br><br>Thanks!<br><br>