Hello all,

I have RHEL 5.3 (CentOS), drbd 8.3.1 and the 2.6.18-164.el5xen kernel. I have set up a two-node cluster running, among other services, BOINC on both nodes.

For some reason BOINC always makes both nodes crash or get fenced after a few hours of operation. However, the issue is that ALWAYS after reboot, drbd (which runs the cluster storage) does not recover.

This is the log from node 1:

...............
Oct 11 11:59:42 localhost kernel: drbd2: conn( WFBitMapT -> WFSyncUUID )
Oct 11 11:59:42 localhost clurgmgrd: [4307]: <info> Executing /etc/init.d/drbd status
Oct 11 11:59:42 localhost kernel: drbd2: helper command: /sbin/drbdadm before-resync-target minor-2
Oct 11 11:59:42 localhost kernel: drbd2: helper command: /sbin/drbdadm before-resync-target minor-2 exit code 0 (0x0)
Oct 11 11:59:42 localhost kernel: drbd2: conn( WFSyncUUID -> SyncTarget ) disk( Outdated -> Inconsistent )
Oct 11 11:59:42 localhost kernel: drbd2: Began resync as SyncTarget (will sync 2031616 KB [507904 bits set]).
Oct 11 11:59:43 localhost kernel: drbd1: peer( Secondary -> Primary )
Oct 11 11:59:43 localhost kernel: drbd2: peer( Secondary -> Primary )
Oct 11 11:59:43 localhost kernel: drbd0: role( Secondary -> Primary )
Oct 11 11:59:43 localhost kernel: drbd1: role( Secondary -> Primary )
Oct 11 11:59:44 localhost kernel: drbd2: role( Secondary -> Primary )
Oct 11 11:59:44 localhost kernel: drbd2: Resync done (total 45 sec; paused 0 sec; 45144 K/sec)
Oct 11 11:59:44 localhost kernel: drbd2: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate )
Oct 11 11:59:45 localhost kernel: drbd2: helper command: /sbin/drbdadm after-resync-target minor-2
Oct 11 11:59:46 localhost kernel: drbd2: helper command: /sbin/drbdadm after-resync-target minor-2 exit code 0 (0x0)
Oct 11 12:00:08 localhost kernel: drbd0: Resync done (total 84 sec; paused 0 sec; 12824 K/sec)
Oct 11 12:00:08 localhost kernel: drbd0: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate )
Oct 11 12:00:08 localhost kernel: drbd0: helper command: /sbin/drbdadm after-resync-target minor-0
Oct 11 12:00:08 localhost kernel: drbd0: helper command: /sbin/drbdadm after-resync-target minor-0 exit code 0 (0x0)
Oct 11 12:00:13 localhost clurgmgrd: [4307]: <info> Executing /etc/init.d/drbd status
Oct 11 12:00:43 localhost clurgmgrd: [4307]: <info> Executing /etc/init.d/drbd status
Oct 11 12:01:13 localhost clurgmgrd: [4307]: <info> Executing /etc/init.d/drbd status
Oct 11 12:01:33 localhost kernel: drbd0: peer( Primary -> Secondary )
Oct 11 12:01:33 localhost kernel: drbd0: State change failed: Refusing to be Primary while peer is not outdated
Oct 11 12:01:33 localhost kernel: drbd0: state = { cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate r--- }
Oct 11 12:01:33 localhost kernel: drbd0: wanted = { cs:TearDown ro:Primary/Unknown ds:UpToDate/DUnknown s--- }
Oct 11 12:01:33 localhost kernel: drbd0: peer( Secondary -> Unknown ) conn( Connected -> TearDown ) pdsk( UpToDate -> Outdated )
Oct 11 12:01:33 localhost kernel: drbd0: Creating new current UUID
Oct 11 12:01:33 localhost kernel: drbd0: meta connection shut down by peer.
Oct 11 12:01:33 localhost kernel: drbd0: asender terminated
Oct 11 12:01:33 localhost kernel: drbd0: Terminating asender thread
Oct 11 12:01:34 localhost kernel: drbd0: Connection closed
Oct 11 12:01:34 localhost kernel: drbd0: conn( TearDown -> Unconnected )
Oct 11 12:01:34 localhost kernel: drbd0: receiver terminated
Oct 11 12:01:34 localhost kernel: drbd0: Restarting receiver thread
Oct 11 12:01:34 localhost kernel: drbd0: receiver (re)started
Oct 11 12:01:34 localhost kernel: drbd0: conn( Unconnected -> WFConnection )
Oct 11 12:01:34 localhost kernel: drbd1: peer( Primary -> Secondary )
Oct 11 12:01:34 localhost kernel: drbd1: State change failed: Refusing to be Primary without at least one UpToDate disk
Oct 11 12:01:34 localhost kernel: drbd1: state = { cs:SyncTarget ro:Primary/Secondary ds:Inconsistent/UpToDate r--- }
Oct 11 12:01:34 localhost kernel: drbd1: wanted = { cs:TearDown ro:Primary/Unknown ds:Inconsistent/DUnknown s--- }
Oct 11 12:01:34 localhost kernel: drbd1: State change failed: Refusing to be Primary without at least one UpToDate disk
Oct 11 12:01:34 localhost kernel: drbd1: state = { cs:SyncTarget ro:Primary/Secondary ds:Inconsistent/UpToDate r--- }
Oct 11 12:01:34 localhost kernel: drbd1: wanted = { cs:TearDown ro:Primary/Unknown ds:Inconsistent/Outdated r--- }
Oct 11 12:01:34 localhost kernel: drbd2: peer( Primary -> Secondary )
Oct 11 12:01:34 localhost kernel: drbd2: State change failed: Refusing to be Primary while peer is not outdated
Oct 11 12:01:34 localhost kernel: drbd2: state = { cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate r--- }
Oct 11 12:01:34 localhost kernel: drbd2: wanted = { cs:TearDown ro:Primary/Unknown ds:UpToDate/DUnknown s--- }
Oct 11 12:01:34 localhost kernel: drbd2: peer( Secondary -> Unknown ) conn( Connected -> TearDown ) pdsk( UpToDate -> Outdated )
Oct 11 12:01:34 localhost kernel: drbd2: Creating new current UUID
Oct 11 12:01:34 localhost kernel: drbd2: meta connection shut down by peer.
Oct 11 12:01:34 localhost kernel: drbd2: asender terminated
Oct 11 12:01:34 localhost kernel: drbd2: Terminating asender thread
Oct 11 12:01:34 localhost kernel: drbd2: Connection closed
Oct 11 12:01:34 localhost kernel: drbd2: conn( TearDown -> Unconnected )
Oct 11 12:01:34 localhost kernel: drbd2: receiver terminated
Oct 11 12:01:34 localhost kernel: drbd2: Restarting receiver thread
Oct 11 12:01:34 localhost kernel: drbd2: receiver (re)started
Oct 11 12:01:34 localhost kernel: drbd2: conn( Unconnected -> WFConnection )
Oct 11 12:01:38 localhost kernel: drbd1: State change failed: Refusing to be Primary without at least one UpToDate disk
Oct 11 12:01:38 localhost kernel: drbd1: state = { cs:SyncTarget ro:Primary/Secondary ds:Inconsistent/UpToDate r--- }
Oct 11 12:01:38 localhost kernel: drbd1: wanted = { cs:TearDown ro:Primary/Unknown ds:Inconsistent/DUnknown s--- }
Oct 11 12:01:38 localhost kernel: drbd1: State change failed: Refusing to be Primary without at least one UpToDate disk
Oct 11 12:01:38 localhost kernel: drbd1: state = { cs:SyncTarget ro:Primary/Secondary ds:Inconsistent/UpToDate r--- }
Oct 11 12:01:38 localhost kernel: drbd1: wanted = { cs:TearDown ro:Primary/Unknown ds:Inconsistent/Outdated r--- }

And this is the log from node 2:

.........
drbd1: conn( WFBitMapS -> SyncSource )
drbd1: Began resync as SyncSource (will sync 300165324 KB [75041331 bits set]).
drbd2: conn( WFBitMapS -> SyncSource ) pdsk( Outdated -> Inconsistent )
drbd2: Began resync as SyncSource (will sync 2031616 KB [507904 bits set]).
drbd1: role( Secondary -> Primary )
drbd2: role( Secondary -> Primary )
drbd0: peer( Secondary -> Primary )
drbd1: peer( Secondary -> Primary )
drbd2: peer( Secondary -> Primary )
virbr0: no IPv6 routers present
SELinux: initialized (dev tmpfs, type tmpfs), uses transition SIDs
drbd2: Resync done (total 45 sec; paused 0 sec; 45144 K/sec)
drbd2: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
eth1: too many iterations (16) in nv_nic_irq.
drbd0: Resync done (total 84 sec; paused 0 sec; 12824 K/sec)
drbd0: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
dlm: Using TCP for communications
dlm: got connection from 2
drbd0: role( Primary -> Secondary )
drbd0: Requested state change failed by peer: Refusing to be Primary while peer is not outdated (-7)
drbd0: peer( Primary -> Unknown ) conn( Connected -> Disconnecting ) disk( UpToDate -> Outdated ) pdsk( UpToDate -> DUnknown )
drbd0: short read expecting header on sock: r=-512
drbd0: asender terminated
drbd0: Terminating asender thread
drbd0: Connection closed
drbd0: conn( Disconnecting -> StandAlone )
drbd0: receiver terminated
drbd0: Terminating receiver thread
drbd0: disk( Outdated -> Diskless )
drbd0: drbd_bm_resize called with capacity == 0
drbd0: worker terminated
drbd0: Terminating worker thread
drbd1: role( Primary -> Secondary )
drbd1: Requested state change failed by peer: Refusing to be Primary without at least one UpToDate disk (-2)
drbd1: Requested state change failed by peer: Refusing to be Primary without at least one UpToDate disk (-2)
drbd1: State change failed: Refusing to be inconsistent on both nodes
drbd1: state = { cs:SyncSource ro:Secondary/Primary ds:UpToDate/Inconsistent r--- }
drbd1: wanted = { cs:SyncSource ro:Secondary/Primary ds:Diskless/Inconsistent r--- }
drbd2: role( Primary -> Secondary )
drbd2: Requested state change failed by peer: Refusing to be Primary while peer is not outdated (-7)
drbd2: peer( Primary -> Unknown ) conn( Connected -> Disconnecting ) disk( UpToDate -> Outdated ) pdsk( UpToDate -> DUnknown )
drbd2: short read expecting header on sock: r=-512
drbd2: asender terminated
drbd2: Terminating asender thread
drbd2: Connection closed
drbd2: conn( Disconnecting -> StandAlone )
drbd2: receiver terminated
drbd2: Terminating receiver thread
drbd2: disk( Outdated -> Diskless )
drbd2: drbd_bm_resize called with capacity == 0
drbd2: worker terminated
drbd2: Terminating worker thread
drbd1: Requested state change failed by peer: Refusing to be Primary without at least one UpToDate disk (-2)
drbd1: Requested state change failed by peer: Refusing to be Primary without at least one UpToDate disk (-2)
drbd1: State change failed: Refusing to be inconsistent on both nodes
drbd1: state = { cs:SyncSource ro:Secondary/Primary ds:UpToDate/Inconsistent r--- }
drbd1: wanted = { cs:SyncSource ro:Secondary/Primary ds:Diskless/Inconsistent r--- }

This is my config:

global {
    # minor-count 64;
    # dialog-refresh 5; # 5 seconds
    # disable-ip-verification;
    usage-count yes;
}

common {
    protocol C;

    syncer {
        rate 100M;
        al-extents 257;
    }

    handlers {
        pri-on-incon-degr "echo b > /proc/sysrq-trigger ; reboot -f";
        pri-lost-after-sb "echo b > /proc/sysrq-trigger ; reboot -f";
        local-io-error "echo o > /proc/sysrq-trigger ; halt -f";
        outdate-peer "/sbin/obliterate";
        pri-lost "echo pri-lost. Have a look at the log files. | mail -s 'DRBD Alert' root; echo b > /proc/sysrq-trigger ; reboot -f";
        split-brain "echo split-brain. drbdadm -- --discard-my-data connect $DRBD_RESOURCE ? | mail -s 'DRBD Alert' root";
    }

    startup {
        wfc-timeout 100;
        degr-wfc-timeout 60; # 1 minutes.
        become-primary-on both;
    }

    disk {
        fencing resource-and-stonith;
    }

    net {
        sndbuf-size 512k;
        timeout 60; # 6 seconds (unit = 0.1 seconds)
        connect-int 10; # 10 seconds (unit = 1 second)
        ping-int 10; # 10 seconds (unit = 1 second)
        ping-timeout 50; # 500 ms (unit = 0.1 seconds)
        max-buffers 2048;
        max-epoch-size 2048;
        ko-count 10;
        allow-two-primaries;
        cram-hmac-alg "sha1";
        shared-secret "*****";
        after-sb-0pri discard-least-changes;
        after-sb-1pri violently-as0p;
        after-sb-2pri violently-as0p;
        rr-conflict call-pri-lost;
        data-integrity-alg "crc32c";
    }
}

resource r0 {
    device /dev/drbd0;
    disk /dev/hda4;
    meta-disk internal;
    on tweety-1 { address 10.254.254.253:7788; }
    on tweety-2 { address 10.254.254.254:7788; }
}

resource r1 {
    device /dev/drbd1;
    disk /dev/hdb4;
    meta-disk internal;
    on tweety-1 { address 10.254.254.253:7789; }
    on tweety-2 { address 10.254.254.254:7789; }
}

resource r2 {
    device /dev/drbd2;
    disk /dev/sda1;
    meta-disk internal;
    on tweety-1 { address 10.254.254.253:7790; }
    on tweety-2 { address 10.254.254.254:7790; }
}

This is the current status of drbd (after the crash logged above):

version: 8.3.1 (api:88/proto:86-89)
GIT-hash: fd40f4a8f9104941537d1afc8521e584a6d3003c build by root at tweety-1, 2009-09-21 17:23:59
 0: cs:Unconfigured
 1: cs:SyncSource ro:Secondary/Primary ds:UpToDate/Inconsistent C r---
    ns:64823832 nr:0 dw:0 dr:64831176 al:0 bm:4089 lo:35 pe:83 ua:256 ap:0 ep:1 wo:b oos:235344204
        [===>................] sync'ed: 21.6% (229828/293128)M
        finish: 14:51:27 speed: 4,064 (45,264) K/sec
 2: cs:Unconfigured

Can someone give me some food for thought on what I have done wrong?

Thank you all for your time,
Theophanis Kontogiannis
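
P.S. In case it helps anyone compare the state of the three devices at a glance, here is a minimal Python sketch that pulls the cs/ro/ds fields out of status text like the one above. The function names (parse_drbd_status, unhealthy) are made up for illustration, and treating "Connected with UpToDate/UpToDate" as the only healthy state is my own assumption, not something taken from the DRBD tools.

```python
import re

# Sample text copied from the status output in this message.
SAMPLE = """\
version: 8.3.1 (api:88/proto:86-89)
 0: cs:Unconfigured
 1: cs:SyncSource ro:Secondary/Primary ds:UpToDate/Inconsistent C r---
    ns:64823832 nr:0 dw:0 dr:64831176 al:0 bm:4089 lo:35 pe:83 ua:256
 2: cs:Unconfigured
"""

# A device line starts with the minor number and a colon,
# followed by space-separated key:value fields.
DEVICE_LINE = re.compile(r"^\s*(\d+):\s+(.*)$")


def parse_drbd_status(text):
    """Return {minor: {field: value}} for each device line in the text."""
    devices = {}
    for line in text.splitlines():
        match = DEVICE_LINE.match(line)
        if not match:
            continue  # version line, counters, progress bar, etc.
        fields = {}
        for token in match.group(2).split():
            key, sep, value = token.partition(":")
            if sep:  # skip bare flags such as "C" or "r---"
                fields[key] = value
        devices[int(match.group(1))] = fields
    return devices


def unhealthy(devices):
    """List minors that are not Connected with both disks UpToDate (assumed rule)."""
    return [
        minor
        for minor, f in sorted(devices.items())
        if f.get("cs") != "Connected" or f.get("ds") != "UpToDate/UpToDate"
    ]


if __name__ == "__main__":
    devs = parse_drbd_status(SAMPLE)
    for minor in unhealthy(devs):
        print(minor, devs[minor].get("cs"), devs[minor].get("ds"))
```

On a live node the same function could be fed the contents of /proc/drbd instead of the pasted sample.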