Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Dear All,

I have two cluster nodes, for which I use DRBD 8.3 (compiled and installed by me as an rpm) as the shared block device. The two identical systems consist of kernel 2.6.18-92.1.22.el5.centos.plus, drbd-8.3.0-3 and drbd-km-2.6.18_92.1.22.el5.centos.plus-8.3.0-3.

On top of the DRBD resources I have LVM (clustered), and on top of that GFS2.

The issue is the following. I run both nodes as Primary/Primary. Various applications run on both nodes and write to the filesystem concurrently (though not to the same files or even directories). Randomly, but consistently (it happens at least once per day), I get the following errors.

On node 1:

........
drbd1: susp( 1 -> 0 )
drbd1: [drbd1_worker/4658] sock_sendmsg time expired, ko = 9
drbd1: [drbd1_worker/4658] sock_sendmsg time expired, ko = 8
drbd1: [drbd1_worker/4658] sock_sendmsg time expired, ko = 7
drbd1: [drbd1_worker/4658] sock_sendmsg time expired, ko = 6
drbd1: [drbd1_worker/4658] sock_sendmsg time expired, ko = 5
drbd1: [drbd1_worker/4658] sock_sendmsg time expired, ko = 4
drbd1: [drbd1_worker/4658] sock_sendmsg time expired, ko = 3
drbd1: [drbd1_worker/4658] sock_sendmsg time expired, ko = 2
drbd1: [drbd1_worker/4658] sock_sendmsg time expired, ko = 1
drbd1: peer( Primary -> Unknown ) conn( WFBitMapS -> Timeout ) pdsk( UpToDate -> DUnknown ) susp( 0 -> 1 )
drbd1: short sent ReportBitMap size=4096 sent=276
drbd1: short read expecting header on sock: r=-512
drbd1: asender terminated
drbd1: Terminating asender thread
drbd1: Connection closed
drbd1: helper command: /sbin/drbdadm fence-peer minor-1
drbd1: helper command: /sbin/drbdadm fence-peer minor-1 exit code 2 (0x200)
drbd1: fence-peer helper broken, returned 2
drbd1: Considering state change from bad state. Error would be: 'Refusing to be Primary while peer is not outdated'
drbd1:   old = { cs:Timeout ro:Primary/Unknown ds:UpToDate/DUnknown s--- }
drbd1:   new = { cs:Unconnected ro:Primary/Unknown ds:UpToDate/DUnknown s--- }
drbd1: conn( Timeout -> Unconnected )
drbd1: receiver terminated
drbd1: Restarting receiver thread
drbd1: receiver (re)started
drbd1: Considering state change from bad state. Error would be: 'Refusing to be Primary while peer is not outdated'
drbd1:   old = { cs:Unconnected ro:Primary/Unknown ds:UpToDate/DUnknown s--- }
drbd1:   new = { cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown s--- }
drbd1: conn( Unconnected -> WFConnection )
...............

[root@tweety-1 ~]# drbdadm status
<drbd-status version="8.3.0" api="88">
<resources config_file="/etc/drbd.conf">
<resource minor="0" name="r0" cs="Connected" ro1="Primary" ro2="Primary" ds1="UpToDate" ds2="UpToDate" />
<resource minor="1" name="r1" cs="WFConnection" ro1="Primary" ro2="Unknown" ds1="UpToDate" ds2="DUnknown" suspended />
</resources>
</drbd-status>
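For what it is worth, my reading of the "ko" countdown above, given my net settings (timeout 60, ko-count 10), is the following back-of-the-envelope arithmetic; this is only my interpretation of the settings, not something taken from the DRBD documentation:

    # timeout  60   ->  6.0 s per unacknowledged send attempt (unit is 0.1 s)
    # ko-count 10   ->  counter shown counting down as ko = 9 ... ko = 1
    # so roughly 10 x 6 s = 60 s of unanswered sends before the peer is
    # declared dead and the connection drops from WFBitMapS to Timeout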
On node 2:

..........
drbd1: sock was reset by peer
drbd1: peer( Primary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown ) susp( 0 -> 1 )
drbd1: short read expecting header on sock: r=-104
drbd1: meta connection shut down by peer.
drbd1: asender terminated
drbd1: Terminating asender thread
drbd1: Creating new current UUID
drbd1: Connection closed
drbd1: helper command: /sbin/drbdadm fence-peer minor-1
drbd1: helper command: /sbin/drbdadm fence-peer minor-1 exit code 2 (0x200)
drbd1: fence-peer helper broken, returned 2
drbd1: Considering state change from bad state. Error would be: 'Refusing to be Primary while peer is not outdated'
drbd1:   old = { cs:BrokenPipe ro:Primary/Unknown ds:UpToDate/DUnknown s--- }
drbd1:   new = { cs:Unconnected ro:Primary/Unknown ds:UpToDate/DUnknown s--- }
drbd1: conn( BrokenPipe -> Unconnected )
drbd1: receiver terminated
drbd1: Restarting receiver thread
drbd1: receiver (re)started
drbd1: Considering state change from bad state. Error would be: 'Refusing to be Primary while peer is not outdated'
drbd1:   old = { cs:Unconnected ro:Primary/Unknown ds:UpToDate/DUnknown s--- }
drbd1:   new = { cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown s--- }
drbd1: conn( Unconnected -> WFConnection )
drbd1: Handshake successful: Agreed network protocol version 89
drbd1: Peer authenticated using 20 bytes of 'sha1' HMAC
drbd1: Considering state change from bad state. Error would be: 'Refusing to be Primary while peer is not outdated'
drbd1:   old = { cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown s--- }
drbd1:   new = { cs:WFReportParams ro:Primary/Unknown ds:UpToDate/DUnknown s--- }
drbd1: conn( WFConnection -> WFReportParams )
drbd1: Starting asender thread (from drbd1_receiver [4858])
drbd1: data-integrity-alg: crc32c
drbd1: meta connection shut down by peer.
drbd1: conn( WFReportParams -> NetworkFailure )
drbd1: asender terminated
drbd1: Terminating asender thread
..........

[root@tweety-2 ~]# drbdadm status
<drbd-status version="8.3.0" api="88">
<resources config_file="/etc/drbd.conf">
<resource minor="0" name="r0" cs="Connected" ro1="Primary" ro2="Primary" ds1="UpToDate" ds2="UpToDate" />
<resource minor="1" name="r1" cs="NetworkFailure" ro1="Primary" ro2="Unknown" ds1="UpToDate" ds2="DUnknown" suspended />
</resources>
</drbd-status>

My drbd.conf is the following:

global {
    # minor-count 64;
    # dialog-refresh 5; # 5 seconds
    # disable-ip-verification;
    usage-count yes;
}

common {
    protocol C;

    syncer {
        rate 100M;
        al-extents 257;
    }

    handlers {
        pri-on-incon-degr "echo b > /proc/sysrq-trigger ; reboot -f";
        pri-lost-after-sb "echo b > /proc/sysrq-trigger ; reboot -f";
        local-io-error "echo o > /proc/sysrq-trigger ; halt -f";
        outdate-peer "/sbin/obliterate";
        pri-lost "echo pri-lost. Have a look at the log files. | mail -s 'DRBD Alert' root; echo b > /proc/sysrq-trigger ; reboot -f";
        split-brain "echo split-brain. drbdadm -- --discard-my-data connect $DRBD_RESOURCE ? | mail -s 'DRBD Alert' root";
    }

    startup {
        wfc-timeout 100;
        degr-wfc-timeout 60;    # 1 minute.
        #wait-after-sb;
        become-primary-on both;
    }

    disk {
        #on-io-error pass-on;
        fencing resource-and-stonith;
    }

    net {
        sndbuf-size 512k;
        timeout 60;        # (unit = 0.1 seconds)
        connect-int 10;    # (unit = 1 second)
        ping-int 10;       # (unit = 1 second)
        ping-timeout 50;   # (unit = 0.1 seconds)
        max-buffers 2048;
        max-epoch-size 2048;
        ko-count 10;
        allow-two-primaries;
        cram-hmac-alg "sha1";
        shared-secret "tweety";
        after-sb-0pri discard-least-changes;
        after-sb-1pri violently-as0p;
        after-sb-2pri violently-as0p;
        rr-conflict call-pri-lost;
        data-integrity-alg "crc32c";
    }
}

resource r0 {
    device /dev/drbd0;
    disk /dev/hda4;
    meta-disk internal;
    on tweety-1 {
        address 10.254.254.253:7788;
    }
    on tweety-2 {
        address 10.254.254.254:7788;
    }
}

resource r1 {
    device /dev/drbd1;
    disk /dev/hdb4;
    meta-disk internal;
    on tweety-1 {
        address 10.254.254.253:7789;
    }
    on tweety-2 {
        address 10.254.254.254:7789;
    }
}

I have no idea what this is, and googling did not help. Obviously this error renders the cluster useless.
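One thing I do notice in both logs is the "fence-peer helper broken, returned 2" line: my outdate-peer handler (/sbin/obliterate) apparently exits with code 2, which DRBD does not treat as a valid answer. As far as I understand it (and I may well be wrong), with fencing resource-and-stonith the handler is supposed to report back with one of the exit codes DRBD expects, for example 7 once the peer has really been shot, so that the frozen I/O can be resumed. Below is a rough, untested sketch of the kind of wrapper I have been experimenting with; the fence_node call, the DRBD_PEER variable and the fallback peer name are assumptions on my side, to be replaced by whatever the actual cluster fencing setup provides:

    #!/bin/bash
    # Hypothetical fence-peer wrapper (untested sketch), meant to be configured as
    #   outdate-peer "/usr/local/sbin/fence-peer-wrapper.sh";
    # DRBD exports DRBD_RESOURCE to the handler; DRBD_PEER is assumed here and may
    # not exist on 8.3, hence the hard-coded fallback.

    PEER="${DRBD_PEER:-tweety-2}"        # placeholder peer host name

    # Ask the cluster fencing agent to power-fence the peer. fence_node is what
    # the obliterate script uses on a cman cluster; substitute as appropriate.
    if fence_node "$PEER"; then
        # Exit code 7 tells DRBD the peer has been stonithed, so it can stop
        # refusing to be Primary and resume the suspended I/O.
        exit 7
    fi

    # Fencing failed: return a non-7 code so the failure is visible in the logs
    # (DRBD will still report the helper as broken for codes it does not know).
    exit 1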
The processes get stuck ("demonized"), and since no fencing is performed (is that related to the above errors?), manual intervention is needed.

Could someone be kind enough to share their knowledge with me on what the problem is, what might cause it, and how to solve it?

Thank you all for your time.

Theophanis Kontogiannis
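P.S. For completeness, the manual intervention I currently resort to looks roughly like the following. This is from memory and should be read as a sketch of what I do rather than a recommended recovery procedure, and it assumes the peer is in fact still alive and only the r1 connection is wedged:

    # on the node whose r1 sits in WFConnection / NetworkFailure with I/O suspended
    drbdadm connect r1        # retry the network connection to the peer
    drbdadm resume-io r1      # lift the I/O freeze imposed by the fencing policy
    cat /proc/drbd            # check that r1 is back to Connected Primary/Primary UpToDate/UpToDate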