Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Dear All,
I have two cluster nodes that use DRBD 8.3 (compiled and installed by me as an rpm) as the shared block device.
The two identical systems run kernel 2.6.18-92.1.22.el5.centos.plus with drbd-8.3.0-3 and drbd-km-2.6.18_92.1.22.el5.centos.plus-8.3.0-3.
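(For completeness, the packages were built from the 8.3.0 tarball roughly as follows; this is from memory, so the exact make targets may be slightly off:)

  tar xzf drbd-8.3.0.tar.gz && cd drbd-8.3.0
  make rpm      # builds the userland drbd package
  make km-rpm   # builds the kernel module package against the running kernel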
On top of the DRBD resources I have LVM (clustered), and on top of that GFS2; a rough sketch of the layering is below.
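For context, the stack is layered roughly like this (a simplified sketch; the VG/LV/cluster names, mount point and journal count are just placeholders, not the real ones):

  # the DRBD device is the PV for a clustered VG
  pvcreate /dev/drbd1
  vgcreate --clustered y vg_data /dev/drbd1
  lvcreate -n lv_data -l 100%FREE vg_data

  # GFS2 with dlm locking on top, one journal per node
  mkfs.gfs2 -p lock_dlm -t mycluster:data -j 2 /dev/vg_data/lv_data
  mount -t gfs2 /dev/vg_data/lv_data /data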
The issue is the following.
Both nodes run as Primary/Primary. Various applications run on both nodes and write to the filesystem concurrently (though never to the same files, or even the same directories). Randomly but consistently (it happens at least once per day), I get the following errors:
On node 1:
........
drbd1: susp( 1 -> 0 )
drbd1: [drbd1_worker/4658] sock_sendmsg time expired, ko = 9
drbd1: [drbd1_worker/4658] sock_sendmsg time expired, ko = 8
drbd1: [drbd1_worker/4658] sock_sendmsg time expired, ko = 7
drbd1: [drbd1_worker/4658] sock_sendmsg time expired, ko = 6
drbd1: [drbd1_worker/4658] sock_sendmsg time expired, ko = 5
drbd1: [drbd1_worker/4658] sock_sendmsg time expired, ko = 4
drbd1: [drbd1_worker/4658] sock_sendmsg time expired, ko = 3
drbd1: [drbd1_worker/4658] sock_sendmsg time expired, ko = 2
drbd1: [drbd1_worker/4658] sock_sendmsg time expired, ko = 1
drbd1: peer( Primary -> Unknown ) conn( WFBitMapS -> Timeout ) pdsk( UpToDate -> DUnknown ) susp( 0 -> 1 )
drbd1: short sent ReportBitMap size=4096 sent=276
drbd1: short read expecting header on sock: r=-512
drbd1: asender terminated
drbd1: Terminating asender thread
drbd1: Connection closed
drbd1: helper command: /sbin/drbdadm fence-peer minor-1
drbd1: helper command: /sbin/drbdadm fence-peer minor-1 exit code 2 (0x200)
drbd1: fence-peer helper broken, returned 2
drbd1: Considering state change from bad state. Error would be: 'Refusing to be Primary while peer is not outdated'
drbd1: old = { cs:Timeout ro:Primary/Unknown ds:UpToDate/DUnknown s--- }
drbd1: new = { cs:Unconnected ro:Primary/Unknown ds:UpToDate/DUnknown s--- }
drbd1: conn( Timeout -> Unconnected )
drbd1: receiver terminated
drbd1: Restarting receiver thread
drbd1: receiver (re)started
drbd1: Considering state change from bad state. Error would be: 'Refusing to be Primary while peer is not outdated'
drbd1: old = { cs:Unconnected ro:Primary/Unknown ds:UpToDate/DUnknown s--- }
drbd1: new = { cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown s--- }
drbd1: conn( Unconnected -> WFConnection )
...............
[root@tweety-1 ~]# drbdadm status
<drbd-status version="8.3.0" api="88">
<resources config_file="/etc/drbd.conf">
<resource minor="0" name="r0" cs="Connected" ro1="Primary" ro2="Primary"
ds1="UpToDate" ds2="UpToDate" />
<resource minor="1" name="r1" cs="WFConnection" ro1="Primary" ro2="Unknown"
ds1="UpToDate" ds2="DUnknown" suspended />
</resources>
</drbd-status>
On node 2:
..........
drbd1: sock was reset by peer
drbd1: peer( Primary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown ) susp( 0 -> 1 )
drbd1: short read expecting header on sock: r=-104
drbd1: meta connection shut down by peer.
drbd1: asender terminated
drbd1: Terminating asender thread
drbd1: Creating new current UUID
drbd1: Connection closed
drbd1: helper command: /sbin/drbdadm fence-peer minor-1
drbd1: helper command: /sbin/drbdadm fence-peer minor-1 exit code 2 (0x200)
drbd1: fence-peer helper broken, returned 2
drbd1: Considering state change from bad state. Error would be: 'Refusing to be Primary while peer is not outdated'
drbd1: old = { cs:BrokenPipe ro:Primary/Unknown ds:UpToDate/DUnknown s--- }
drbd1: new = { cs:Unconnected ro:Primary/Unknown ds:UpToDate/DUnknown s--- }
drbd1: conn( BrokenPipe -> Unconnected )
drbd1: receiver terminated
drbd1: Restarting receiver thread
drbd1: receiver (re)started
drbd1: Considering state change from bad state. Error would be: 'Refusing to be Primary while peer is not outdated'
drbd1: old = { cs:Unconnected ro:Primary/Unknown ds:UpToDate/DUnknown s--- }
drbd1: new = { cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown s--- }
drbd1: conn( Unconnected -> WFConnection )
drbd1: Handshake successful: Agreed network protocol version 89
drbd1: Peer authenticated using 20 bytes of 'sha1' HMAC
drbd1: Considering state change from bad state. Error would be: 'Refusing to be Primary while peer is not outdated'
drbd1: old = { cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown s--- }
drbd1: new = { cs:WFReportParams ro:Primary/Unknown ds:UpToDate/DUnknown s--- }
drbd1: conn( WFConnection -> WFReportParams )
drbd1: Starting asender thread (from drbd1_receiver [4858])
drbd1: data-integrity-alg: crc32c
drbd1: meta connection shut down by peer.
drbd1: conn( WFReportParams -> NetworkFailure )
drbd1: asender terminated
drbd1: Terminating asender thread
..........
[root@tweety-2 ~]# drbdadm status
<drbd-status version="8.3.0" api="88">
<resources config_file="/etc/drbd.conf">
<resource minor="0" name="r0" cs="Connected" ro1="Primary" ro2="Primary"
ds1="UpToDate" ds2="UpToDate" />
<resource minor="1" name="r1" cs="NetworkFailure" ro1="Primary"
ro2="Unknown" ds1="UpToDate" ds2="DUnknown" suspended />
</resources>
</drbd-status>
My drbd.conf is the following:
global {
    # minor-count 64;
    # dialog-refresh 5;  # 5 seconds
    # disable-ip-verification;
    usage-count yes;
}

common {
    protocol C;

    syncer {
        rate       100M;
        al-extents 257;
    }

    handlers {
        pri-on-incon-degr "echo b > /proc/sysrq-trigger ; reboot -f";
        pri-lost-after-sb "echo b > /proc/sysrq-trigger ; reboot -f";
        local-io-error    "echo o > /proc/sysrq-trigger ; halt -f";
        outdate-peer      "/sbin/obliterate";
        pri-lost          "echo pri-lost. Have a look at the log files. | mail -s 'DRBD Alert' root; echo b > /proc/sysrq-trigger ; reboot -f";
        split-brain       "echo split-brain. drbdadm -- --discard-my-data connect $DRBD_RESOURCE ? | mail -s 'DRBD Alert' root";
    }

    startup {
        wfc-timeout       100;
        degr-wfc-timeout  60;   # 1 minute
        #wait-after-sb;
        become-primary-on both;
    }

    disk {
        #on-io-error pass-on;
        fencing resource-and-stonith;
    }

    net {
        sndbuf-size    512k;
        timeout        60;     # (unit = 0.1 seconds)
        connect-int    10;     # (unit = 1 second)
        ping-int       10;     # (unit = 1 second)
        ping-timeout   50;     # (unit = 0.1 seconds)
        max-buffers    2048;
        max-epoch-size 2048;
        ko-count       10;
        allow-two-primaries;
        cram-hmac-alg  "sha1";
        shared-secret  "tweety";
        after-sb-0pri  discard-least-changes;
        after-sb-1pri  violently-as0p;
        after-sb-2pri  violently-as0p;
        rr-conflict    call-pri-lost;
        data-integrity-alg "crc32c";
    }
}
resource r0 {
    device    /dev/drbd0;
    disk      /dev/hda4;
    meta-disk internal;
    on tweety-1 { address 10.254.254.253:7788; }
    on tweety-2 { address 10.254.254.254:7788; }
}

resource r1 {
    device    /dev/drbd1;
    disk      /dev/hdb4;
    meta-disk internal;
    on tweety-1 { address 10.254.254.253:7789; }
    on tweety-2 { address 10.254.254.254:7789; }
}
I have no idea what causes this, and googling did not help.
Obviously this error renders the cluster useless.
The processes get demonized, and since no fencing is performed (is that related to the errors above?), manual intervention is needed.
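One thing I notice in both logs is "fence-peer helper broken, returned 2". If I read the drbd.conf man page correctly, DRBD only accepts a handful of exit codes from the fence-peer handler (for example 4 = peer outdated, 5 = peer unreachable, 7 = peer fenced/STONITHed), and anything else is reported as a broken helper, which may be why it looks as if no fencing happens and IO stays suspended. A minimal wrapper sketch along those lines (purely illustrative; fence_node is the cman fencing command, the peer name is hard-coded here, and this is not what /sbin/obliterate actually does):

  #!/bin/bash
  # Illustrative fence-peer wrapper: fence the peer via cman and report the
  # result back to DRBD with an exit code it recognises.
  PEER=tweety-2        # would have to be determined per node in a real setup

  if fence_node "$PEER"; then
      exit 7           # peer successfully fenced (resource-and-stonith case)
  fi
  exit 1               # fencing failed; DRBD should keep IO suspended

If that is the case, would it be enough to have /sbin/obliterate (or a wrapper around it) translate its result into one of the recognised codes, and to use the fence-peer handler name (which, if I understand the 8.3 release notes correctly, replaces outdate-peer)?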
Could someone be kind enough to share their knowledge with me on what the problem is, what might cause it, and how to solve it?
Thank you All for your time.
Theophanis Kontogiannis