The situation:

drbd 0.7.10
Fedora Core 1
kernel 2.4.26

Osacon2 is the active member of the heartbeat-based cluster. A STONITH event is triggered by unplugging both heartbeat cables (including the drbd sync cable). The STONITH succeeds in failing over the cluster, and osacon1 becomes the active member. When osacon2 loads the drbd service during the boot process, osacon1 refuses the connection and goes into the StandAlone cstate. This leaves the cluster without redundancy. Fixing this requires running a drbdsetup /dev/drbd0 net command on osacon1 and restarting the drbd service on osacon2. Not a very automated process.

The question:

Why does drbd come up, error out, and get left in the StandAlone cstate? Shouldn't the state of drbd on osacon2 be Secondary as it loads, and therefore not cause an error when it tries to sync with osacon1? Is there some way to avoid this event in this scenario? It is completely reproducible and severely degrades the redundancy of this cluster.

If the drbd sync cable is unplugged and then plugged back in a minute later, there are no problems re-establishing the drbd connection. The problem only occurs if the Primary is rebooted and, before it comes back online, the other node becomes the Primary.

The following logs are from the event where the drbd service starts on osacon2 after the STONITH event. The configuration file follows the log entries.

*** Active cluster member
Feb 19 16:58:44 osacon1 kernel: drbd0: drbd0_receiver [3372]: cstate WFConnection --> WFReportParams
Feb 19 16:58:44 osacon1 kernel: drbd0: Handshake successful: DRBD Network Protocol version 74
Feb 19 16:58:44 osacon1 kernel: drbd0: Connection established.
Feb 19 16:58:44 osacon1 kernel: drbd0: I am(P): 1:00000002:00000001:00000009:00000003:10
Feb 19 16:58:44 osacon1 kernel: drbd0: Peer(S): 1:00000002:00000001:0000000a:00000002:10
Feb 19 16:58:44 osacon1 kernel: drbd0: Current Primary shall become sync TARGET! Aborting to prevent data corruption.
Feb 19 16:58:44 osacon1 kernel: drbd0: drbd0_receiver [3372]: cstate WFReportParams --> StandAlone
Feb 19 16:58:44 osacon1 kernel: drbd0: error receiving ReportParams, l: 72!
Feb 19 16:58:44 osacon1 kernel: drbd0: asender terminated
Feb 19 16:58:44 osacon1 kernel: drbd0: worker terminated
Feb 19 16:58:44 osacon1 kernel: drbd0: drbd0_receiver [3372]: cstate StandAlone --> StandAlone
Feb 19 16:58:44 osacon1 kernel: drbd0: Connection lost.
Feb 19 16:58:44 osacon1 kernel: drbd0: receiver terminated

*** Passive cluster member booting up after a STONITH event (was active before STONITH)
Feb 19 16:58:44 osacon2 kernel: drbd0: drbd0_receiver [1574]: cstate WFConnection --> WFReportParams
Feb 19 16:58:44 osacon2 kernel: drbd0: Handshake successful: DRBD Network Protocol version 74
Feb 19 16:58:44 osacon2 kernel: drbd0: Connection established.
Feb 19 16:58:44 osacon2 kernel: drbd0: I am(S): 1:00000002:00000001:0000000a:00000002:10
Feb 19 16:58:44 osacon2 kernel: drbd0: Peer(P): 1:00000002:00000001:00000009:00000003:10
Feb 19 16:58:44 osacon2 kernel: drbd0: drbd0_receiver [1574]: cstate WFReportParams --> WFBitMapS
Feb 19 16:58:44 osacon2 kernel: drbd0: meta connection shut down by peer.
Feb 19 16:58:44 osacon2 kernel: drbd0: drbd0_asender [1606]: cstate WFBitMapS --> NetworkFailure
Feb 19 16:58:44 osacon2 kernel: drbd0: asender terminated
Feb 19 16:58:44 osacon2 drbd: WARN: stdin/stdout is not a TTY; using /dev/console
Feb 19 16:58:44 osacon2 kernel: drbd0: sock_sendmsg returned -104
Feb 19 16:58:44 osacon2 kernel: drbd0: drbd0_receiver [1574]: cstate NetworkFailure --> BrokenPipe
Feb 19 16:58:44 osacon2 kernel: drbd0: short sent ReportBitMap size=4096 sent=3800
Feb 19 16:58:44 osacon2 rc: Starting drbd: succeeded
Feb 19 16:58:44 osacon2 kernel: drbd0: Secondary/Unknown --> Secondary/Primary
Feb 19 16:58:44 osacon2 kernel: drbd0: sock was shut down by peer
Feb 19 16:58:44 osacon2 kernel: drbd0: drbd0_receiver [1574]: cstate BrokenPipe --> BrokenPipe
Feb 19 16:58:44 osacon2 kernel: drbd0: short read expecting header on sock: r=0
Feb 19 16:58:44 osacon2 kernel: drbd0: worker terminated
Feb 19 16:58:44 osacon2 kernel: drbd0: drbd0_receiver [1574]: cstate BrokenPipe --> Unconnected
Feb 19 16:58:44 osacon2 kernel: drbd0: Connection lost.
Feb 19 16:58:44 osacon2 kernel: drbd0: drbd0_receiver [1574]: cstate Unconnected --> WFConnection

/etc/drbd.conf

resource drbd0 {
  protocol C;
  incon-degr-cmd "echo '!DRBD! pri on incon-degr' | wall ; sleep 60 ; halt -f";

  startup {
    degr-wfc-timeout 120;    # 2 minutes.
  }

  disk {
    on-io-error detach;
  }

  net {
    on-disconnect reconnect;
  }

  syncer {
    rate 15M;
    group 1;
    al-extents 257;
  }

  on osacon1.osa.int {
    device    /dev/drbd0;
    disk      /dev/hda6;
    address   10.127.0.2:7788;
    meta-disk internal;
  }

  on osacon2.osa.int {
    device    /dev/drbd0;
    disk      /dev/hda6;
    address   10.127.0.3:7788;
    meta-disk internal;
  }
}

Thanks,
JT
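
For anyone comparing the "I am" and "Peer" lines in the osacon1 log by eye, here is a minimal Python sketch that splits the two counter strings and reports the first field where they differ. The field names are an assumption about how DRBD 0.7 prints its generation counters (consistency flag, then human, timeout, connected and arbitrary counts, then flag characters); they do not appear in the logs themselves. The left-to-right comparison is likewise only a guess at how the two sides are ranked, not a statement of what the driver does.

```python
import re

# Assumed field names for the colon-separated counters (not from the logs).
FIELDS = ["consistent", "human_cnt", "timeout_cnt",
          "connected_cnt", "arbitrary_cnt", "flags"]

# The two counter lines as logged on osacon1 (timestamp prefix dropped).
LOG = """\
drbd0: I am(P): 1:00000002:00000001:00000009:00000003:10
drbd0: Peer(S): 1:00000002:00000001:0000000a:00000002:10
"""


def parse_counters(log):
    """Return {'I am': {...}, 'Peer': {...}} keyed by the assumed field names."""
    result = {}
    pattern = re.compile(r"(I am|Peer)\([PS]\): ([0-9a-fA-F:]+)")
    for who, raw in pattern.findall(log):
        parts = raw.split(":")
        # Counters are parsed as hex; the trailing flags stay a raw string.
        values = [int(p, 16) for p in parts[:-1]] + [parts[-1]]
        result[who] = dict(zip(FIELDS, values))
    return result


def first_difference(mine, peer):
    """Return (field, mine, peer) for the first differing counter, else None."""
    for name in FIELDS[:-1]:  # the flags field is skipped in this sketch
        if mine[name] != peer[name]:
            return name, mine[name], peer[name]
    return None


counters = parse_counters(LOG)
diff = first_difference(counters["I am"], counters["Peer"])
print(diff)
```

On these two lines the first mismatch is the fourth field, where the peer (osacon2) holds 0x0a against 0x09 on osacon1. If the counters really are ranked in this order, that would be consistent with osacon1 concluding that it, the current Primary, should become the sync target.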