Hello all,
First post here and first experience with drbd. Using DRBD v.0.6.12
(because it's the latest version marked as stable on Gentoo -- plan is to
acquire some experience with it, then try drbd 0.7x).
Running a Gentoo-hardened kernel, version 2.4.28, on i686. The hardware is
identical on both nodes: three NICs each, one on each of a class A, B, and
C network, plus a serial connection.
N.B.: Gentoo uses devfs.
Problem: We cannot work out why the kernel panics when the init script is
started and stopped. The init script and drbd.conf are copied in below.
Observations: drbd appears to work satisfactorily when started with the
sequence of drbdsetup steps described in the manual
(http://www.slackworks.com/~dkrovich/DRBD/usingdrbdsetup.html). drbd also
starts with "/drbd start" on the command line after loading the drbd
kernel module. However, the following detail may indicate that all is not
working optimally:
* during a full sync, "cat /proc/drbd" on node 1 reports "0 - cs: Connected
st: Primary/Secondary" but "1 - cs: Unconfigured", while on node 2 "cat
/proc/drbd" reports "0 - cs: Connected st: Secondary/Primary". Why does
node 1 show "Unconfigured" instead of one of the "WF" states we expected
from reading the docs?
On the other hand:
* syncing speed is up to 12.5MB/s, as the docs state it should be, and
"cat /proc/drbd" reports sync progress on both nodes.
* there appear to have been no problems mounting and using the file system
during 2 weeks of constant active use.
* running md5sum on fully synced disks produces identical results.
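As a sanity check on that sync rate: 12.5MB/s is simply a saturated
100Mbit link expressed in bytes, which is presumably why the docs quote
that ceiling (our interpretation, not stated in the docs):

```python
# Sanity check (our assumption): 12.5 MB/s is the line rate of a
# saturated 100 Mbit/s Fast Ethernet link, converted from bits to bytes.
link_bits_per_sec = 100 * 1000000   # 100 Mbit/s
bytes_per_sec = link_bits_per_sec // 8
print(bytes_per_sec)                # 12500000 bytes/s, i.e. 12.5 MB/s
```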
Here is our drbd.conf:
resource drbd0 {
  protocol=C
  fsckcmd=/bin/true

  disk {
    do-panic
    disk-size=39078112
  }

  net {
    sync-max=8M     # bytes/sec
    timeout=60
    connect-int=10
    ping-int=10
  }

  on ns1 {
    device=/dev/nbd/0
    disk=/dev/hdd1
    address=10.0.0.1
    port=7789
  }

  on ns2 {
    device=/dev/nbd/0
    disk=/dev/hdd1
    address=10.0.0.2
    port=7789
  }
}
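For comparison, the drbdsetup sequence from the linked doc that this config
corresponds to looks roughly like the following -- quoted from memory, so
treat the exact option syntax as an assumption rather than gospel:

```
# Rough drbdsetup equivalent of the config above (drbd 0.6 syntax, from
# memory of the usingdrbdsetup doc -- option spelling may be off).
# On ns1:
drbdsetup /dev/nbd/0 disk /dev/hdd1
drbdsetup /dev/nbd/0 net 10.0.0.1 10.0.0.2 C
drbdsetup /dev/nbd/0 primary
# On ns2: the same, with the two addresses swapped and no "primary" step.
```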
And the Gentoo-installed init script:
#!/sbin/runscript

depend() {
        need net
        before heartbeat
        after sshd              # In case there are sync problems
}

start() {
        ebegin "Starting drbd mirror driver"
        ${DRBD} ${DRBDDEV} start
        if [ "$?" == "1" ]; then        # In case you decide this
                eend 0                  # node is primary the
        fi                              # script returns 1
        eend $?
}

stop() {
        ebegin "Stopping drbd mirror driver"
        ${DRBD} ${DRBDDEV} stop
        eend $?
}
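The exit-status mapping that start() relies on can be exercised on its
own. This is a sketch of our own (the map_drbd_status helper is not part
of the init script) of the behaviour the script's comments describe:
drbd's start action exits 1 when this node becomes primary, which should
be reported as success rather than failure:

```shell
#!/bin/sh
# Sketch (hypothetical helper): reproduce the exit-status mapping the
# init script's comments describe -- drbd's start exits 1 when this node
# becomes primary, which should count as success (0), while every other
# status passes through unchanged.
map_drbd_status() {
    rc=$1
    if [ "$rc" -eq 1 ]; then
        echo 0          # "we are primary" -> success
    else
        echo "$rc"      # anything else passes through unchanged
    fi
}

map_drbd_status 1   # prints 0
map_drbd_status 0   # prints 0
map_drbd_status 2   # prints 2
```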
Getting drbd to start via the init script is necessary for heartbeat.
When heartbeat's init script is run after starting drbd as described
above, the following log is produced:
Feb 22 12:19:01 [heartbeat] info: **************************
Feb 22 12:19:01 [heartbeat] info: Configuration validated. Starting
heartbeat 1.2.3
Feb 22 12:19:01 [heartbeat] info: heartbeat: version 1.2.3
Feb 22 12:19:01 [heartbeat] info: Heartbeat generation: 3
Feb 22 12:19:01 [heartbeat] info: Starting serial heartbeat on tty
/dev/ttyS0 (19200 baud)
Feb 22 12:19:01 [heartbeat] info: UDP Broadcast heartbeat started on port
694 (694) interface eth2
Feb 22 12:19:01 [heartbeat] info: ping heartbeat started.
Feb 22 12:19:01 [heartbeat] info: pid 15074 locked in memory.
Feb 22 12:19:01 [heartbeat] info: pid 15075 locked in memory.
Feb 22 12:19:01 [heartbeat] info: pid 15076 locked in memory.
Feb 22 12:19:01 [heartbeat] info: pid 15078 locked in memory.
Feb 22 12:19:01 [heartbeat] info: pid 15079 locked in memory.
Feb 22 12:19:01 [heartbeat] info: pid 15036 locked in memory.
Feb 22 12:19:01 [heartbeat] info: Local status now set to: 'up'
Feb 22 12:19:02 [heartbeat] info: pid 15077 locked in memory.
Feb 22 12:19:02 [heartbeat] info: Link ns1:eth2 up.
Feb 22 12:19:02 [heartbeat] info: pid 15080 locked in memory.
Feb 22 12:19:02 [heartbeat] info: Link 191.250.1.1:191.250.1.1 up.
Feb 22 12:19:02 [heartbeat] info: Status update for node 191.250.1.1:
status ping
Feb 22 12:19:58 [heartbeat] WARN: TTY write timeout on [/dev/ttyS0] (no
connection or bad cable? [see documentation])
Feb 22 12:21:02 [heartbeat] WARN: node ns2: is dead
Feb 22 12:21:02 [heartbeat] info: Local status now set to: 'active'
Feb 22 12:21:02 [heartbeat] info: Starting child client
"/usr/lib/heartbeat/ipfail" (65,65)
Feb 22 12:21:02 [heartbeat] WARN: No STONITH device configured.
Feb 22 12:21:02 [heartbeat] WARN: Shared disks are not protected.
Feb 22 12:21:02 [heartbeat] info: Resources being acquired from ns2.
Feb 22 12:21:02 [heartbeat] info: Starting "/usr/lib/heartbeat/ipfail" as
uid 65 gid 65 (pid 15192)
Feb 22 12:21:02 [heartbeat] debug: notify_world: setting SIGCHLD Handler
to SIG_DFL
Feb 22 12:21:02 [heartbeat] debug: StartNextRemoteRscReq(): child count 1
- Last output repeated twice -
Feb 22 12:21:03 [heartbeat] info: Local Resource acquisition completed.
Feb 22 12:21:03 [heartbeat] info: Initial resource acquisition complete
(T_RESOURCES(us))
Feb 22 12:21:03 [heartbeat] debug: notify_world: setting SIGCHLD Handler
to SIG_DFL
Feb 22 12:21:14 [heartbeat] info: Local Resource acquisition completed.
(none)
Feb 22 12:21:14 [heartbeat] info: local resource transition completed.
Feb 22 12:21:41 [heartbeat] WARN: Shutdown delayed until current resource
activity finishes.
Many thanks for all observations received.
Barry Schatz