Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hi all,

I have an issue to report here. I assume this behavior is intended, but this time it caused problems even with a clean shutdown.

First, my setup: a single primary (a virtual machine) and a single secondary (a physical box). Due to a power outage I had to shut down both boxes. A clean shutdown, no hard power off! I shut down the secondary first, followed by the primary.

This is the log from the primary:

Sep 6 08:11:20 backuppc shutdown[20850]: shutting down for system halt
Sep 6 08:11:23 backuppc kernel: drbd0: State change failed: Device is held open by someone
Sep 6 08:11:23 backuppc kernel: drbd0: state = { cs:Connected st:Primary/Secondary ds:UpToDate/UpToDate r--- }
Sep 6 08:11:23 backuppc kernel: drbd0: wanted = { cs:Connected st:Secondary/Secondary ds:UpToDate/UpToDate r--- }
Sep 6 08:11:26 backuppc kernel: VMware memory control driver unloaded
Sep 6 08:11:26 backuppc kernel: Removing vmci device
Sep 6 08:11:26 backuppc kernel: Resetting vmci device
Sep 6 08:11:26 backuppc kernel: Unregistered vmci device.
Sep 6 08:11:26 backuppc kernel: ACPI: PCI interrupt for device 0000:00:07.7 disabled
Sep 6 08:11:39 backuppc kernel: drbd0: peer( Secondary -> Unknown ) conn( Connected -> TearDown ) pdsk( UpToDate -> DUnknown )
Sep 6 08:11:39 backuppc kernel: drbd0: Writing meta data super block now.
Sep 6 08:11:39 backuppc kernel: drbd0: Creating new current UUID
Sep 6 08:11:39 backuppc kernel: drbd0: Writing meta data super block now.
Sep 6 08:11:39 backuppc kernel: drbd0: meta connection shut down by peer.
Sep 6 08:11:39 backuppc kernel: drbd0: asender terminated
Sep 6 08:11:39 backuppc kernel: drbd0: Terminating asender thread
Sep 6 08:11:39 backuppc kernel: drbd0: tl_clear()
Sep 6 08:11:39 backuppc kernel: drbd0: Connection closed
Sep 6 08:11:39 backuppc kernel: drbd0: conn( TearDown -> Unconnected )
Sep 6 08:11:39 backuppc kernel: drbd0: receiver terminated
Sep 6 08:11:39 backuppc kernel: drbd0: receiver (re)started
Sep 6 08:11:39 backuppc kernel: drbd0: conn( Unconnected -> WFConnection )

Both nodes were in a consistent "UpToDate" state before. The DRBD devices didn't come up properly after the outage. The root cause seems to be the disk driver needed to access the disk inside the VM: the primary booted with an updated kernel, so the SCSI driver was not available. Because of this, DRBD could not access the physical disk (and the secondary was still powered off) and went into diskless mode. Additionally, the network was not available, so it couldn't reach any network resources either. I'm not sure about the disk access, but I am sure about the missing network. See this log excerpt:

Sep 6 08:19:30 backuppc kernel: drbd0: Starting receiver thread (from drbd0_worker [3082])
Sep 6 08:19:30 backuppc kernel: drbd0: receiver (re)started
Sep 6 08:19:30 backuppc kernel: drbd0: conn( Unconnected -> WFConnection )
Sep 6 08:19:30 backuppc kernel: drbd0: Unable to bind source sock (-99)
Sep 6 08:19:30 backuppc last message repeated 2 times
Sep 6 08:19:30 backuppc kernel: drbd0: Unable to bind sock2 (-99)
Sep 6 08:19:30 backuppc kernel: drbd0: conn( WFConnection -> Disconnecting )
Sep 6 08:19:30 backuppc kernel: drbd0: Discarding network configuration.
Sep 6 08:19:30 backuppc kernel: drbd0: tl_clear()
Sep 6 08:19:30 backuppc kernel: drbd0: Connection closed
Sep 6 08:19:30 backuppc kernel: drbd0: conn( Disconnecting -> StandAlone )
Sep 6 08:19:30 backuppc kernel: drbd0: ASSERT( mdev->receiver.t_state == None ) in /home/buildsvn/rpmbuild/BUILD/drbd-8.2.6/_kmod_build_/drbd/drbd_main.c:2412
Sep 6 08:19:30 backuppc kernel: drbd0: drbd_bm_resize called with capacity == 0
Sep 6 08:19:30 backuppc kernel: drbd0: worker terminated
Sep 6 08:19:30 backuppc kernel: drbd0: Terminating worker thread
Sep 6 08:19:30 backuppc kernel: drbd0: receiver terminated
Sep 6 08:19:30 backuppc kernel: drbd0: Terminating receiver thread
Sep 6 08:19:30 backuppc kernel: drbd0: State change failed: Refusing to be Primary without at least one UpToDate disk
Sep 6 08:19:30 backuppc kernel: drbd0: state = { cs:StandAlone st:Secondary/Unknown ds:Diskless/DUnknown r--- }
Sep 6 08:19:30 backuppc kernel: drbd0: wanted = { cs:StandAlone st:Primary/Unknown ds:Diskless/DUnknown r--- }

This is the result of "cat /proc/drbd" on the primary now, after fixing the network and driver issues:

version: 8.2.6 (api:88/proto:86-88)
GIT-hash: 3e69822d3bb4920a8c1bfdf7d647169eba7d2eb4 build by buildsvn at c5-i386-build, 2008-10-03 11:42:32
 0: cs:Connected st:Secondary/Secondary ds:Diskless/UpToDate A r---
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 oos:0

And on the secondary:

version: 8.2.6 (api:88/proto:86-88)
GIT-hash: 3e69822d3bb4920a8c1bfdf7d647169eba7d2eb4 build by buildsvn at c5-i386-build, 2008-10-03 11:42:32
 0: cs:Connected st:Secondary/Secondary ds:UpToDate/Diskless A r---
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 oos:0

These are the log entries written at shutdown time on the secondary:

Sep 6 08:10:47 drbd kernel: drbd0: peer( Primary -> Unknown ) conn( Connected -> Disconnecting ) pdsk( UpToDate -> DUnknown )
Sep 6 08:10:47 drbd kernel: drbd0: short read expecting header on sock: r=-512
Sep 6 08:10:47 drbd kernel: drbd0: asender terminated
Sep 6 08:10:47 drbd kernel: drbd0: Terminating asender thread
Sep 6 08:10:47 drbd kernel: drbd0: Writing meta data super block now.
Sep 6 08:10:48 drbd kernel: drbd0: tl_clear()
Sep 6 08:10:48 drbd kernel: drbd0: Connection closed
Sep 6 08:10:48 drbd kernel: drbd0: conn( Disconnecting -> StandAlone )
Sep 6 08:10:48 drbd kernel: drbd0: receiver terminated
Sep 6 08:10:48 drbd kernel: drbd0: Terminating receiver thread
Sep 6 08:10:48 drbd kernel: drbd0: disk( UpToDate -> Diskless )
Sep 6 08:10:48 drbd kernel: drbd0: drbd_bm_resize called with capacity == 0
Sep 6 08:10:48 drbd kernel: drbd0: worker terminated
Sep 6 08:10:48 drbd kernel: drbd0: Terminating worker thread
Sep 6 08:10:48 drbd kernel: drbd: module cleanup done.

To fix the issue I had to promote the secondary to primary and then demote it to secondary again. After that I was able to promote the original primary node, and everything was up and running again (a rough sketch of the command sequence is at the end of this mail). It still seems a little unreliable to me, even though I can explain and understand it: the primary was possibly in some "split-brain" scenario and discarded its disk, as the disk was (possibly) temporarily inaccessible.

Anyway, just wanted to share.

Christian
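
For reference, the recovery sequence looked roughly like this. This is only a sketch, not the exact commands from my shell history: "r0" stands in for the actual resource name, and the attach/connect steps are my assumption of what was needed once the SCSI driver and the network were back.

  # on the secondary (the node that still had an UpToDate disk):
  drbdadm primary r0      # promote it once ...
  drbdadm secondary r0    # ... and demote it again

  # on the original primary, after fixing the driver and network issues:
  drbdadm attach r0       # re-attach the backing disk (assumed step)
  drbdadm connect r0      # only needed if the resource is still StandAlone
  drbdadm primary r0      # promote the original primary again

  # verify on both nodes:
  cat /proc/drbd          # should end up at cs:Connected ds:UpToDate/UpToDate

Depending on the init scripts, the attach/connect part may already be covered by simply restarting the drbd service, so treat those two lines as an assumption rather than a required step.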