Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hello, I'm now setting up a fairly standard Primary/Primary configuration: OCFS2 on top of DRBD, on two machines running Debian testing (Linux 2.6.25, DRBD 8.0.12 as packaged). It works fine most of the time so far, except one thing: when both boxes are rebooted. I noticed this when testing shutdowns with Network UPS Tools. - node1 is NUT master (with RS232 connection to the UPS), node2 is NUT slave (gets UPS status over the network), both are powered from the same UPS - run "upsmon -c fsd" on node1 to simulate low battery condition - node2 (NUT slave) shuts down first, does failover to node1 - node1 (NUT master) shuts down last (after 1 minute FINALDELAY) - UPS powers down - UPS powers up again, both nodes start - node1 becomes Primary, node2 becomes Secondary (fails to mount OCFS2!) - manually reboot node2 - now Primary, mounts OCFS2 as it should - reboot any single node - no problem, still Primary/Primary I verified everything shuts down cleanly before the power is cut. What could be wrong, why is the extra reboot of node2 necessary to make it Primary again? What are these TOO_SMALL errors below? Kernel messages from node2 when the problem occurs: disk( Diskless -> Attaching ) Starting worker thread (from cqueue [1750]) Found 4 transactions (16 active extents) in activity log. max_segment_size ( = BIO size ) = 32768 drbd_bm_resize called with capacity == 209708728 resync bitmap: bits=26213591 words=409588 size = 100 GB (104854364 KB) reading of bitmap took 13 jiffies recounting of set bits took additional 1 jiffies 0 KB (0 bits) marked out-of-sync by on disk bit-map. disk( Attaching -> UpToDate ) Writing meta data super block now. Barriers not supported on meta data device - disabling conn( StandAlone -> Unconnected ) Starting receiver thread (from drbd0_worker [2166]) receiver (re)started conn( Unconnected -> WFConnection ) Handshake successful: DRBD Network Protocol version 86 Peer authenticated using 20 bytes of 'sha1' HMAC conn( WFConnection -> WFReportParams ) Starting asender thread (from drbd0_receiver [2180]) peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate ) Writing meta data super block now. Requested state change failed by peer: TOO_SMALL Requested state change failed by peer: TOO_SMALL State change failed: State changed was refused by peer node state = { cs:WFBitMapT st:Secondary/Secondary ds:UpToDate/UpToDate r--- } wanted = { cs:WFBitMapT st:Primary/Secondary ds:UpToDate/UpToDate r--- } conn( WFBitMapT -> WFSyncUUID ) peer( Secondary -> Primary ) conn( WFSyncUUID -> SyncTarget ) disk( UpToDate -> Inconsistent ) Began resync as SyncTarget (will sync 48 KB [12 bits set]). Writing meta data super block now. Resync done (total 1 sec; paused 0 sec; 48 K/sec) conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate ) Writing meta data super block now. local disk flush failed with status -95 Kernel messages from node2 on subsequent reboot which fixes it: disk( Diskless -> Attaching ) Starting worker thread (from cqueue [1736]) Found 4 transactions (16 active extents) in activity log. max_segment_size ( = BIO size ) = 32768 drbd_bm_resize called with capacity == 209708728 resync bitmap: bits=26213591 words=409588 size = 100 GB (104854364 KB) reading of bitmap took 13 jiffies recounting of set bits took additional 1 jiffies 0 KB (0 bits) marked out-of-sync by on disk bit-map. disk( Attaching -> UpToDate ) Writing meta data super block now. Barriers not supported on meta data device - disabling conn( StandAlone -> Unconnected ) Starting receiver thread (from drbd0_worker [2152]) receiver (re)started conn( Unconnected -> WFConnection ) Handshake successful: DRBD Network Protocol version 86 Peer authenticated using 20 bytes of 'sha1' HMAC conn( WFConnection -> WFReportParams ) Starting asender thread (from drbd0_receiver [2166]) peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate ) Writing meta data super block now. role( Secondary -> Primary ) Writing meta data super block now. conn( WFBitMapT -> WFSyncUUID ) conn( WFSyncUUID -> SyncTarget ) disk( UpToDate -> Inconsistent ) Began resync as SyncTarget (will sync 4 KB [1 bits set]). Writing meta data super block now. Resync done (total 1 sec; paused 0 sec; 4 K/sec) conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate ) Writing meta data super block now. Suspicious lines (only when node2 fails to become Primary): Requested state change failed by peer: TOO_SMALL Requested state change failed by peer: TOO_SMALL State change failed: State changed was refused by peer node state = { cs:WFBitMapT st:Secondary/Secondary ds:UpToDate/UpToDate r--- } wanted = { cs:WFBitMapT st:Primary/Secondary ds:UpToDate/UpToDate r--- } local disk flush failed with status -95 Thanks, Marek