Hello,
I'm now setting up a fairly standard Primary/Primary configuration:
OCFS2 on top of DRBD, on two machines running Debian testing
(Linux 2.6.25, DRBD 8.0.12 as packaged). So far it works fine, with
one exception: when both boxes are rebooted together.
I noticed this while testing shutdowns with Network UPS Tools (NUT):
- node1 is NUT master (with RS232 connection to the UPS),
node2 is NUT slave (gets UPS status over the network),
both are powered from the same UPS
- run "upsmon -c fsd" on node1 to simulate low battery condition
- node2 (NUT slave) shuts down first, does failover to node1
- node1 (NUT master) shuts down last (after 1 minute FINALDELAY)
- UPS powers down
- UPS powers up again, both nodes start
- node1 becomes Primary, node2 becomes Secondary (fails to mount OCFS2!)
- manually reboot node2 - now Primary, mounts OCFS2 as it should
- reboot any single node - no problem, still Primary/Primary
I verified everything shuts down cleanly before the power is cut.
What could be wrong? Why is the extra reboot of node2 necessary to
make it Primary again, and what are the TOO_SMALL errors below?
Kernel messages from node2 when the problem occurs:
disk( Diskless -> Attaching )
Starting worker thread (from cqueue [1750])
Found 4 transactions (16 active extents) in activity log.
max_segment_size ( = BIO size ) = 32768
drbd_bm_resize called with capacity == 209708728
resync bitmap: bits=26213591 words=409588
size = 100 GB (104854364 KB)
reading of bitmap took 13 jiffies
recounting of set bits took additional 1 jiffies
0 KB (0 bits) marked out-of-sync by on disk bit-map.
disk( Attaching -> UpToDate )
Writing meta data super block now.
Barriers not supported on meta data device - disabling
conn( StandAlone -> Unconnected )
Starting receiver thread (from drbd0_worker [2166])
receiver (re)started
conn( Unconnected -> WFConnection )
Handshake successful: DRBD Network Protocol version 86
Peer authenticated using 20 bytes of 'sha1' HMAC
conn( WFConnection -> WFReportParams )
Starting asender thread (from drbd0_receiver [2180])
peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
Writing meta data super block now.
Requested state change failed by peer: TOO_SMALL
Requested state change failed by peer: TOO_SMALL
State change failed: State changed was refused by peer node
state = { cs:WFBitMapT st:Secondary/Secondary ds:UpToDate/UpToDate r--- }
wanted = { cs:WFBitMapT st:Primary/Secondary ds:UpToDate/UpToDate r--- }
conn( WFBitMapT -> WFSyncUUID )
peer( Secondary -> Primary )
conn( WFSyncUUID -> SyncTarget ) disk( UpToDate -> Inconsistent )
Began resync as SyncTarget (will sync 48 KB [12 bits set]).
Writing meta data super block now.
Resync done (total 1 sec; paused 0 sec; 48 K/sec)
conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate )
Writing meta data super block now.
local disk flush failed with status -95
Kernel messages from node2 on subsequent reboot which fixes it:
disk( Diskless -> Attaching )
Starting worker thread (from cqueue [1736])
Found 4 transactions (16 active extents) in activity log.
max_segment_size ( = BIO size ) = 32768
drbd_bm_resize called with capacity == 209708728
resync bitmap: bits=26213591 words=409588
size = 100 GB (104854364 KB)
reading of bitmap took 13 jiffies
recounting of set bits took additional 1 jiffies
0 KB (0 bits) marked out-of-sync by on disk bit-map.
disk( Attaching -> UpToDate )
Writing meta data super block now.
Barriers not supported on meta data device - disabling
conn( StandAlone -> Unconnected )
Starting receiver thread (from drbd0_worker [2152])
receiver (re)started
conn( Unconnected -> WFConnection )
Handshake successful: DRBD Network Protocol version 86
Peer authenticated using 20 bytes of 'sha1' HMAC
conn( WFConnection -> WFReportParams )
Starting asender thread (from drbd0_receiver [2166])
peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
Writing meta data super block now.
role( Secondary -> Primary )
Writing meta data super block now.
conn( WFBitMapT -> WFSyncUUID )
conn( WFSyncUUID -> SyncTarget ) disk( UpToDate -> Inconsistent )
Began resync as SyncTarget (will sync 4 KB [1 bits set]).
Writing meta data super block now.
Resync done (total 1 sec; paused 0 sec; 4 K/sec)
conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate )
Writing meta data super block now.
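The two logs above can also be compared mechanically. The following is a minimal sketch, assuming each log has been saved as a list of text lines; the helper names `normalize` and `only_in_failing` are hypothetical and not part of DRBD or any of its tools. Numbers are masked first so that differing PIDs and resync sizes do not show up as differences.

```python
import re


def normalize(line):
    """Mask every run of digits so PIDs, sizes and counters compare equal."""
    return re.sub(r"\d+", "N", line.strip())


def only_in_failing(failing, working):
    """Return the lines of the failing log that have no counterpart
    (after normalization) anywhere in the working log."""
    seen = {normalize(line) for line in working if line.strip()}
    result = []
    for line in failing:
        if line.strip() and normalize(line) not in seen:
            result.append(line.strip())
    return result


if __name__ == "__main__":
    # Short excerpts from the two logs, used only as sample input.
    failing_log = [
        "Starting worker thread (from cqueue [1750])",
        "Requested state change failed by peer: TOO_SMALL",
        "Began resync as SyncTarget (will sync 48 KB [12 bits set]).",
        "local disk flush failed with status -95",
    ]
    working_log = [
        "Starting worker thread (from cqueue [1736])",
        "role( Secondary -> Primary )",
        "Began resync as SyncTarget (will sync 4 KB [1 bits set]).",
    ]
    for line in only_in_failing(failing_log, working_log):
        print(line)
```

Such a comparison ignores ordering, so it only shows which messages are unique to the failing boot, not where in the sequence the two boots diverge.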
Suspicious lines (only when node2 fails to become Primary):
Requested state change failed by peer: TOO_SMALL
Requested state change failed by peer: TOO_SMALL
State change failed: State changed was refused by peer node
state = { cs:WFBitMapT st:Secondary/Secondary ds:UpToDate/UpToDate r--- }
wanted = { cs:WFBitMapT st:Primary/Secondary ds:UpToDate/UpToDate r--- }
local disk flush failed with status -95
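To see exactly what the refused state change would have altered, the "state" and "wanted" lines can be split into their fields. This is a rough sketch, assuming the `key:value` layout seen in the messages above; `parse_state` and `changed_fields` are hypothetical helper names, and tokens without a colon (such as `r---`) are simply skipped.

```python
def parse_state(line):
    """Parse a '... = { cs:X st:Y ds:Z r--- }' line into a dict of fields."""
    inner = line[line.index("{") + 1 : line.rindex("}")]
    fields = {}
    for token in inner.split():
        if ":" in token:
            key, value = token.split(":", 1)
            fields[key] = value
    return fields


def changed_fields(state_line, wanted_line):
    """Return {field: (current, wanted)} for every field that differs."""
    current = parse_state(state_line)
    wanted = parse_state(wanted_line)
    return {
        key: (current[key], wanted[key])
        for key in current
        if key in wanted and current[key] != wanted[key]
    }


if __name__ == "__main__":
    state = "state = { cs:WFBitMapT st:Secondary/Secondary ds:UpToDate/UpToDate r--- }"
    wanted = "wanted = { cs:WFBitMapT st:Primary/Secondary ds:UpToDate/UpToDate r--- }"
    print(changed_fields(state, wanted))
```

For the two lines above, only the `st` field differs: the request was to go from Secondary/Secondary to Primary/Secondary while the connection state was still WFBitMapT.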
Thanks,
Marek