On Thu, May 28, 2009 at 03:40:53PM +0200, Dominik Klein wrote:
> Hi
>
> i just told my cluster (pacemaker 1.0.3) to stop a drbd device (crm
> resource stop). DRBD version is 8.2.7
>
> This happened:
>
> May 28 15:10:27 cl-virt-1 kernel: drbd7: role( Primary -> Secondary )
> May 28 15:10:28 cl-virt-1 kernel: drbd7: Requested state change failed
> by peer: Concurrent state changes detected and aborted (-19)
> May 28 15:10:28 cl-virt-1 kernel: drbd7: Requested state change failed
> by peer: Concurrent state changes detected and aborted (-19)
> May 28 15:10:28 cl-virt-1 kernel: drbd7: Requested state change failed
> by peer: Concurrent state changes detected and aborted (-19)
> May 28 15:10:28 cl-virt-1 kernel: drbd7: State change failed: State
> changed was refused by peer node
> ------^ there's a typo btw--------------------------------------
> May 28 15:10:28 cl-virt-1 kernel: drbd7: state = { cs:Connected
> st:Secondary/Secondary ds:UpToDate/UpToDate r--- }
> May 28 15:10:28 cl-virt-1 kernel: drbd7: wanted = { cs:Connected
> st:Secondary/Secondary ds:Diskless/UpToDate r--- }
> May 28 15:10:28 cl-virt-1 kernel: drbd7: peer( Secondary -> Unknown )
> conn( Connected -> TearDown ) pdsk( UpToDate -> DUnknown )
> May 28 15:10:28 cl-virt-1 kernel: drbd7: asender terminated
> May 28 15:10:28 cl-virt-1 kernel: drbd7: Terminating asender thread
> May 28 15:10:28 cl-virt-1 kernel: drbd7: Connection closed
> May 28 15:10:28 cl-virt-1 kernel: drbd7: conn( TearDown -> Unconnected )
> May 28 15:10:28 cl-virt-1 kernel: drbd7: receiver terminated
> May 28 15:10:28 cl-virt-1 kernel: drbd7: Restarting receiver thread
> May 28 15:10:28 cl-virt-1 kernel: drbd7: receiver (re)started
> May 28 15:10:28 cl-virt-1 kernel: drbd7: conn( Unconnected ->
> WFConnection )
> May 28 15:10:28 cl-virt-1 lrmd: [21712]: info: RA output:
> (drbd-micro-web01:0:stop:stderr) 2009/05/28_15:10:28 ERROR: micro-web01:
> Command output: /dev/drbd7: State change failed: (-10) State changed was
> refused by peer node /dev/drbd7: State change failed: (-10) State
> changed was refused by peer node Command 'drbdsetup /dev/drbd7 down'
> terminated with exit code 11
> May 28 15:10:28 cl-virt-1 lrmd: [21712]: info: RA output:
> (drbd-micro-web01:0:stop:stdout) /dev/drbd7: State change failed: (-10)
> State changed was refused by peer node /dev/drbd7: State change failed:
> (-10) State changed was refused by peer node Command 'drbdsetup
> /dev/drbd7 down' terminated with exit code 11
>
> This is the log from the other node:
>
> May 28 15:10:27 cl-virt-2 kernel: drbd7: peer( Primary -> Secondary )
> May 28 15:10:28 cl-virt-2 kernel: drbd7: peer( Secondary -> Unknown )
> conn( Connected -> Disconnecting ) pdsk( UpToDate -> DUnknown )
> May 28 15:10:28 cl-virt-2 kernel: drbd7: short read expecting header on
> sock: r=-512
> May 28 15:10:28 cl-virt-2 kernel: drbd7: asender terminated
> May 28 15:10:28 cl-virt-2 kernel: drbd7: Terminating asender thread
> May 28 15:10:28 cl-virt-2 kernel: drbd7: Connection closed
> May 28 15:10:28 cl-virt-2 kernel: drbd7: conn( Disconnecting ->
> StandAlone )
> May 28 15:10:28 cl-virt-2 kernel: drbd7: receiver terminated
> May 28 15:10:28 cl-virt-2 kernel: drbd7: Terminating receiver thread
> May 28 15:10:28 cl-virt-2 kernel: drbd7: disk( UpToDate -> Diskless )
> May 28 15:10:28 cl-virt-2 kernel: drbd7: drbd_bm_resize called with
> capacity == 0
> May 28 15:10:28 cl-virt-2 kernel: drbd7: worker terminated
> May 28 15:10:28 cl-virt-2 kernel: drbd7: Terminating worker thread
>
> What exactly happened there and how can I avoid it?

I have no idea.

possibly both have been told to "down" at the exact same time.

there are a few "cluster wide state changes", and while one of those
is pending, no other cluster wide state change is allowed.

apparently virt-1 attempted to detach (become Diskless) while virt-2
attempted (and finally succeeded) to disconnect.

this probably cannot be avoided, though it should be very rare.
it may be worked around in the resource agent by some retry logic.

I said it should be rare, so it should not be easy to reproduce.
if you find a (simple) procedure to reproduce it, let me know,
we will find out why it happens, and make it behave better.

if you cannot easily reproduce it: => WONTFIX.

no, "enable debug" in drbd would not help to understand what exactly
happened. though it is possible to use the tracing framework.
documentation of drbd tracing: see drbd source code.

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list -- I'm subscribed
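
The retry workaround suggested in the reply could be sketched roughly as follows. This is only an illustration, not the actual resource agent (which is a shell script): the function name down_with_retry, the attempt count, the delay, and the choice to treat exit code 11 (the code seen in the log above) as retryable are all assumptions made for this sketch. The only thing taken from the thread is the command 'drbdsetup /dev/drbd7 down'.

```python
import subprocess
import time

# Exit code 11 is what 'drbdsetup /dev/drbd7 down' returned in the log above.
# Treating it as "safe to retry" is an assumption of this sketch.
RETRYABLE_EXIT_CODES = {11}


def run_command(argv):
    """Default runner: execute the command and return its exit code."""
    return subprocess.run(argv).returncode


def down_with_retry(device, attempts=5, delay=1.0,
                    runner=run_command, sleep=time.sleep):
    """Try 'drbdsetup <device> down' up to `attempts` times.

    Retries only while the exit code is in RETRYABLE_EXIT_CODES.
    Returns the last exit code; 0 means the device was taken down.
    """
    argv = ["drbdsetup", device, "down"]
    rc = None
    for attempt in range(1, attempts + 1):
        rc = runner(argv)
        if rc == 0 or rc not in RETRYABLE_EXIT_CODES:
            return rc  # success, or a failure that retrying will not fix
        if attempt < attempts:
            sleep(delay)  # give the peer's pending state change time to finish
    return rc
```

Called as down_with_retry("/dev/drbd7"), it returns 0 as soon as one attempt succeeds. The runner and sleep parameters exist only so the logic can be exercised without a real DRBD device. A short delay between attempts is used because, as described above, the refusal comes from another cluster wide state change still being pending on the peer.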