[DRBD-user] "Remote failed" and "State change failed" while trying the stress test

Lars Ellenberg lars.ellenberg at linbit.com
Wed Sep 7 12:09:16 CEST 2011



On Tue, Sep 06, 2011 at 01:15:16PM +0900, Junko IKEDA wrote:
> Hi,
> 
> When I tried the stress test, I got the following messages.
> "Remote failed to finish a request within ko-count * timeout"
> 

We changed how some "overload" issues are handled a bit; no harm done yet.
Still, you should probably increase the timeouts or ko-count.
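
For a rough sense of scale, the window named in that log message is the
product of the two net options. A minimal sketch, assuming "timeout" is given
in units of 0.1 seconds as in drbd.conf; the function name and the values
below are hypothetical and not taken from this setup:

```python
def request_window_seconds(timeout_tenths: int, ko_count: int) -> float:
    """Seconds a peer may take before "ko-count * timeout" is exceeded.

    Assumes `timeout_tenths` is expressed in units of 0.1 s, the unit
    drbd.conf uses for the net option "timeout".
    """
    return timeout_tenths / 10 * ko_count


# Hypothetical values for illustration only.
print(request_window_seconds(60, 7))    # 6 s per interval, 7 intervals
print(request_window_seconds(100, 10))  # 10 s per interval, 10 intervals
```

Raising either option widens the window, so a peer that is merely slow under
load gets more time before it is considered failed.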

>  * testing environment
> CPU:	Intel(R) Xeon(R) CPU 5160 @3.00GHz (dual core x2)
> Memory:	1024MB x4
> HDD:	Smart Array P400, SAS 74GB x2 (RAID1+0)
> OS:	RHEL 5.6 x86_64
> DRBD:	8.3.11
> 
>  * "stress" tool options (CPU and Memory utilizations become almost 100%)
> [root@dl380g5c ~]# stress --cpu 4 &
> [root@dl380g5c ~]# stress --vm 1 --vm-bytes 4000000K &
> [root@dl380g5c ~]# cd /drbd; stress -d 1 --hdd-bytes 1G &
> 
> Primary's syslog said;
> Aug 29 11:09:55 dl380g5c kernel: block drbd0: drbd_sync_handshake:
> Aug 29 11:09:55 dl380g5c kernel: block drbd0: self
> C40A42F4B83EC72E:0000000000000000:0001000000000000:0001000000000000
> bits:0 flags:0
> Aug 29 11:09:55 dl380g5c kernel: block drbd0: peer
> C40A42F4B83EC72E:0000000000000000:0001000000000000:0001000000000000
> bits:0 flags:0
> Aug 29 11:09:55 dl380g5c kernel: block drbd0: uuid_compare()=0 by rule 40
> Aug 29 11:09:55 dl380g5c kernel: block drbd0: peer( Unknown ->
> Secondary ) conn( WFReportParams -> Connected ) disk( Consistent ->
> UpToDate ) pdsk( DUnknown -> UpToDate )
> Aug 29 11:09:58 dl380g5c kernel: block drbd0: role( Secondary -> Primary )
> Aug 29 11:10:03 dl380g5c kernel: kjournald starting.  Commit interval 5 seconds
> Aug 29 11:10:03 dl380g5c kernel: EXT3-fs warning: maximal mount count
> reached, running e2fsck is recommended
> Aug 29 11:10:03 dl380g5c kernel: EXT3 FS on drbd0, internal journal
> Aug 29 11:10:03 dl380g5c kernel: EXT3-fs: drbd0: 1 orphan inode deleted
> Aug 29 11:10:03 dl380g5c kernel: EXT3-fs: recovery complete.
> Aug 29 11:10:03 dl380g5c kernel: EXT3-fs: mounted filesystem with
> ordered data mode.
> Aug 29 11:13:37 dl380g5c ntpd[3813]: synchronized to 172.30.17.226, stratum 4
> Aug 29 11:13:36 dl380g5c ntpd[3813]: time reset -0.475105 s
> Aug 29 11:13:36 dl380g5c ntpd[3813]: kernel time sync enabled 0001
> 
>  * something timeout...
> Aug 29 11:13:43 dl380g5c kernel: block drbd0: Remote failed to finish
> a request within ko-count * timeout
> Aug 29 11:13:45 dl380g5c kernel: block drbd0: State change failed:
> Refusing to be Primary while peer is not outdated
> Aug 29 11:13:45 dl380g5c kernel: block drbd0:   state = { cs:Connected
> ro:Primary/Secondary ds:UpToDate/UpToDate r----- }
> Aug 29 11:13:45 dl380g5c kernel: block drbd0:  wanted = { cs:Timeout
> ro:Primary/Unknown ds:UpToDate/DUnknown r----- }

I see.
Yes, there seem to be some parts missing in the "improved logic".
This is supposed to be fixed in 8.3.12, which should be
released this month.

No harm done, though.

> Aug 29 11:17:59 dl380g5c ntpd[3813]: synchronized to 172.30.17.226, stratum 4
> 
>  * ko-count is expired, but it's not zero.
> Aug 29 11:24:13 dl380g5c kernel: block drbd0: [drbd0_worker/4001]
> sock_sendmsg time expired, ko = 2
> Aug 30 04:35:00 dl380g5c ntpd[3813]: no servers reachable
> Aug 30 04:52:04 dl380g5c ntpd[3813]: synchronized to 172.30.17.226, stratum 4
> 
>  * at this time, I tried to write/read some files on the replication
> area, and it succeeded.
>
> Should I disconnect the Secondary node manually?

No need.

It recovered in time all by itself.

The existing behaviour is retained: when ko-count reaches zero, we disconnect.
This is not sufficient, though, if the peer accepts a single request but
never completes it, and the application waits for this particular request
to finish before it sends anything more.

What we added is a forced disconnect whenever a single request takes
longer than "timeout * ko-count". Apparently this was still incomplete
when used together with fencing policies.
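
The two conditions can be pictured roughly like this. It is only an
illustrative model of the description above, not the actual DRBD code; the
function and parameter names are made up for the example, and times are in
seconds:

```python
from typing import Optional


def should_disconnect(ko_remaining: int,
                      oldest_request_age: Optional[float],
                      timeout: float,
                      ko_count: int) -> bool:
    """Illustrative model only; every name here is hypothetical.

    ko_remaining       -- what is left of the ko countdown
    oldest_request_age -- seconds the oldest unfinished request has been
                          pending, or None if nothing is pending
    timeout, ko_count  -- the configured values (timeout in seconds)
    """
    # Retained behaviour: the countdown reached zero.
    if ko_remaining == 0:
        return True
    # Added behaviour: one single request has been pending for longer
    # than timeout * ko-count, even though the countdown has not run out.
    if oldest_request_age is not None and oldest_request_age > timeout * ko_count:
        return True
    return False


# Countdown still at 2, but one request has been stuck for 50 s,
# which is longer than the 6 s * 7 = 42 s window.
print(should_disconnect(2, 50.0, timeout=6.0, ko_count=7))
```

With only the first check, a peer that swallows one request and never answers
would not be disconnected as long as nothing else is sent.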

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list   --   I'm subscribed


