Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
On Mon, Aug 20, 2007 at 01:50:51PM +1000, Maciek Olczak wrote:
> Hi all
>
> I hope someone will be able to help with a rather stubbornly persistent and yet
> random problem which has been bugging us for over two months. Every night
> (backup time) six replica devices in all are asked to disconnect with:
>
>     drbdadm disconnect replica_resource
>
> On just one server this at times causes the process
>
>     drbdsetup /dev/drbd1 disconnect
>
> to hang in such a way that it cannot be killed. But the device is disconnected
> and replication is stopped. This may not happen for six nights, and then it will
> hang just a few hours after the system was rebooted.
>
> The misbehaving server is one of two pairs of servers which run the same version
> of DRBD (drbd-0.7.24-1) and kernel (Red Hat EL4 2.6.9-55.ELsmp), although they
> are of different architectures (i386 and x86_64). The other difference between
> the two pairs is that one uses hardware RAIDed devices and the one in question
> uses native metadisk (/dev/md?) devices. Another piece of "exotic" configuration
> is that the IP addresses the DRBD devices are bound to sit on OpenVPN's tun0
> interfaces.
>
> I have been at pains to find out what the difference is between the two servers
> in this pair that causes this problem. I reconciled DRBD config files and OS
> RPMs and still did not get rid of it.
>
> Guided by http://lists.linbit.com/pipermail/drbd-user/2006-April/004832.html
> I gathered the following evidence:
>
> # cat /proc/drbd
>
> version: 0.7.24 (api:79/proto:74)
> SVN Revision: 2875 build by buildsystem at linbit, 2007-06-04 14:03:37
>  0: cs:Connected st:Primary/Secondary ld:Consistent
>     ns:32298620 nr:0 dw:31831224 dr:76573961 al:1370841 bm:10302 lo:3 pe:4 ua:0 ap:2
>  1: cs:BrokenPipe st:Secondary/Unknown ld:Consistent
>     ns:0 nr:5038052 dw:35515816 dr:127195916 al:651 bm:5783 lo:3 pe:0 ua:3 ap:0
                                                              ^^^^ ^^^^ ^^^^
which "drbd protocol"? C?
there are three requests pending against the local disk (lo:3).
they are also unacknowledged (ua:3), for that reason.

> # ps -ef|grep drbd|grep -v grep
>
> root      2057     1  0 Aug19 ?        00:01:10 [drbd1_receiver]
> root      4495     1  0 Aug15 ?        00:02:05 [drbd0_receiver]
> root     18176     1  0 00:46  ?        00:00:15 [drbd0_asender]
> root     20219     1  0 01:18  ?        00:00:00 drbdsetup /dev/drbd1 disconnect
> root     26539     1  0 00:35  ?        00:00:07 [drbd0_worker]
>
> # ps axo pid,wchan=WIDE-WCHAN-COLUMN,cmd -o comm | grep -w 2057 | grep -v grep
>
>  2057 drbd_wait_ee         [drbd1_receiver] drbd1_receiver

the receiver waits for the reference count to reach zero.

> # ps axo pid,wchan=WIDE-WCHAN-COLUMN,cmd -o comm | grep 20219 | grep -v grep
>
> 20219 drbd_thread_stop     drbdsetup /dev/d drbdsetup
>
> # cat /proc/20219/status
>
> Name:   drbdsetup
> State:  D (disk sleep)

... and this waits for the receiver to terminate.

> Aug 20 01:18:30 butterfly1 kernel: drbd1: drbdsetup [20219]: cstate Connected --> Unconnected
> Aug 20 01:18:30 butterfly1 kernel: drbd1: drbd1_receiver [2057]: cstate Unconnected --> BrokenPipe
> Aug 20 01:18:30 butterfly1 kernel: drbd1: short read expecting header on sock: r=-512
> Aug 20 01:18:30 butterfly1 kernel: drbd1: worker terminated
> Aug 20 01:18:30 butterfly1 kernel: drbd1: asender terminated
>
> #
>
> The first block in the above is for the replica device on the remote server being
> disconnected, backed up and reconnected. The second block is ALL there is in
> messages (and dmesg) for this stuck "disconnect" command.
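One more piece of evidence that could be gathered the next time this happens
(a sketch only; it assumes the RHEL4 kernel was built with magic SysRq support,
which stock Red Hat kernels normally include): dump the kernel stacks of all
tasks and look at the two stuck ones.

  # echo 1 > /proc/sys/kernel/sysrq
  # echo t > /proc/sysrq-trigger
  # dmesg

The resulting traces show the full kernel call chain in which drbd1_receiver
(2057) and drbdsetup (20219) are sleeping, which is more specific than the
single wchan symbol reported by ps.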
either the local disk just does not process these pending requests, or we have
some deadlock within drbd, where the only one that could decrement those
reference counts is itself waiting for that to happen, or is dead already.
I've seen both in the past. we fixed race conditions that resulted in deadlocks
in drbd before; we may have more of those. but I've also seen drbd linger in
BrokenPipe or NetworkFailure for ~20 minutes before suddenly recovering by
itself, so I doubt it is a deadlock within drbd that would not recover.

I know that sometimes requests may be queued almost indefinitely, especially on
MD, which strips the BIO_RW_SYNC bit off requests passed down to it; I wrote a
patch for that myself. maybe it helps to run "sync", or to enable the magic
sysrq key ("echo 1 > /proc/sys/kernel/sysrq") and then trigger an emergency
sync ("echo s > /proc/sysrq-trigger").

> There is one other curious thing I noticed. When DRBD is restarted (reboot) and
> both devices are reconnected, the device which was cs:BrokenPipe will receive a
> fair bit of re-sync data, as one would expect after not being connected for
> a few busy hours. But the odd thing is that when the primary device is
> reconnected, a similarly large amount of data is re-synced, as if this device
> had been disconnected just as long as the replica, despite being off for just
> the 2-3 minutes it takes to reboot. Example:
>
>  0: cs:WFConnection st:Primary/Unknown ld:Consistent
>     ns:0 nr:0 dw:3600 dr:104253 al:157 bm:284 lo:0 pe:0 ua:0 ap:0
>  1: cs:SyncTarget st:Secondary/Primary ld:Inconsistent
>     ns:0 nr:41584 dw:41580 dr:0 al:0 bm:27 lo:266 pe:348 ua:266 ap:0
>         [=>..................] sync'ed:  5.1% (848796/889276)K
>
>  0: cs:SyncSource st:Primary/Secondary ld:Consistent
>     ns:6380 nr:0 dw:11088 dr:211713 al:492 bm:616 lo:1 pe:57 ua:457 ap:0
>         [>...................] sync'ed:  1.6% (517976/524116)K
>         finish: 0:01:20 speed: 6,140 (6,140) K/sec
>  1: cs:Connected st:Secondary/Primary ld:Consistent
>     ns:0 nr:898404 dw:898404 dr:0 al:0 bm:2570 lo:6 pe:0 ua:6 ap:0
>         finish: 0:01:02 speed: 13,492 (13,492) K/sec
>
> I've been monitoring both DRBD devices' stats as well as the underlying disk
> devices' stats while the replica device is disconnected and stuck. During that
> time the primary DRBD device (drbd0), its remote replica (drbd1) and both
> metadisk devices (md?) show data being sent/received via the network and
> written to the metadisks. Why then do so many hundreds of MBs need to be
> re-synced?

huh? why is drbd1 the remote replica of drbd0? you lost me here.

but anyways: if you reboot a primary without making it secondary first, it
looks like a primary crash, so we have to resync the area covered by the
activity log. could that explain your observation?

-- 
: Lars Ellenberg                            Tel +43-1-8178292-0  :
: LINBIT Information Technologies GmbH      Fax +43-1-8178292-82 :
: Vivenotgasse 48, A-1120 Vienna/Europe    http://www.linbit.com :
__
please use the "List-Reply" function of your email client.
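If the activity-log explanation above is indeed the cause of the large resync
after a clean reboot, demoting the primary before rebooting should avoid the
"primary crash" path. A minimal sketch, with placeholder names ("/mnt/data"
stands for whatever mounts /dev/drbd0, "r0" for the resource name used in
drbd.conf):

  # umount /mnt/data
  # drbdadm secondary r0
  # reboot

With the device demoted and consistent at shutdown, reconnecting afterwards
should only transfer the blocks the peer actually changed while the node was
down, rather than the whole area covered by the activity log. The packaged
drbd init script is meant to take care of this during an orderly shutdown,
provided nothing still holds the device open at that point.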