[DRBD-user] DRBD stuck after a strong network failure

Wed Apr 26 00:24:36 CEST 2006

/ 2006-04-25 21:48:12 +0200
\ Cyril Bouthors:
> The same thing happened tonight. Here is more information, hope it helps:
> 
> root at ns1:~# ps auxwf | grep drbd
> root       879  0.1  0.0      0     0 ?        D    Apr18  17:40 [drbd0_receiver]
> root     22526  0.0  0.0   1432   444 pts/5    S+   21:39   0:00  |               \_ grep drbd
> root at ns1:~# w
>  21:40:20 up 7 days,  8:11,  4 users,  load average: 194.27, 187.96, 158.50
> (...)
> root at ns1:~# dmesg
> (...)
> nfs: server 10.0.9.254 OK
> nfs: server 10.0.9.254 not responding, still trying
> nfs: server 10.0.9.254 OK
> nfs: server 10.0.9.254 not responding, still trying
> nfs: server 10.0.9.254 OK
> drbd0: [kjournald/1180] sock_sendmsg time expired, ko = 4294967295
> drbd0: [kupdated/6] sock_sendmsg time expired, ko = 4294967295
> drbd0: [kupdated/6] sock_sendmsg time expired, ko = 4294967295
> drbd0: [kupdated/6] sock_sendmsg time expired, ko = 4294967294
> drbd0: [kupdated/6] sock_sendmsg time expired, ko = 4294967293
> drbd0: [kupdated/6] sock_sendmsg time expired, ko = 4294967295
> drbd0: [kjournald/1180] sock_sendmsg time expired, ko = 4294967295
> drbd0: [kjournald/1180] sock_sendmsg time expired, ko = 4294967294
> drbd0: [kjournald/1180] sock_sendmsg time expired, ko = 4294967293
> drbd0: [kjournald/1180] sock_sendmsg time expired, ko = 4294967292
> drbd0: [kjournald/1180] sock_sendmsg time expired, ko = 4294967295
> drbd0: [kjournald/1180] sock_sendmsg time expired, ko = 4294967294
> drbd0: PingAck did not arrive in time.
> drbd0: drbd0_asender [889]: cstate Connected --> NetworkFailure
> drbd0: asender terminated
> drbd0: kjournald [1180]: cstate NetworkFailure --> Timeout
> drbd0: drbd0_receiver [879]: cstate Timeout --> BrokenPipe
> drbd0: short read expecting header on sock: r=-512
> drbd0: short sent UnplugRemote size=8 sent=-1001
> drbd0: worker terminated
> root at ns1:~# ps auxwf | grep kupd
> root         6  0.0  0.0      0     0 ?        D    Apr18   1:43 [kupdated]
> root at ns1:~# cat /proc/drbd

> version: 0.7.15 (api:77/proto:74)
> SVN Revision: 2020 build by root at sqlb1, 2006-01-12 06:14:29
>  0: cs:BrokenPipe st:Primary/Unknown ld:Consistent
>     ns:521240 nr:0 dw:237755488 dr:49577617 al:937152 bm:204 lo:3 pe:3 ua:0 ap:3
                                                                  ^    ^         ^
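
the three counters marked above (lo, pe, ap) are the ones that stay
non-zero while the device is wedged. a minimal Python sketch for pulling
them out of a /proc/drbd snapshot follows. parse_counters is a made-up
helper name, and the field meanings given in the comments are my reading
of the 0.7 counters, so treat them as assumptions:

```python
import re

# Snapshot taken from the output quoted above (build line omitted).
SAMPLE = """\
version: 0.7.15 (api:77/proto:74)
 0: cs:BrokenPipe st:Primary/Unknown ld:Consistent
    ns:521240 nr:0 dw:237755488 dr:49577617 al:937152 bm:204 lo:3 pe:3 ua:0 ap:3
"""


def parse_counters(text):
    """Return {name: int} for every two-letter key followed by a number."""
    return {key: int(value)
            for key, value in re.findall(r"\b([a-z]{2}):(\d+)\b", text)}


counters = parse_counters(SAMPLE)

# Assumed meanings: lo = requests open towards the local disk,
# pe = requests sent to the peer and not yet answered,
# ap = requests from the upper layers that drbd has not completed yet.
stuck = {name: counters[name] for name in ("lo", "pe", "ap") if counters[name] > 0}
print(stuck)
```

if these numbers stay the same across several reads of /proc/drbd while
the connection state no longer changes, requests are most likely stuck
rather than merely slow.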

> root at ns1:~# drbdadm disconnect all
> Child process does not terminate!
> Exiting.
> root at ns1:~# dmesg
> (...)
> drbd0: worker terminated
> root at ns1:~# cat /proc/drbd
> version: 0.7.15 (api:77/proto:74)
> SVN Revision: 2020 build by root at sqlb1, 2006-01-12 06:14:29
>  0: cs:BrokenPipe st:Primary/Unknown ld:Consistent
>     ns:521240 nr:0 dw:237758456 dr:49577617 al:937177 bm:229 lo:3 pe:3 ua:0 ap:3
> root at ns1:~# df -h /drbd
> Filesystem            Size  Used Avail Use% Mounted on
> /dev/drbd0            9.5G  7.6G  2.0G  80% /drbd
> root at ns1:~# ls /drbd
> etc  lost+found  root  usr  var
> root at ns1:~# touch /drbd/foo
> (this hangs....)
> root at ns1:~# mount
> /dev/hda4 on / type xfs (rw,noatime)
> proc on /proc type proc (rw)
> devpts on /dev/pts type devpts (rw,gid=5,mode=620)
> tmpfs on /dev/shm type tmpfs (rw)
> /dev/hda1 on /boot type xfs (rw)
> 10.0.9.254:/drbd/webalizer on /mnt type nfs (rw,addr=10.0.9.254)
> /dev/drbd0 on /drbd type ext3 (rw,nosuid,nodev,noatime)
> root at ns1:~# ps -axo pid,wchan=WIDE-WCHAN-COLUMN -o comm

>   PID WIDE-WCHAN-COLUMN COMMAND
>     6 down              kupdated
>   879 ?                 drbd0_receiver
>  1180 wait_on_buffer    kjournald
>  2527 down              drbdsetup

my preliminary analysis is:
the receiver is stuck in tl_clear, which does a kmalloc(,GFP_KERNEL).
we fixed that in svn to GFP_NOIO already...

even though we ask only for a few bytes, GFP_KERNEL waits for memory and
can trigger fs and block level io, and under certain memory and io
pressure that may end up blocking on itself:
for example if the vm chooses to kick kjournald
while that kjournald is blocked on drbd, because it is in wait_on_buffer()
on some of the buffers that would have their corresponding end_io called
right after tl_clear got that memory...

what I am saying is that it looks like drbd0_receiver waits for memory,
which waits for kupdated, which waits for kjournald to flush some
buffers to disk, which waits for drbd0_receiver to finish the buffers
it is waiting on.
drbdsetup (from drbdadm disconnect all) is probably waiting for the
super block to flush the local buffers, so in the end it is waiting for
the drbd0_receiver, too.
all the other processes in 'D' state are probably waiting for the
kjournald to finish its transaction...
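
the chain described above can be written down as a small wait-for graph.
this is only a toy model in Python to make the cycle visible: the node
names are taken from the analysis, "memory" stands in for the
kmalloc(,GFP_KERNEL) in tl_clear, and find_cycle is a hypothetical
helper, not anything the kernel or drbd provides:

```python
# "A": ["B"] means A cannot make progress until B does.
WAITS_FOR = {
    "drbd0_receiver": ["memory"],      # kmalloc(,GFP_KERNEL) in tl_clear
    "memory": ["kupdated"],            # reclaim wants dirty buffers written
    "kupdated": ["kjournald"],
    "kjournald": ["drbd0_receiver"],   # wait_on_buffer() on drbd buffers
    "drbdsetup": ["drbd0_receiver"],   # from "drbdadm disconnect all"
}


def find_cycle(graph, start):
    """Depth-first walk along wait-for edges; return the first cycle or None."""
    path = []
    on_path = set()

    def visit(node):
        if node in on_path:
            # Closed a loop: report it from the first occurrence of node.
            return path[path.index(node):] + [node]
        on_path.add(node)
        path.append(node)
        for nxt in graph.get(node, []):
            cycle = visit(nxt)
            if cycle:
                return cycle
        path.pop()
        on_path.discard(node)
        return None

    return visit(start)


cycle = find_cycle(WAITS_FOR, "drbdsetup")
print(" -> ".join(cycle))

# With GFP_NOIO the allocation may not start io, so in this model the
# "memory" -> "kupdated" edge goes away and no cycle is left.
no_io = dict(WAITS_FOR, memory=[])
print(find_cycle(no_io, "drbdsetup"))
```

drbdsetup is not part of the cycle itself, it only hangs off it, which
would fit the "Child process does not terminate!" message.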

if you can "reproduce" this scenario,
please try with current drbd-0.7 svn,
which should be released as 0.7.18 soonish.

-- 
: Lars Ellenberg                                  Tel +43-1-8178292-0  :
: LINBIT Information Technologies GmbH            Fax +43-1-8178292-82 :
: Schoenbrunner Str. 244, A-1120 Vienna/Europe   http://www.linbit.com :
__
please use the "List-Reply" function of your email client.