/ 2006-04-25 21:48:12 +0200
\ Cyril Bouthors:
> Same thing happened tonight, here is more information, hope it helps:
>
> root at ns1:~# ps auxwf | grep drbd
> root 879 0.1 0.0 0 0 ? D Apr18 17:40 [drbd0_receiver]
> root 22526 0.0 0.0 1432 444 pts/5 S+ 21:39 0:00 | \_ grep drbd
> root at ns1:~# w
> 21:40:20 up 7 days, 8:11, 4 users, load average: 194.27, 187.96, 158.50
> (...)
> root at ns1:~# dmesg
> (...)
> nfs: server 10.0.9.254 OK
> nfs: server 10.0.9.254 not responding, still trying
> nfs: server 10.0.9.254 OK
> nfs: server 10.0.9.254 not responding, still trying
> nfs: server 10.0.9.254 OK
> drbd0: [kjournald/1180] sock_sendmsg time expired, ko = 4294967295
> drbd0: [kupdated/6] sock_sendmsg time expired, ko = 4294967295
> drbd0: [kupdated/6] sock_sendmsg time expired, ko = 4294967295
> drbd0: [kupdated/6] sock_sendmsg time expired, ko = 4294967294
> drbd0: [kupdated/6] sock_sendmsg time expired, ko = 4294967293
> drbd0: [kupdated/6] sock_sendmsg time expired, ko = 4294967295
> drbd0: [kjournald/1180] sock_sendmsg time expired, ko = 4294967295
> drbd0: [kjournald/1180] sock_sendmsg time expired, ko = 4294967294
> drbd0: [kjournald/1180] sock_sendmsg time expired, ko = 4294967293
> drbd0: [kjournald/1180] sock_sendmsg time expired, ko = 4294967292
> drbd0: [kjournald/1180] sock_sendmsg time expired, ko = 4294967295
> drbd0: [kjournald/1180] sock_sendmsg time expired, ko = 4294967294
> drbd0: PingAck did not arrive in time.
> drbd0: drbd0_asender [889]: cstate Connected --> NetworkFailure
> drbd0: asender terminated
> drbd0: kjournald [1180]: cstate NetworkFailure --> Timeout
> drbd0: drbd0_receiver [879]: cstate Timeout --> BrokenPipe
> drbd0: short read expecting header on sock: r=-512
> drbd0: short sent UnplugRemote size=8 sent=-1001
> drbd0: worker terminated
> root at ns1:~# ps auxwf | grep kupd
> root 6 0.0 0.0 0 0 ? D Apr18 1:43 [kupdated]
> root at ns1:~# cat /proc/drbd
> version: 0.7.15 (api:77/proto:74)
> SVN Revision: 2020 build by root at sqlb1, 2006-01-12 06:14:29
> 0: cs:BrokenPipe st:Primary/Unknown ld:Consistent
> ns:521240 nr:0 dw:237755488 dr:49577617 al:937152 bm:204 lo:3 pe:3 ua:0 ap:3
                                                              ^    ^         ^
> root at ns1:~# drbdadm disconnect all
> Child process does not terminate!
> Exiting.
> root at ns1:~# dmesg
> (...)
> drbd0: worker terminated
> root at ns1:~# cat /proc/drbd
> version: 0.7.15 (api:77/proto:74)
> SVN Revision: 2020 build by root at sqlb1, 2006-01-12 06:14:29
> 0: cs:BrokenPipe st:Primary/Unknown ld:Consistent
> ns:521240 nr:0 dw:237758456 dr:49577617 al:937177 bm:229 lo:3 pe:3 ua:0 ap:3
> root at ns1:~# df -h /drbd
> Filesystem Size Used Avail Use% Mounted on
> /dev/drbd0 9.5G 7.6G 2.0G 80% /drbd
> root at ns1:~# ls /drbd
> etc lost+found root usr var
> root at ns1:~# touch /drbd/foo
> (this hangs....)
> root at ns1:~# mount
> /dev/hda4 on / type xfs (rw,noatime)
> proc on /proc type proc (rw)
> devpts on /dev/pts type devpts (rw,gid=5,mode=620)
> tmpfs on /dev/shm type tmpfs (rw)
> /dev/hda1 on /boot type xfs (rw)
> 10.0.9.254:/drbd/webalizer on /mnt type nfs (rw,addr=10.0.9.254)
> /dev/drbd0 on /drbd type ext3 (rw,nosuid,nodev,noatime)
> root at ns1:~# ps -axo pid,wchan=WIDE-WCHAN-COLUMN -o comm
>   PID WIDE-WCHAN-COLUMN COMMAND
>     6 down              kupdated
>   879 ?                 drbd0_receiver
>  1180 wait_on_buffer    kjournald
>  2527 down              drbdsetup

my preliminary analysis is:
the receiver is stuck in tl_clear, which does a kmalloc(,GFP_KERNEL).
we fixed that in svn to GFP_NOIO already...
even though we ask only for a few bytes, GFP_KERNEL waits for memory
and can trigger fs and block level io, and under certain memory and io
pressure that may end up blocking on itself: the vm may choose to kick
kjournald, but that kjournald is itself blocked on drbd, because it
sits in wait_on_buffer() on some of the buffers that would have their
corresponding end_io called right after tl_clear had got that memory...

what I am saying is that it looks like
drbd0_receiver waits for memory,
which waits for kupdated,
which waits for kjournald to flush some buffers to disk,
which waits for drbd0_receiver to finish those buffers it is waiting on.

drbdsetup (from drbdadm disconnect all) is probably waiting for the
super block to flush the local buffers, so in the end it is waiting for
the drbd0_receiver, too.

all the other processes in 'D' state are probably waiting for the
kjournald to finish its transaction...

if you can "reproduce" this scenario, please try with current drbd-0.7
svn, which should be released as 0.7.18 soonish.

--
: Lars Ellenberg                                  Tel +43-1-8178292-0  :
: LINBIT Information Technologies GmbH            Fax +43-1-8178292-82 :
: Schoenbrunner Str. 244, A-1120 Vienna/Europe   http://www.linbit.com :
__
please use the "List-Reply" function of your email client.
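
for illustration only, the wait chain from the analysis above can be written down as a small wait-for graph. this is a sketch in Python, not drbd or kernel code: the WAITS_FOR table, the find_cycle helper and the node names are made up for the example, and the drbdsetup edge is simplified to point straight at the receiver. following the edges from any blocked task ends in a loop, which is the deadlock. dropping the "memory waits for kupdated" edge (roughly what GFP_NOIO achieves, since the allocation may no longer start io) breaks the loop.

```python
# Hypothetical model of the wait chain described in the analysis.
# Each key is blocked until the thing it maps to makes progress.
WAITS_FOR = {
    "drbd0_receiver": "memory (kmalloc GFP_KERNEL)",
    "memory (kmalloc GFP_KERNEL)": "kupdated",
    "kupdated": "kjournald",
    "kjournald": "drbd0_receiver",
    "drbdsetup": "drbd0_receiver",  # simplified: really via the super block flush
}


def find_cycle(waits_for, start):
    """Follow the wait chain from `start`; return the cycle it runs into, or []."""
    seen = []
    node = start
    while node in waits_for:
        if node in seen:
            # We came back to a node already on the chain: that tail is the cycle.
            return seen[seen.index(node):]
        seen.append(node)
        node = waits_for[node]
    return []


if __name__ == "__main__":
    print(" -> ".join(find_cycle(WAITS_FOR, "drbdsetup")))

    # Assumed effect of GFP_NOIO: the allocation no longer waits on writeback.
    no_io = dict(WAITS_FOR)
    del no_io["memory (kmalloc GFP_KERNEL)"]
    print(find_cycle(no_io, "drbdsetup"))
```

starting from drbdsetup, the first call reports the loop receiver, memory, kupdated, kjournald; with the edge removed the chain simply ends and no cycle is found.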