> OK. Your Primary has a pending count ("pe") > 0, which means it is
> waiting for the Secondary to complete stuff it is currently working on.
> So this looks like you're having issues on your Secondary. Please
> provide that ps and /proc/drbd output for your Secondary, just like you
> did for your Primary.

So when the primary looks like this:

version: 8.2.5 (api:88/proto:86-88)
GIT-hash: 9faf052fdae5ef0c61b4d03890e2d2eab550610c build by buildsvn at c5-x8664-build, 2008-03-09 10:16:01
 0: cs:Connected st:Primary/Secondary ds:UpToDate/UpToDate C r---
    ns:384 nr:440 dw:824 dr:3729 al:1 bm:5 lo:0 pe:2 ua:0 ap:1
        resync: used:0/31 hits:49 misses:5 starving:0 dirty:0 changed:5
        act_log: used:1/127 hits:95 misses:1 starving:0 dirty:0 changed:1
 1: cs:Connected st:Secondary/Primary ds:UpToDate/UpToDate C r---
    ns:0 nr:21008 dw:21008 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0
        resync: used:0/31 hits:0 misses:0 starving:0 dirty:0 changed:0
        act_log: used:0/127 hits:0 misses:0 starving:0 dirty:0 changed:0

The secondary looks like this:

version: 8.2.5 (api:88/proto:86-88)
GIT-hash: 9faf052fdae5ef0c61b4d03890e2d2eab550610c build by buildsvn at c5-x8664-build, 2008-03-09 10:16:01
 0: cs:Connected st:Secondary/Primary ds:UpToDate/UpToDate C r---
    ns:42272 nr:32984 dw:38988 dr:45742 al:1 bm:52 lo:1 pe:0 ua:0 ap:0
        resync: used:0/31 hits:2478 misses:28 starving:0 dirty:0 changed:28
        act_log: used:0/127 hits:1500 misses:1 starving:0 dirty:0 changed:1
 1: cs:Connected st:Primary/Secondary ds:UpToDate/UpToDate C r---
    ns:145396 nr:12 dw:145728 dr:69130 al:1 bm:131 lo:0 pe:0 ua:0 ap:0
        resync: used:0/31 hits:635 misses:9 starving:0 dirty:0 changed:9
        act_log: used:0/127 hits:36428 misses:1 starving:0 dirty:0 changed:1

(Here we're looking at drbd0, so when I say "primary" I mean "the system that is primary for drbd0", and similarly for "secondary".)

The two systems are passing network traffic back and forth while kjournald is hung.
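The pending-count reasoning in the quoted reply can be checked mechanically against dumps like the ones above. Below is a minimal Python sketch, assuming the DRBD 8.2-style /proc/drbd layout shown in this message (later DRBD releases format the file differently). The helper names parse_proc_drbd and stalled_devices are hypothetical and not part of any DRBD tool. Per the DRBD documentation, pe counts requests sent to the peer but not yet answered, lo counts requests open against the local disk, ua counts requests received from the peer but not yet acknowledged, and ap counts application requests DRBD has not yet completed.

```python
import re

# Hypothetical helpers, not part of DRBD. They assume the DRBD 8.2-style
# /proc/drbd layout: a " N: cs:... st:..." line per minor, followed by a
# counter line (ns/nr/dw/dr/al/bm/lo/pe/ua/ap) and resync/act_log lines.
DEVICE_RE = re.compile(r"^\s*(\d+):\s+cs:(\S+)\s+st:(\S+)")
COUNTER_RE = re.compile(r"\b(ns|nr|dw|dr|al|bm|lo|pe|ua|ap):(\d+)\b")

# Counters that indicate requests still in flight.
IN_FLIGHT = ("lo", "pe", "ua", "ap")


def parse_proc_drbd(text):
    """Return {minor: {"cs": ..., "st": ..., counter: int, ...}}."""
    devices = {}
    current = None
    for line in text.splitlines():
        m = DEVICE_RE.match(line)
        if m:
            current = int(m.group(1))
            devices[current] = {"cs": m.group(2), "st": m.group(3)}
            continue
        if current is not None:
            # Only the counter line contains these names; resync and
            # act_log lines use other field names and are skipped.
            for name, value in COUNTER_RE.findall(line):
                devices[current][name] = int(value)
    return devices


def stalled_devices(devices):
    """Return only minors with a nonzero in-flight counter, and only those counters."""
    return {
        minor: {k: d[k] for k in IN_FLIGHT if d.get(k)}
        for minor, d in devices.items()
        if any(d.get(k) for k in IN_FLIGHT)
    }


# Abridged copy of the primary's dump quoted above.
PRIMARY_DRBD = """\
version: 8.2.5 (api:88/proto:86-88)
 0: cs:Connected st:Primary/Secondary ds:UpToDate/UpToDate C r---
    ns:384 nr:440 dw:824 dr:3729 al:1 bm:5 lo:0 pe:2 ua:0 ap:1
        resync: used:0/31 hits:49 misses:5 starving:0 dirty:0 changed:5
        act_log: used:1/127 hits:95 misses:1 starving:0 dirty:0 changed:1
 1: cs:Connected st:Secondary/Primary ds:UpToDate/UpToDate C r---
    ns:0 nr:21008 dw:21008 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0
"""

if __name__ == "__main__":
    print(stalled_devices(parse_proc_drbd(PRIMARY_DRBD)))
```

On a live node one would pass the contents of /proc/drbd instead of the embedded sample. Fed the secondary's dump above, the same helper would report {0: {"lo": 1}}.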
I took a short packet trace, which is available here:

http://people.seas.harvard.edu/~lars/drbd/packets

On the primary, kjournald is stuck in the D state in sync_buffer. On the secondary, there don't appear to be any similarly stuck processes. There are no obvious errors in syslog on either system. There are entries like this on the secondary:

drbd1: [drbd1_worker/2321] sock_sendmsg time expired, ko = 4294967255

But these appear to correspond to me shutting down the primary, rather than to the beginning of the situation.

--
Lars
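The ko value in the drbd1 syslog line sits just below 2^32 (4294967296). As a quick arithmetic check, here is a Python sketch that reinterprets it as a two's-complement 32-bit integer, which gives -41. This assumes, without having checked the DRBD source, that the counter is a 32-bit value printed with an unsigned format; the helper as_signed32 is hypothetical.

```python
def as_signed32(value):
    """Reinterpret an unsigned 32-bit integer as two's-complement signed.

    Assumption (unverified against DRBD source): the logged ko value is a
    32-bit counter printed with an unsigned format.
    """
    if not 0 <= value < 1 << 32:
        raise ValueError("value does not fit in 32 bits")
    # Values with the top bit set map to negative numbers.
    return value - (1 << 32) if value >= 1 << 31 else value


KO_LOGGED = 4294967255  # from the drbd1_worker syslog line above
print(as_signed32(KO_LOGGED))  # -41
```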