Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
/ 2006-07-14 16:24:56 +1000
\ Bradley Baetz:
> Hi,
>
> I've run into a problem with DRBD where it gets stuck.
>
> We have two machines, doing the DRBD replication via a crossover cable
> on eth1. Both machines started off running drbd 0.7.20, heartbeat 1.2.3,
> and the RHEL4 kernel 2.6.9-34.0.1.ELsmp. The kernel is patched to
> disable NAPI on the e1000 driver.
>
> Both are single CPU boxes, but with hyperthreading enabled. DRBD has
> ext3 running on top of it, being used for a postgres database.
>
> The plan was to upgrade to 2.6.9-34.0.2.ELsmp. I rebuilt DRBD for the
> new kernel, and rebooted the standby box. That came up fine:
> I then went to do hb_standby on the current primary. This started to
> shut down postgresql, but then stopped - postgres and kjournald were
> stuck in D state. I tried to do a reboot, but that just caused the
> reboot and shutdown processes to end up in D state too. sysrq-T (via
> /proc/sysrq-trigger) has: ...
> with postgres doing: ...
> There weren't any messages from DRBD on either box, and it was in
> Secondary/Primary state on the secondary.
>
> I then did a reboot -f. When the primary came back up, it started to
> resync (heartbeat wasn't running on the secondary, so there was no
> takeover attempt while it was rebooting):
>
> Jul 14 11:04:33 dbtools01 kernel: drbd0: drbd0_receiver [1975]: cstate WFConnection --> WFReportParams
> Jul 14 11:04:33 dbtools01 kernel: drbd0: Handshake successful: DRBD Network Protocol version 74
> Jul 14 11:04:33 dbtools01 kernel: drbd0: Connection established.
> Jul 14 11:04:33 dbtools01 kernel: drbd0: I am(S): 1:00000004:00000001:0000001d:00000002:10
> Jul 14 11:04:33 dbtools01 kernel: drbd0: Peer(S): 1:00000004:00000001:0000001d:00000002:01
> Jul 14 11:04:33 dbtools01 kernel: drbd0: drbd0_receiver [1975]: cstate WFReportParams --> WFBitMapS
> Jul 14 11:04:33 dbtools01 drbd: WARN: stdin/stdout is not a TTY; using /dev/console
> Jul 14 11:04:33 dbtools01 rc: Starting drbd: succeeded
> Jul 14 11:04:33 dbtools01 kernel: drbd0: Secondary/Unknown --> Secondary/Secondary
> Jul 14 11:04:33 dbtools01 kernel: drbd0: drbd0_receiver [1975]: cstate WFBitMapS --> SyncSource
> Jul 14 11:04:33 dbtools01 kernel: drbd0: Resync started as SyncSource (need to sync 802816 KB [200704 bits set]).
> Jul 14 11:05:06 dbtools01 kernel: drbd0: Secondary/Secondary --> Primary/Secondary
>
> but then showed the transfer as 'stalled' on both ends:
>
> version: 0.7.20 (api:79/proto:74)
> SVN Revision: 2260 build by root@build03.syd.optusnet.com.au, 2006-07-14 10:39:06
>  0: cs:SyncSource st:Primary/Secondary ld:Consistent
>     ns:98444 nr:0 dw:6644 dr:115025 al:6 bm:214 lo:0 pe:0 ua:0 ap:0
>         [>...................] sync'ed:  0.6% (710976/710976)K
>         stalled
>
> Jul 14 11:05:13 dbtools01 kernel: drbd0: [mount/2726] sock_sendmsg time expired, ko = 4294967295
> ....
> Jul 14 11:09:28 dbtools01 kernel: drbd0: [mount/2726] sock_sendmsg time expired, ko = 4294967211

this I have seen myself once, but was not able to reproduce.
that time it was possible to get out of it by just doing
"drbdadm connect all".

to keep the primary from getting stuck: configure ko-count. though it
would then go StandAlone, because it would suspect the io-subsystem of
the peer to be broken; that would be wrong in this case, but the
symptoms would be cured anyway.
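for what it's worth, the "ko = 4294967295" in the log above is the
unsigned 32-bit form of -1: the countdown apparently starts from
"unlimited" when ko-count is left unset, so the primary retries
essentially forever. a minimal sketch of setting it in drbd.conf for a
0.7-style resource; the resource name r0 and the values are only
illustrative, not taken from this cluster's actual config:

    resource r0 {
      net {
        timeout   60;   # unit is 0.1 seconds, so 6.0s (the 0.7 default)
        ko-count  4;    # after 4 consecutive missed timeouts, give up on
                        # the peer and go StandAlone instead of retrying
                        # (and blocking) forever
      }
    }

after changing the file on both nodes, "drbdadm adjust all" should
apply the new net settings to the running resource.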
> postgres wasn't starting on the primary yet, either.
> sysrq-T shows the only D state thing being pdflush:
>
> Jul 14 11:09:34 dbtools01 kernel: pdflush       D F7D8C680  2752    47      6    49    46 (L-TLB)
> Jul 14 11:09:34 dbtools01 kernel: f7cf2c58 00000046 f6337d80 f7d8c680 f6337d80 c022a224 f7d8c680 f7ce0b68
> Jul 14 11:09:34 dbtools01 kernel: f6337d80 c022a79d f633522c c1816de0 00000001 000336a4 94d89130 00000024
> Jul 14 11:09:34 dbtools01 kernel: f7e110b0 f7cc4130 f7cc429c c1bd965c 00000001 c1bd9284 00000246 c1bd928c
> Jul 14 11:09:34 dbtools01 kernel: Call Trace:
> Jul 14 11:09:34 dbtools01 kernel:  [<c022a224>] cfq_add_crq_rb+0x3b/0x4d

it hangs somewhere in the cfq io-scheduler, not in drbd. but the
scheduler may be waiting for some completion event from drbd. so maybe
using a different io-scheduler, say deadline, helps. maybe not.
(a sketch of how to switch the scheduler is appended at the end of
this message.)

> I stopped/started drbd on the secondary, and the primary logged:

more-or-less normal...

> But it still got stuck as 'stalled'. I left it for an hour or so, did
> another stop/start, and that didn't help.

the receiver on the secondary side is stuck somewhere. It would be
interesting to know where. Once in this state, drbd would only get out
of it if you configure ko-count to some smallish positive number, even
though the intended usage for that option was something different.

> I then stopped it on the secondary, and shut down drbd on the primary
> too, for a cold start:
> and then started DRBD again on the primary:
> and then brought up heartbeat on the primary, and drbd on the
> secondary, and it all synced fine:
>
> I've seen this before a few times when doing upgrades/reboots, on
> multiple different DRBD pairs. However, that was running older
> versions of DRBD, and from the release notes for the last few versions
> I thought that this was the situation mentioned there, and so would be
> fixed. However, I can't get a reproducible test case. Does the above
> give enough information to work out what the problem is?

hm. we'll have to read it through a few more times; maybe it triggers
some vague ideas, which we could then verify in the source code.
but no, nothing obvious yet.

> This is our test cluster, so I'm happy to try patches/etc.

sorry :(
as soon as we have them, though.

> I'm also going to see if I can reproduce this.

that would be great.
thank you for this detailed report.

--
: Lars Ellenberg                              Tel +43-1-8178292-0  :
: LINBIT Information Technologies GmbH        Fax +43-1-8178292-82 :
: Schoenbrunner Str. 244, A-1120 Vienna/Europe  http://www.linbit.com :

__
please use the "List-Reply" function of your email client.
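the io-scheduler switch mentioned above, sketched out; this assumes the
stock RHEL4 2.6.9 kernel and /dev/sda as the backing device (both the
device name and the grub kernel line are illustrative, not from the
reporter's setup):

    # 2.6.9 picks the elevator at boot time; append elevator=deadline
    # to the existing kernel line in /boot/grub/grub.conf, e.g.:
    kernel /vmlinuz-2.6.9-34.0.2.ELsmp ro root=LABEL=/ elevator=deadline

    # kernels with per-queue runtime switching (mainline 2.6.10 and
    # later; possibly backported into later RHEL4 updates) also allow:
    echo deadline > /sys/block/sda/queue/scheduler
    cat /sys/block/sda/queue/scheduler   # brackets mark the active one

note that the boot parameter changes the default for all block devices,
while the sysfs file, where available, affects only the one queue.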