> > so I have to try and figure why the asender died in the first place.

ok, with the necessary context, this was easy.

starting from scratch:

> Aug 4 10:39:16 test2 kernel: drbd: initialised. Version: 0.7.11 (api:77/proto:74)
> Aug 4 10:39:16 test2 kernel: drbd: SVN Revision: 1799 build by root at test2.zmnh.uni-hamburg.de, 2005-07-14 13:14:36
> Aug 4 10:39:16 test2 kernel: drbd: registered as block device major 147
> Aug 4 10:39:17 test2 kernel: drbd0: resync bitmap: bits=107067201 words=3345852
> Aug 4 10:39:17 test2 kernel: drbd0: size = 408 GB (428268802 KB)
...
> Aug 4 10:39:20 test2 kernel: drbd0: Secondary/Unknown --> Secondary/Primary
...
> Aug 4 10:40:58 test2 kernel: drbd0: Resync done (total 97 sec; paused 0 sec; 5900 K/sec)
> Aug 4 10:40:58 test2 kernel: drbd0: drbd0_worker [8887]: cstate SyncTarget --> Connected

here the operator decides to invalidate this box.

> Aug 4 11:34:41 test2 kernel: drbd0: Primary/Secondary --> Secondary/Secondary
> Aug 4 11:34:54 test2 kernel: drbd0: drbdsetup [9660]: cstate Connected --> WFBitMapT
> Aug 4 11:34:55 test2 kernel: drbd0: 428268804 KB now marked out-of-sync by on disk bit-map.
> Aug 4 11:34:55 test2 kernel: drbd0: drbdsetup [9660]: cstate WFBitMapT --> SyncTarget
> Aug 4 11:34:55 test2 kernel: drbd0: Resync started as SyncTarget (need to sync 428268804 KB [107067201 bits set]).

note now, on the other box, at the same time, the operator decides to
invalidate that box too. and this is a race in our "state engine":

> Aug 4 11:35:04 test1 kernel: drbd0: drbdsetup [11431]: cstate Connected --> WFBitMapT

operator says: your data is bad. go get some from the peer.

> Aug 4 11:36:16 test1 kernel: drbd0: 428268804 KB now marked out-of-sync by on disk bit-map.
> Aug 4 11:36:17 test1 kernel: drbd0: drbd0_receiver [5420]: cstate WFBitMapT --> SyncSource

peer says: my data is bad, please give me yours.

> Aug 4 11:36:17 test1 kernel: drbd0: Resync started as SyncSource (need to sync 428268804 KB [107067201 bits set]).

so ok, we become SyncSource, because the peer wants our data.

> Aug 4 11:37:05 test1 kernel: drbd0: 428072160 KB now marked out-of-sync by on disk bit-map.
> Aug 4 11:37:05 test1 kernel: drbd0: drbdsetup [11431]: cstate SyncSource --> SyncTarget

operator (still the same ioctl) says: hey, I told you your data is bad!

ok, we become sync target (i.e. inconsistent, no good local data nowhere
etc.), because the operator tells us so.

meanwhile, the peer already asked us for our data.
now, since the operator told us our data is bad, we answer
"sorry, I don't know either".

peer panics.
or, well, would have panicked, if it had not tried to dereference the
syncer block id (-1ULL) referenced in that "NegDReply" as a request
pointer :->

ok, I reduced the race, and put in some other paranoia checks, so next
time this happens, it really panics, with a more useful message
("WE ARE LOST") ...

all in current drbd-0.7 svn

thanks,

-- 
: Lars Ellenberg                            Tel +43-1-8178292-0  :
: LINBIT Information Technologies GmbH      Fax +43-1-8178292-82 :
: Schoenbrunner Str. 244, A-1120 Vienna/Europe http://www.linbit.com :
__
please use the "List-Reply" function of your email client.
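
The interleaving described in the message above can be replayed with a small toy model. This is only a sketch for illustration and not DRBD code: the `Node` class, its method names and the `replay()` helper are all hypothetical, and the only things taken from the post are the cstate names, the order of transitions seen in the logs, and the "NegDReply" answer. It does not attempt to show the fix that went into svn.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for one DRBD node; tracks only the cstate."""
    name: str
    cstate: str = "Connected"
    log: list = field(default_factory=list)

    def _set(self, new, who):
        # Record the transition in the same shape as the kernel log lines.
        self.log.append(f"{who}: cstate {self.cstate} --> {new}")
        self.cstate = new

    def invalidate_begin(self):
        # Operator ioctl starts: "your data is bad, go get some from the peer".
        self._set("WFBitMapT", "drbdsetup")

    def peer_wants_our_data(self):
        # Receiver thread: "my data is bad, please give me yours".
        self._set("SyncSource", "receiver")

    def invalidate_finish(self):
        # The same ioctl completes and, in this model, does not look at
        # what happened to the cstate in the meantime.
        self._set("SyncTarget", "drbdsetup")

    def answer_read(self):
        # A sync target has no good local data to hand out.
        return "NegDReply" if self.cstate == "SyncTarget" else "DataReply"


def replay():
    """Replay the order of events from the logs of test2 and test1."""
    test2, test1 = Node("test2"), Node("test1")
    # test2 is invalidated first and becomes SyncTarget.
    test2.invalidate_begin()
    test2.invalidate_finish()
    # test1 is invalidated while test2's request is on its way.
    test1.invalidate_begin()
    test1.peer_wants_our_data()   # WFBitMapT --> SyncSource
    test1.invalidate_finish()     # SyncSource --> SyncTarget
    # test2 now asks test1 for data that test1 no longer claims to have.
    return test1, test2, test1.answer_read()


if __name__ == "__main__":
    t1, t2, reply = replay()
    for line in t1.log:
        print(t1.name, line)
    print("test1 answers test2 with:", reply)
```

In this model both nodes end up as SyncTarget, so neither side has data it is willing to call good, which is the situation the post describes.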