Comments inline.

On Thu, 2006-09-28 at 09:47 +0200, KarinMiers wrote:
> Hi all,
>
> we use drbd (different versions 0.6 to 0.7) for several services and
> never observed any real problems until now - I set up drbd for our web
> server and the primary node died during the first sync without any error
> messages.
>
> Both systems run SuSE 10.0, Kernel 2.6.13-15.10-smp, drbd drbd-0.7.13-2.
> The drbd is on a raid5 (ICP raid adapter GDT8623RZ), ext3 file system.
>
> Did anybody observe a similar behaviour? Is there any known bug to that
> combination of distribution, kernel and drbd which I did not recognize?
> Or could it be a hardware problem? Any hints are welcome... Details of
> the setup and logs are at the bottom of this mail.
>
> Bye,
>
> Karin
>
> --
> Dr. Karin A. Miers Tel.: 06159-71-1334
> Abtlg. IT E-Mail: K.Miers at gsi.de
>
> GSI mbH
> Planckstraße 1
> 64291 Darmstadt
> Tel.: 0049 - (0)6159 - 71-0
> --
>
>
> Set up was done the first time by commands using default values:
>
> On node 1:
>
> modprobe drbd
> drbdsetup /dev/drbd0 disk /dev/sdb2 /dev/sdb1 0
> drbdsetup /dev/drbd0 net 10.0.0.1 10.0.0.2 C
> drbdsetup /dev/drbd0 primary
>
> On node 2 more or less the same:
>
> modprobe drbd
> drbdsetup /dev/drbd0 disk /dev/sdb2 /dev/sdb1 0
> drbdsetup /dev/drbd0 net 10.0.0.2 10.0.0.1 C
>
> After that, the sync starts as expected. /proc/drbd looks fine on both
> nodes and drbdsetup /dev/drbd0 state/show too. But after some minutes (7
> to 12 minutes, not reproducible time, not after a certain amount of
> sync) node 1 is completely dead - just as if it is switched off. node 2
> notices that the other node is dead but apart from this it continues to
> run as usual.
>
> I only tried it twice because node 1 is a production system and should
> not break down too often :-))
>
> First time node 1 stopped after 7 minutes, sync rate was 250 KB/s (default).
>
> On the second try it stopped after approx. 12 minutes, sync rate was 10000
> KB/s.
> I increased it because it looked as if it would work.
>
> That is what the log says:
>
> Node 1
>
> Sep 27 14:37:23 nodea kernel: drbd0: drbdsetup [13475]: cstate
> Unconfigured --> Unconnected
> Sep 27 14:37:23 nodea kernel: drbd0: drbd0_receiver [13477]: cstate
> Unconnected --> WFConnection
> Sep 27 14:43:48 nodea kernel: drbd0: drbdsetup [13849]: cstate
> WFConnection --> Unconnected
> Sep 27 14:43:48 nodea kernel: drbd0: worker terminated
> Sep 27 14:43:48 nodea kernel: drbd0: drbd0_receiver [13477]: cstate
> Unconnected --> Unconfigured
> Sep 27 14:43:48 nodea kernel: drbd0: Connection lost.
> Sep 27 14:43:48 nodea kernel: drbd0: Discarding network configuration.
> Sep 27 14:43:48 nodea kernel: drbd0: drbd0_receiver [13477]: cstate
> Unconfigured --> StandAlone
> Sep 27 14:43:48 nodea kernel: drbd0: receiver terminated
> Sep 27 14:43:48 nodea kernel: drbd0: drbdsetup [13849]: cstate
> StandAlone --> Unconfigured
> Sep 27 14:44:07 nodea kernel: drbd0: resync bitmap: bits=107478867
> words=3358716
> Sep 27 14:44:07 nodea kernel: drbd0: size = 409 GB (429915465 KB)
> Sep 27 14:44:11 nodea kernel: drbd0: 409 GB marked out-of-sync by on
> disk bit-map.
> Sep 27 14:44:11 nodea kernel: drbd0: Found 4 transactions (136 active
> extents) in activity log.
> Sep 27 14:44:11 nodea kernel: drbd0: Marked additional 2048 KB as
> out-of-sync based on AL.
> Sep 27 14:44:11 nodea kernel: drbd0: drbdsetup [13851]: cstate
> Unconfigured --> StandAlone
> Sep 27 14:44:23 nodea kernel: drbd0: drbdsetup [13853]: cstate
> StandAlone --> Unconnected
> Sep 27 14:44:23 nodea kernel: drbd0: drbd0_receiver [13854]: cstate
> Unconnected --> WFConnection
> Sep 27 14:44:35 nodea kernel: drbd0: Secondary/Unknown --> Primary/Unknown
> Sep 27 14:45:08 nodea kernel: kjournald starting. Commit interval 5 seconds
> Sep 27 14:45:08 nodea kernel: EXT3 FS on drbd0, internal journal
> Sep 27 14:45:08 nodea kernel: EXT3-fs: mounted filesystem with ordered
> data mode.
> Sep 27 14:48:00 nodea kernel: drbd0: drbd0_receiver [13854]: cstate
> WFConnection --> WFReportParams
> Sep 27 14:48:00 nodea kernel: drbd0: Handshake successful: DRBD Network
> Protocol version 74
> Sep 27 14:48:00 nodea kernel: drbd0: Connection established.
> Sep 27 14:48:00 nodea kernel: drbd0: I am(P):
> 1:00000002:00000001:00000004:00000002:10
> Sep 27 14:48:00 nodea kernel: drbd0: Peer(S):
> 0:00000002:00000001:00000004:00000001:01
> Sep 27 14:48:00 nodea kernel: drbd0: drbd0_receiver [13854]: cstate
> WFReportParams --> WFBitMapS
> Sep 27 14:48:02 nodea kernel: drbd0: Primary/Unknown --> Primary/Secondary
> Sep 27 14:48:03 nodea kernel: drbd0: drbd0_receiver [13854]: cstate
> WFBitMapS --> SyncSource
> Sep 27 14:48:03 nodea kernel: drbd0: Resync started as SyncSource (need
> to sync 429834788 KB [107458697 bits set]).
> ...
> Sep 27 14:55:01 nodea /usr/sbin/cron[14220]: (root) CMD
> (/Daten/web-procs/temp_aufraeumen.pl)
>
> That is the last entry - after that the system is dead.
>
> Node 2:
>
> Sep 27 14:47:39 nodeb kernel: drbd0: resync bitmap: bits=107478867
> words=3358716
> Sep 27 14:47:39 nodeb kernel: drbd0: size = 409 GB (429915465 KB)
> Sep 27 14:47:44 nodeb kernel: drbd0: 409 GB marked out-of-sync by on
> disk bit-map.
> Sep 27 14:47:44 nodeb kernel: drbd0: No usable activity log found.
> Sep 27 14:47:44 nodeb kernel: drbd0: drbdsetup [8932]: cstate
> Unconfigured --> StandAlone
> Sep 27 14:48:00 nodeb kernel: drbd0: drbdsetup [8934]: cstate StandAlone
> --> Unconnected
> Sep 27 14:48:00 nodeb kernel: drbd0: drbd0_receiver [8935]: cstate
> Unconnected --> WFConnection
> Sep 27 14:48:00 nodeb kernel: drbd0: drbd0_receiver [8935]: cstate
> WFConnection --> WFReportParams
> Sep 27 14:48:00 nodeb kernel: drbd0: Handshake successful: DRBD Network
> Protocol version 74
> Sep 27 14:48:00 nodeb kernel: drbd0: Connection established.
> Sep 27 14:48:00 nodeb kernel: drbd0: I am(S):
> 0:00000002:00000001:00000004:00000001:01
> Sep 27 14:48:00 nodeb kernel: drbd0: Peer(P):
> 1:00000002:00000001:00000004:00000002:10
> Sep 27 14:48:00 nodeb kernel: drbd0: drbd0_receiver [8935]: cstate
> WFReportParams --> WFBitMapT
> Sep 27 14:48:00 nodeb kernel: drbd0: Secondary/Unknown --> Secondary/Primary
> Sep 27 14:48:03 nodeb kernel: drbd0: drbd0_receiver [8935]: cstate
> WFBitMapT --> SyncTarget
> Sep 27 14:48:03 nodeb kernel: drbd0: Resync started as SyncTarget (need
> to sync 429834788 KB [107458697 bits set]).
> Sep 27 15:00:22 nodeb kernel: drbd0: PingAck did not arrive in time.

I think this is what is causing the problem: the PingAck did not arrive
within the time allowed by the configuration. Try setting a larger
"ping-int".

> Sep 27 15:00:22 nodeb kernel: drbd0: drbd0_asender [8936]: cstate
> SyncTarget --> NetworkFailure
> Sep 27 15:00:22 nodeb kernel: drbd0: asender terminated
> Sep 27 15:00:22 nodeb kernel: drbd0: drbd0_receiver [8935]: cstate
> NetworkFailure --> BrokenPipe
> Sep 27 15:00:22 nodeb kernel: drbd0: short read receiving data block:
> read 2872 expected 4096
> Sep 27 15:00:22 nodeb kernel: drbd0: error receiving RSDataReply, l: 4112!
> Sep 27 15:00:22 nodeb kernel: drbd0: worker terminated
> Sep 27 15:00:22 nodeb kernel: drbd0: drbd0_receiver [8935]: cstate
> BrokenPipe --> Unconnected
> Sep 27 15:00:22 nodeb kernel: drbd0: Connection lost.
> Sep 27 15:00:22 nodeb kernel: drbd0: drbd0_receiver [8935]: cstate
> Unconnected --> WFConnection
>
>
> _______________________________________________
> drbd-user mailing list
> drbd-user at lists.linbit.com
> http://lists.linbit.com/mailman/listinfo/drbd-user
>
>

--
Milind
"The world is divided into one group: those who start counting at 0, and those who don't."