[DRBD-user] Primary dies during sync

Milind Dumbare milind at linsyssoft.com
Thu Sep 28 10:02:57 CEST 2006


Comments inline.

On Thu, 2006-09-28 at 09:47 +0200, KarinMiers wrote:
> Hi all,
> 
> We use DRBD (various versions from 0.6 to 0.7) for several services and 
> never observed any real problems until now. I set up DRBD for our web 
> server, and the primary node died during the first sync without any error 
> messages.
> 
> Both systems run SuSE 10.0, Kernel 2.6.13-15.10-smp, drbd drbd-0.7.13-2. 
> The drbd is on a raid5 (ICP raid adapter GDT8623RZ), ext3 file system.
> 
> Did anybody observe similar behaviour? Is there any known bug with that 
> combination of distribution, kernel and drbd which I did not recognize? 
> Or could it be a hardware problem? Any hints are welcome... Details of 
> the setup and logs are at the bottom of this mail.
> 
> Bye,
> 
> Karin
> 
> -- 
>    Dr. Karin A. Miers		Tel.: 	06159-71-1334
>    Abtlg. IT			E-Mail:	K.Miers at gsi.de			
> 								
>    GSI mbH 		
>    Planckstraße 1						
>    64291 Darmstadt						
>    Tel.: 0049 - (0)6159 - 71-0	
> --
>  
> 
> The initial setup was done with the following commands, using default values:
> 
> On node 1:
> 
> modprobe drbd
> drbdsetup /dev/drbd0 disk /dev/sdb2 /dev/sdb1 0
> drbdsetup /dev/drbd0 net 10.0.0.1 10.0.0.2 C
> drbdsetup /dev/drbd0 primary
> 
> On node 2 more or less the same:
> 
> modprobe drbd
> drbdsetup /dev/drbd0 disk /dev/sdb2 /dev/sdb1 0
> drbdsetup /dev/drbd0 net 10.0.0.2 10.0.0.1 C
> 
> After that, the sync starts as expected. /proc/drbd looks fine on both 
> nodes, and so does the output of drbdsetup /dev/drbd0 state/show. But 
> after some minutes (7 to 12 minutes; the time is not reproducible and is 
> not tied to a certain amount of synced data) node 1 is completely dead, 
> just as if it had been switched off. Node 2 notices that the other node 
> is dead, but apart from this it continues to run as usual.
> 
> I only tried it twice because node 1 is a production system and should 
> not break down too often :-))
> 
> The first time, node 1 stopped after 7 minutes; the sync rate was 250 KB/s 
> (default).
> 
> On the second try it stopped after approx. 12 minutes; the sync rate was 
> 10000 KB/s. I increased it because it looked as if it would work.
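
For scale, here is a rough estimate of how long a full sync would take at
each of the two rates. It is only a sketch in plain shell arithmetic: the
429834788 KB figure is taken from the "Resync started" log line further
down, and the variable names are made up for illustration.

```shell
#!/bin/sh
# Amount that needs to be synced, from the "Resync started" log line (KB).
TOTAL_KB=429834788
# The two sync rates mentioned above (KB/s).
DEFAULT_RATE=250
RAISED_RATE=10000

# Integer division, so the results are rounded down to whole hours.
DEFAULT_HOURS=$(( TOTAL_KB / DEFAULT_RATE / 3600 ))
RAISED_HOURS=$(( TOTAL_KB / RAISED_RATE / 3600 ))

echo "full sync at ${DEFAULT_RATE} KB/s: about ${DEFAULT_HOURS} hours"
echo "full sync at ${RAISED_RATE} KB/s: about ${RAISED_HOURS} hours"
```

At the default rate a full sync would need roughly 20 days, at the raised
rate about half a day. Either way, both crashes happened long before the
sync could have finished.
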
> 
> That is what the log says:
> 
> Node 1
> 
> Sep 27 14:37:23 nodea kernel: drbd0: drbdsetup [13475]: cstate Unconfigured --> Unconnected
> Sep 27 14:37:23 nodea kernel: drbd0: drbd0_receiver [13477]: cstate Unconnected --> WFConnection
> Sep 27 14:43:48 nodea kernel: drbd0: drbdsetup [13849]: cstate WFConnection --> Unconnected
> Sep 27 14:43:48 nodea kernel: drbd0: worker terminated
> Sep 27 14:43:48 nodea kernel: drbd0: drbd0_receiver [13477]: cstate Unconnected --> Unconfigured
> Sep 27 14:43:48 nodea kernel: drbd0: Connection lost.
> Sep 27 14:43:48 nodea kernel: drbd0: Discarding network configuration.
> Sep 27 14:43:48 nodea kernel: drbd0: drbd0_receiver [13477]: cstate Unconfigured --> StandAlone
> Sep 27 14:43:48 nodea kernel: drbd0: receiver terminated
> Sep 27 14:43:48 nodea kernel: drbd0: drbdsetup [13849]: cstate StandAlone --> Unconfigured
> Sep 27 14:44:07 nodea kernel: drbd0: resync bitmap: bits=107478867 words=3358716
> Sep 27 14:44:07 nodea kernel: drbd0: size = 409 GB (429915465 KB)
> Sep 27 14:44:11 nodea kernel: drbd0: 409 GB marked out-of-sync by on disk bit-map.
> Sep 27 14:44:11 nodea kernel: drbd0: Found 4 transactions (136 active extents) in activity log.
> Sep 27 14:44:11 nodea kernel: drbd0: Marked additional 2048 KB as out-of-sync based on AL.
> Sep 27 14:44:11 nodea kernel: drbd0: drbdsetup [13851]: cstate Unconfigured --> StandAlone
> Sep 27 14:44:23 nodea kernel: drbd0: drbdsetup [13853]: cstate StandAlone --> Unconnected
> Sep 27 14:44:23 nodea kernel: drbd0: drbd0_receiver [13854]: cstate Unconnected --> WFConnection
> Sep 27 14:44:35 nodea kernel: drbd0: Secondary/Unknown --> Primary/Unknown
> Sep 27 14:45:08 nodea kernel: kjournald starting.  Commit interval 5 seconds
> Sep 27 14:45:08 nodea kernel: EXT3 FS on drbd0, internal journal
> Sep 27 14:45:08 nodea kernel: EXT3-fs: mounted filesystem with ordered data mode.
> Sep 27 14:48:00 nodea kernel: drbd0: drbd0_receiver [13854]: cstate WFConnection --> WFReportParams
> Sep 27 14:48:00 nodea kernel: drbd0: Handshake successful: DRBD Network Protocol version 74
> Sep 27 14:48:00 nodea kernel: drbd0: Connection established.
> Sep 27 14:48:00 nodea kernel: drbd0: I am(P): 1:00000002:00000001:00000004:00000002:10
> Sep 27 14:48:00 nodea kernel: drbd0: Peer(S): 0:00000002:00000001:00000004:00000001:01
> Sep 27 14:48:00 nodea kernel: drbd0: drbd0_receiver [13854]: cstate WFReportParams --> WFBitMapS
> Sep 27 14:48:02 nodea kernel: drbd0: Primary/Unknown --> Primary/Secondary
> Sep 27 14:48:03 nodea kernel: drbd0: drbd0_receiver [13854]: cstate WFBitMapS --> SyncSource
> Sep 27 14:48:03 nodea kernel: drbd0: Resync started as SyncSource (need to sync 429834788 KB [107458697 bits set]).
> ...
> Sep 27 14:55:01 nodea /usr/sbin/cron[14220]: (root) CMD (/Daten/web-procs/temp_aufraeumen.pl)
> 
> That is the last entry - after that the system is dead.
> 
> Node 2:
> 
> Sep 27 14:47:39 nodeb kernel: drbd0: resync bitmap: bits=107478867 words=3358716
> Sep 27 14:47:39 nodeb kernel: drbd0: size = 409 GB (429915465 KB)
> Sep 27 14:47:44 nodeb kernel: drbd0: 409 GB marked out-of-sync by on disk bit-map.
> Sep 27 14:47:44 nodeb kernel: drbd0: No usable activity log found.
> Sep 27 14:47:44 nodeb kernel: drbd0: drbdsetup [8932]: cstate Unconfigured --> StandAlone
> Sep 27 14:48:00 nodeb kernel: drbd0: drbdsetup [8934]: cstate StandAlone --> Unconnected
> Sep 27 14:48:00 nodeb kernel: drbd0: drbd0_receiver [8935]: cstate Unconnected --> WFConnection
> Sep 27 14:48:00 nodeb kernel: drbd0: drbd0_receiver [8935]: cstate WFConnection --> WFReportParams
> Sep 27 14:48:00 nodeb kernel: drbd0: Handshake successful: DRBD Network Protocol version 74
> Sep 27 14:48:00 nodeb kernel: drbd0: Connection established.
> Sep 27 14:48:00 nodeb kernel: drbd0: I am(S): 0:00000002:00000001:00000004:00000001:01
> Sep 27 14:48:00 nodeb kernel: drbd0: Peer(P): 1:00000002:00000001:00000004:00000002:10
> Sep 27 14:48:00 nodeb kernel: drbd0: drbd0_receiver [8935]: cstate WFReportParams --> WFBitMapT
> Sep 27 14:48:00 nodeb kernel: drbd0: Secondary/Unknown --> Secondary/Primary
> Sep 27 14:48:03 nodeb kernel: drbd0: drbd0_receiver [8935]: cstate WFBitMapT --> SyncTarget
> Sep 27 14:48:03 nodeb kernel: drbd0: Resync started as SyncTarget (need to sync 429834788 KB [107458697 bits set]).
> Sep 27 15:00:22 nodeb kernel: drbd0: PingAck did not arrive in time.

I think this is what is causing the problem. The PingAck did not arrive
within the time set in the configuration. Try setting a larger "ping-int".
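
Since you configure everything with drbdsetup directly, something along
these lines should do it. This is only a sketch: I am assuming your
drbdsetup build accepts --ping-int on the "net" command (please check its
usage output first), and the value of 30 seconds is just an illustrative
guess, not a tested recommendation.

```shell
#!/bin/sh
# Sketch: rebuild the node 2 "net" command from the original mail with a
# longer keep-alive interval. PING_INT is an assumed, illustrative value.
DEV=/dev/drbd0
LOCAL=10.0.0.2
PEER=10.0.0.1
PROTO=C
PING_INT=30   # seconds between keep-alive pings (assumption)

NET_CMD="drbdsetup $DEV net $LOCAL $PEER $PROTO --ping-int $PING_INT"

# Only print the command here; review it, then run it by hand as root.
echo "$NET_CMD"
```

On node 1 the two addresses would be swapped, as in your original
commands.
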

> Sep 27 15:00:22 nodeb kernel: drbd0: drbd0_asender [8936]: cstate SyncTarget --> NetworkFailure
> Sep 27 15:00:22 nodeb kernel: drbd0: asender terminated
> Sep 27 15:00:22 nodeb kernel: drbd0: drbd0_receiver [8935]: cstate NetworkFailure --> BrokenPipe
> Sep 27 15:00:22 nodeb kernel: drbd0: short read receiving data block: read 2872 expected 4096
> Sep 27 15:00:22 nodeb kernel: drbd0: error receiving RSDataReply, l: 4112!
> Sep 27 15:00:22 nodeb kernel: drbd0: worker terminated
> Sep 27 15:00:22 nodeb kernel: drbd0: drbd0_receiver [8935]: cstate BrokenPipe --> Unconnected
> Sep 27 15:00:22 nodeb kernel: drbd0: Connection lost.
> Sep 27 15:00:22 nodeb kernel: drbd0: drbd0_receiver [8935]: cstate Unconnected --> WFConnection
> 
> 
> 
> 
-- 
Milind
"The world is divided into one group: those who start counting at 0,
and those who don't."



