[DRBD-user] Primary dies during sync

Karin Miers k.miers at gsi.de
Thu Sep 28 09:47:39 CEST 2006



Hi all,

We have been using DRBD (various versions from 0.6 to 0.7) for several
services and never observed any real problems until now. I set up DRBD
for our web server, and the primary node died during the first sync
without any error messages.

Both systems run SuSE 10.0, kernel 2.6.13-15.10-smp, and drbd-0.7.13-2.
The DRBD device is on a RAID 5 array (ICP RAID adapter GDT8623RZ) with
an ext3 file system.

Did anybody observe similar behaviour? Is there a known bug with this
combination of distribution, kernel and DRBD that I have overlooked?
Or could it be a hardware problem? Any hints are welcome... Details of
the setup and logs are at the bottom of this mail.

Bye,

Karin

-- 
   Dr. Karin A. Miers		Tel.: 	06159-71-1334
   Abtlg. IT			E-Mail:	K.Miers at gsi.de			
								
   GSI mbH 		
   Planckstraße 1						
   64291 Darmstadt						
   Tel.: 0049 - (0)6159 - 71-0	
--
 

The initial setup was done with the following commands, using default values:

On node 1:

modprobe drbd
drbdsetup /dev/drbd0 disk /dev/sdb2 /dev/sdb1 0
drbdsetup /dev/drbd0 net 10.0.0.1 10.0.0.2 C
drbdsetup /dev/drbd0 primary

On node 2 more or less the same:

modprobe drbd
drbdsetup /dev/drbd0 disk /dev/sdb2 /dev/sdb1 0
drbdsetup /dev/drbd0 net 10.0.0.2 10.0.0.1 C

After that, the sync starts as expected. /proc/drbd looks fine on both
nodes, and so does the output of drbdsetup /dev/drbd0 state and
drbdsetup /dev/drbd0 show. But after some minutes (7 to 12 minutes; the
time is not reproducible and does not correspond to a certain amount of
synced data) node 1 is completely dead, just as if it had been switched
off. Node 2 notices that the other node is dead, but apart from this it
continues to run as usual.

I only tried it twice because node 1 is a production system and should 
not break down too often :-))

The first time, node 1 stopped after 7 minutes; the sync rate was 250 KB/s (the default).

On the second try it stopped after approximately 12 minutes; the sync
rate was 10000 KB/s. I had increased it because it looked as if it
would work.
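
To put the two sync rates in perspective, here is a rough back-of-the-envelope sketch. It assumes the configured rate is actually reached (in practice it is an upper bound) and takes the amount to sync from the "need to sync 429834788 KB" line in the log. The helper names resync_seconds and synced_kb are made up for illustration.

```shell
#!/bin/sh
# Figures taken from this mail; helper names are hypothetical.
TOTAL_KB=429834788   # "need to sync 429834788 KB" in the log

# Print whole seconds needed to move $1 KB at $2 KB/s (integer division).
resync_seconds() {
    echo $(( $1 / $2 ))
}

# Print how many KB a resync running at $1 KB/s moves in $2 minutes.
synced_kb() {
    echo $(( $1 * $2 * 60 ))
}

echo "full resync at 250 KB/s:   $(( $(resync_seconds $TOTAL_KB 250) / 3600 )) hours"
echo "full resync at 10000 KB/s: $(( $(resync_seconds $TOTAL_KB 10000) / 3600 )) hours"
echo "try 1 (7 min at 250 KB/s):    at most $(synced_kb 250 7) KB"
echo "try 2 (12 min at 10000 KB/s): at most $(synced_kb 10000 12) KB"
```

Under these assumptions a full resync would take about 477 hours at the default rate and about 11 hours at 10000 KB/s, while the two attempts moved at most 105000 KB and 7200000 KB before node 1 stopped. The two failures therefore happened at very different amounts of transferred data.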

That is what the log says:

Node 1

Sep 27 14:37:23 nodea kernel: drbd0: drbdsetup [13475]: cstate Unconfigured --> Unconnected
Sep 27 14:37:23 nodea kernel: drbd0: drbd0_receiver [13477]: cstate Unconnected --> WFConnection
Sep 27 14:43:48 nodea kernel: drbd0: drbdsetup [13849]: cstate WFConnection --> Unconnected
Sep 27 14:43:48 nodea kernel: drbd0: worker terminated
Sep 27 14:43:48 nodea kernel: drbd0: drbd0_receiver [13477]: cstate Unconnected --> Unconfigured
Sep 27 14:43:48 nodea kernel: drbd0: Connection lost.
Sep 27 14:43:48 nodea kernel: drbd0: Discarding network configuration.
Sep 27 14:43:48 nodea kernel: drbd0: drbd0_receiver [13477]: cstate Unconfigured --> StandAlone
Sep 27 14:43:48 nodea kernel: drbd0: receiver terminated
Sep 27 14:43:48 nodea kernel: drbd0: drbdsetup [13849]: cstate StandAlone --> Unconfigured
Sep 27 14:44:07 nodea kernel: drbd0: resync bitmap: bits=107478867 words=3358716
Sep 27 14:44:07 nodea kernel: drbd0: size = 409 GB (429915465 KB)
Sep 27 14:44:11 nodea kernel: drbd0: 409 GB marked out-of-sync by on disk bit-map.
Sep 27 14:44:11 nodea kernel: drbd0: Found 4 transactions (136 active extents) in activity log.
Sep 27 14:44:11 nodea kernel: drbd0: Marked additional 2048 KB as out-of-sync based on AL.
Sep 27 14:44:11 nodea kernel: drbd0: drbdsetup [13851]: cstate Unconfigured --> StandAlone
Sep 27 14:44:23 nodea kernel: drbd0: drbdsetup [13853]: cstate StandAlone --> Unconnected
Sep 27 14:44:23 nodea kernel: drbd0: drbd0_receiver [13854]: cstate Unconnected --> WFConnection
Sep 27 14:44:35 nodea kernel: drbd0: Secondary/Unknown --> Primary/Unknown
Sep 27 14:45:08 nodea kernel: kjournald starting.  Commit interval 5 seconds
Sep 27 14:45:08 nodea kernel: EXT3 FS on drbd0, internal journal
Sep 27 14:45:08 nodea kernel: EXT3-fs: mounted filesystem with ordered data mode.
Sep 27 14:48:00 nodea kernel: drbd0: drbd0_receiver [13854]: cstate WFConnection --> WFReportParams
Sep 27 14:48:00 nodea kernel: drbd0: Handshake successful: DRBD Network Protocol version 74
Sep 27 14:48:00 nodea kernel: drbd0: Connection established.
Sep 27 14:48:00 nodea kernel: drbd0: I am(P): 1:00000002:00000001:00000004:00000002:10
Sep 27 14:48:00 nodea kernel: drbd0: Peer(S): 0:00000002:00000001:00000004:00000001:01
Sep 27 14:48:00 nodea kernel: drbd0: drbd0_receiver [13854]: cstate WFReportParams --> WFBitMapS
Sep 27 14:48:02 nodea kernel: drbd0: Primary/Unknown --> Primary/Secondary
Sep 27 14:48:03 nodea kernel: drbd0: drbd0_receiver [13854]: cstate WFBitMapS --> SyncSource
Sep 27 14:48:03 nodea kernel: drbd0: Resync started as SyncSource (need to sync 429834788 KB [107458697 bits set]).
...
Sep 27 14:55:01 nodea /usr/sbin/cron[14220]: (root) CMD (/Daten/web-procs/temp_aufraeumen.pl)

That is the last entry - after that the system is dead.
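
As a sanity check on the figures in the log, the bitmap counts can be converted into kilobytes. This is only a sketch: it assumes that one bitmap bit stands for one 4 KB block, which the mail does not state but which the numbers appear to bear out. The helper name bits_to_kb is made up for illustration.

```shell
#!/bin/sh
# Assumption (not stated in the mail): one bitmap bit covers one 4 KB block.
KB_PER_BIT=4

# Print the number of KB represented by $1 bitmap bits.
bits_to_kb() {
    echo $(( $1 * KB_PER_BIT ))
}

TOTAL_BITS=107478867   # "resync bitmap: bits=107478867"
SET_BITS=107458697     # "107458697 bits set"

echo "device, from bitmap: $(bits_to_kb $TOTAL_BITS) KB"
echo "still to sync:       $(bits_to_kb $SET_BITS) KB"
echo "already in sync:     $(bits_to_kb $(( TOTAL_BITS - SET_BITS ))) KB"
```

The second figure comes out as 429834788 KB, exactly the "need to sync" value in the log. The first is 429915468 KB, 3 KB above the reported device size of 429915465 KB, as one would expect if the last bit covers a partial block. The difference, 80680 KB (about 79 MB), is what was already counted as in sync when this resync started.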

Node 2:

Sep 27 14:47:39 nodeb kernel: drbd0: resync bitmap: bits=107478867 words=3358716
Sep 27 14:47:39 nodeb kernel: drbd0: size = 409 GB (429915465 KB)
Sep 27 14:47:44 nodeb kernel: drbd0: 409 GB marked out-of-sync by on disk bit-map.
Sep 27 14:47:44 nodeb kernel: drbd0: No usable activity log found.
Sep 27 14:47:44 nodeb kernel: drbd0: drbdsetup [8932]: cstate Unconfigured --> StandAlone
Sep 27 14:48:00 nodeb kernel: drbd0: drbdsetup [8934]: cstate StandAlone --> Unconnected
Sep 27 14:48:00 nodeb kernel: drbd0: drbd0_receiver [8935]: cstate Unconnected --> WFConnection
Sep 27 14:48:00 nodeb kernel: drbd0: drbd0_receiver [8935]: cstate WFConnection --> WFReportParams
Sep 27 14:48:00 nodeb kernel: drbd0: Handshake successful: DRBD Network Protocol version 74
Sep 27 14:48:00 nodeb kernel: drbd0: Connection established.
Sep 27 14:48:00 nodeb kernel: drbd0: I am(S): 0:00000002:00000001:00000004:00000001:01
Sep 27 14:48:00 nodeb kernel: drbd0: Peer(P): 1:00000002:00000001:00000004:00000002:10
Sep 27 14:48:00 nodeb kernel: drbd0: drbd0_receiver [8935]: cstate WFReportParams --> WFBitMapT
Sep 27 14:48:00 nodeb kernel: drbd0: Secondary/Unknown --> Secondary/Primary
Sep 27 14:48:03 nodeb kernel: drbd0: drbd0_receiver [8935]: cstate WFBitMapT --> SyncTarget
Sep 27 14:48:03 nodeb kernel: drbd0: Resync started as SyncTarget (need to sync 429834788 KB [107458697 bits set]).
Sep 27 15:00:22 nodeb kernel: drbd0: PingAck did not arrive in time.
Sep 27 15:00:22 nodeb kernel: drbd0: drbd0_asender [8936]: cstate SyncTarget --> NetworkFailure
Sep 27 15:00:22 nodeb kernel: drbd0: asender terminated
Sep 27 15:00:22 nodeb kernel: drbd0: drbd0_receiver [8935]: cstate NetworkFailure --> BrokenPipe
Sep 27 15:00:22 nodeb kernel: drbd0: short read receiving data block: read 2872 expected 4096
Sep 27 15:00:22 nodeb kernel: drbd0: error receiving RSDataReply, l: 4112!
Sep 27 15:00:22 nodeb kernel: drbd0: worker terminated
Sep 27 15:00:22 nodeb kernel: drbd0: drbd0_receiver [8935]: cstate BrokenPipe --> Unconnected
Sep 27 15:00:22 nodeb kernel: drbd0: Connection lost.
Sep 27 15:00:22 nodeb kernel: drbd0: drbd0_receiver [8935]: cstate Unconnected --> WFConnection
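
The timestamps in the two logs can be lined up to bracket the moment node 1 stopped. The sketch below only does clock arithmetic on same-day HH:MM:SS values copied from the logs; the helper names to_seconds and elapsed are made up for illustration. Note that the cron line is merely the last syslog entry on node 1, not necessarily the moment of the crash.

```shell
#!/bin/sh
# Convert an HH:MM:SS timestamp ($1) to seconds since midnight.
to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Print the seconds elapsed from timestamp $1 to timestamp $2 (same day).
elapsed() {
    echo $(( $(to_seconds "$2") - $(to_seconds "$1") ))
}

# Resync start -> last syslog entry on node 1 (the cron job)
echo "start to last node 1 entry:  $(elapsed 14:48:03 14:55:01) s"
# Resync start -> node 2 reports "PingAck did not arrive in time"
echo "start to node 2 ping timeout: $(elapsed 14:48:03 15:00:22) s"
```

This gives 418 s (just under 7 minutes) up to the last entry on node 1 and 739 s (about 12 minutes) until node 2 gives up on the connection, so node 1 presumably stopped somewhere within that window.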




