Hi all,

we use drbd (different versions, 0.6 to 0.7) for several services and
never observed any real problems until now - I set up drbd for our web
server and the primary node died during the first sync without any error
messages.

Both systems run SuSE 10.0, Kernel 2.6.13-15.10-smp, drbd drbd-0.7.13-2.
The drbd is on a raid5 (ICP raid adapter GDT8623RZ), ext3 file system.

Did anybody observe a similar behaviour? Is there any known bug in that
combination of distribution, kernel and drbd which I did not recognize?
Or could it be a hardware problem?

Any hints are welcome... Details of the setup and logs are at the bottom
of this mail.

Bye,
Karin

--
Dr. Karin A. Miers
Tel.: 06159-71-1334
Abtlg. IT
E-Mail: K.Miers at gsi.de
GSI mbH
Planckstraße 1
64291 Darmstadt
Tel.: 0049 - (0)6159 - 71-0
--

The setup was done the first time by commands using default values:

On node 1:

  modprobe drbd
  drbdsetup /dev/drbd0 disk /dev/sdb2 /dev/sdb1 0
  drbdsetup /dev/drbd0 net 10.0.0.1 10.0.0.2 C
  drbdsetup /dev/drbd0 primary

On node 2 more or less the same:

  modprobe drbd
  drbdsetup /dev/drbd0 disk /dev/sdb2 /dev/sdb1 0
  drbdsetup /dev/drbd0 net 10.0.0.2 10.0.0.1 C

After that, the sync starts as expected. /proc/drbd looks fine on both
nodes and drbdsetup /dev/drbd0 state/show too.

But after some minutes (7 to 12 minutes, not a reproducible time, not
after a certain amount of sync) node 1 is completely dead - just as if it
had been switched off. Node 2 notices that the other node is dead but
apart from this it continues to run as usual.

I only tried it twice because node 1 is a production system and should
not break down too often :-))

The first time node 1 stopped after 7 minutes, sync rate was 250 KB/s
(default). On the second try it stopped after approx. 12 minutes, sync
rate was 10000 KB/s. I increased it because it looked as if it would
work.

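As a rough cross-check of the remark that the crash does not come after a fixed amount of sync, the figures quoted in this mail can be put into a small calculation. This is only a sketch: it assumes the resync ran at the configured rate for the whole time (an upper bound the mail does not confirm), and it takes the amount to sync from the "need to sync" line in the kernel log. The function names are made up for illustration.

```python
# Back-of-the-envelope arithmetic using only numbers quoted in the mail.
# Assumption: the resync ran at the configured rate the whole time,
# so the transferred amounts are upper bounds, not measurements.

KB_TO_SYNC = 429_834_788  # "need to sync ... KB" from the kernel log


def kb_transferred(rate_kb_per_s: int, minutes: float) -> float:
    """Upper bound on the data synced before the crash, in KB."""
    return rate_kb_per_s * minutes * 60


def full_sync_seconds(rate_kb_per_s: int) -> float:
    """Time a complete resync would need at a constant rate, in seconds."""
    return KB_TO_SYNC / rate_kb_per_s


if __name__ == "__main__":
    # (configured rate in KB/s, minutes until node 1 died)
    for rate, minutes in [(250, 7), (10_000, 12)]:
        done = kb_transferred(rate, minutes)
        total_h = full_sync_seconds(rate) / 3600
        print(f"{rate:>6} KB/s for {minutes:>2} min: at most {done:,.0f} KB "
              f"({100 * done / KB_TO_SYNC:.3f} % of the device), "
              f"full sync would take about {total_h:,.1f} h")
```

Under these assumptions the two attempts differ by a large factor in the amount transferred before the crash, which fits the description that the failure is not tied to a particular position in the sync.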
That is what the log says:

Node 1:

Sep 27 14:37:23 nodea kernel: drbd0: drbdsetup [13475]: cstate Unconfigured --> Unconnected
Sep 27 14:37:23 nodea kernel: drbd0: drbd0_receiver [13477]: cstate Unconnected --> WFConnection
Sep 27 14:43:48 nodea kernel: drbd0: drbdsetup [13849]: cstate WFConnection --> Unconnected
Sep 27 14:43:48 nodea kernel: drbd0: worker terminated
Sep 27 14:43:48 nodea kernel: drbd0: drbd0_receiver [13477]: cstate Unconnected --> Unconfigured
Sep 27 14:43:48 nodea kernel: drbd0: Connection lost.
Sep 27 14:43:48 nodea kernel: drbd0: Discarding network configuration.
Sep 27 14:43:48 nodea kernel: drbd0: drbd0_receiver [13477]: cstate Unconfigured --> StandAlone
Sep 27 14:43:48 nodea kernel: drbd0: receiver terminated
Sep 27 14:43:48 nodea kernel: drbd0: drbdsetup [13849]: cstate StandAlone --> Unconfigured
Sep 27 14:44:07 nodea kernel: drbd0: resync bitmap: bits=107478867 words=3358716
Sep 27 14:44:07 nodea kernel: drbd0: size = 409 GB (429915465 KB)
Sep 27 14:44:11 nodea kernel: drbd0: 409 GB marked out-of-sync by on disk bit-map.
Sep 27 14:44:11 nodea kernel: drbd0: Found 4 transactions (136 active extents) in activity log.
Sep 27 14:44:11 nodea kernel: drbd0: Marked additional 2048 KB as out-of-sync based on AL.
Sep 27 14:44:11 nodea kernel: drbd0: drbdsetup [13851]: cstate Unconfigured --> StandAlone
Sep 27 14:44:23 nodea kernel: drbd0: drbdsetup [13853]: cstate StandAlone --> Unconnected
Sep 27 14:44:23 nodea kernel: drbd0: drbd0_receiver [13854]: cstate Unconnected --> WFConnection
Sep 27 14:44:35 nodea kernel: drbd0: Secondary/Unknown --> Primary/Unknown
Sep 27 14:45:08 nodea kernel: kjournald starting. Commit interval 5 seconds
Sep 27 14:45:08 nodea kernel: EXT3 FS on drbd0, internal journal
Sep 27 14:45:08 nodea kernel: EXT3-fs: mounted filesystem with ordered data mode.
Sep 27 14:48:00 nodea kernel: drbd0: drbd0_receiver [13854]: cstate WFConnection --> WFReportParams
Sep 27 14:48:00 nodea kernel: drbd0: Handshake successful: DRBD Network Protocol version 74
Sep 27 14:48:00 nodea kernel: drbd0: Connection established.
Sep 27 14:48:00 nodea kernel: drbd0: I am(P): 1:00000002:00000001:00000004:00000002:10
Sep 27 14:48:00 nodea kernel: drbd0: Peer(S): 0:00000002:00000001:00000004:00000001:01
Sep 27 14:48:00 nodea kernel: drbd0: drbd0_receiver [13854]: cstate WFReportParams --> WFBitMapS
Sep 27 14:48:02 nodea kernel: drbd0: Primary/Unknown --> Primary/Secondary
Sep 27 14:48:03 nodea kernel: drbd0: drbd0_receiver [13854]: cstate WFBitMapS --> SyncSource
Sep 27 14:48:03 nodea kernel: drbd0: Resync started as SyncSource (need to sync 429834788 KB [107458697 bits set]).
...
Sep 27 14:55:01 nodea /usr/sbin/cron[14220]: (root) CMD (/Daten/web-procs/temp_aufraeumen.pl)

That is the last entry - after that the system is dead.

Node 2:

Sep 27 14:47:39 nodeb kernel: drbd0: resync bitmap: bits=107478867 words=3358716
Sep 27 14:47:39 nodeb kernel: drbd0: size = 409 GB (429915465 KB)
Sep 27 14:47:44 nodeb kernel: drbd0: 409 GB marked out-of-sync by on disk bit-map.
Sep 27 14:47:44 nodeb kernel: drbd0: No usable activity log found.
Sep 27 14:47:44 nodeb kernel: drbd0: drbdsetup [8932]: cstate Unconfigured --> StandAlone
Sep 27 14:48:00 nodeb kernel: drbd0: drbdsetup [8934]: cstate StandAlone --> Unconnected
Sep 27 14:48:00 nodeb kernel: drbd0: drbd0_receiver [8935]: cstate Unconnected --> WFConnection
Sep 27 14:48:00 nodeb kernel: drbd0: drbd0_receiver [8935]: cstate WFConnection --> WFReportParams
Sep 27 14:48:00 nodeb kernel: drbd0: Handshake successful: DRBD Network Protocol version 74
Sep 27 14:48:00 nodeb kernel: drbd0: Connection established.
Sep 27 14:48:00 nodeb kernel: drbd0: I am(S): 0:00000002:00000001:00000004:00000001:01
Sep 27 14:48:00 nodeb kernel: drbd0: Peer(P): 1:00000002:00000001:00000004:00000002:10
Sep 27 14:48:00 nodeb kernel: drbd0: drbd0_receiver [8935]: cstate WFReportParams --> WFBitMapT
Sep 27 14:48:00 nodeb kernel: drbd0: Secondary/Unknown --> Secondary/Primary
Sep 27 14:48:03 nodeb kernel: drbd0: drbd0_receiver [8935]: cstate WFBitMapT --> SyncTarget
Sep 27 14:48:03 nodeb kernel: drbd0: Resync started as SyncTarget (need to sync 429834788 KB [107458697 bits set]).
Sep 27 15:00:22 nodeb kernel: drbd0: PingAck did not arrive in time.
Sep 27 15:00:22 nodeb kernel: drbd0: drbd0_asender [8936]: cstate SyncTarget --> NetworkFailure
Sep 27 15:00:22 nodeb kernel: drbd0: asender terminated
Sep 27 15:00:22 nodeb kernel: drbd0: drbd0_receiver [8935]: cstate NetworkFailure --> BrokenPipe
Sep 27 15:00:22 nodeb kernel: drbd0: short read receiving data block: read 2872 expected 4096
Sep 27 15:00:22 nodeb kernel: drbd0: error receiving RSDataReply, l: 4112!
Sep 27 15:00:22 nodeb kernel: drbd0: worker terminated
Sep 27 15:00:22 nodeb kernel: drbd0: drbd0_receiver [8935]: cstate BrokenPipe --> Unconnected
Sep 27 15:00:22 nodeb kernel: drbd0: Connection lost.
Sep 27 15:00:22 nodeb kernel: drbd0: drbd0_receiver [8935]: cstate Unconnected --> WFConnection
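
When comparing logs like the ones above, it can help to pull out just the connection-state (cstate) transitions and the time between them. The following is a minimal sketch, not a DRBD tool: the regular expression is written only against the line format seen in this mail, the year is an assumption (syslog lines carry none), and the function names are made up for illustration. Timestamps are compared within one host only, since nothing in the mail says the two clocks are synchronized.

```python
import re
from datetime import datetime

# Matches lines such as:
#   Sep 27 15:00:22 nodeb kernel: drbd0: drbd0_asender [8936]: cstate SyncTarget --> NetworkFailure
# Assumption: this pattern only covers the log format shown in this mail.
CSTATE = re.compile(
    r"^(?P<ts>\w{3} +\d+ \d\d:\d\d:\d\d) (?P<host>\S+) kernel: drbd0: "
    r".*cstate (?P<old>\w+) --> (?P<new>\w+)$"
)


def parse_ts(ts: str, year: int = 2006) -> datetime:
    """Parse a syslog timestamp; the year is assumed, not taken from the log."""
    return datetime.strptime(f"{year} {ts}", "%Y %b %d %H:%M:%S")


def transitions(log_text: str):
    """Return (time, host, old_state, new_state) for every cstate line."""
    result = []
    for line in log_text.splitlines():
        m = CSTATE.match(line.strip())
        if m:
            result.append((parse_ts(m["ts"]), m["host"], m["old"], m["new"]))
    return result


# Three lines copied from the node 2 log in this mail.
SAMPLE = """\
Sep 27 14:48:03 nodeb kernel: drbd0: drbd0_receiver [8935]: cstate WFBitMapT --> SyncTarget
Sep 27 15:00:22 nodeb kernel: drbd0: PingAck did not arrive in time.
Sep 27 15:00:22 nodeb kernel: drbd0: drbd0_asender [8936]: cstate SyncTarget --> NetworkFailure
"""

if __name__ == "__main__":
    found = transitions(SAMPLE)
    for when, host, old, new in found:
        print(f"{when:%H:%M:%S} {host}: {old} -> {new}")
    # Time node 2 spent as SyncTarget before it noticed the peer was gone.
    print("seconds in SyncTarget:", (found[1][0] - found[0][0]).total_seconds())
```

Feeding the complete log of each node to `transitions()` gives a compact timeline per host; lines without a cstate change, such as the PingAck message, are skipped on purpose.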