Hello,

we have the following setup: Debian Sarge with a vanilla 2.6.11.10 kernel and drbd 0.7.10. Two hosts, each with 1GB RAM, P4 2.80GHz, 2x IntelPro1000 onboard, and a QLogic fibrechannel controller. Each host is connected to its own fibrechannel RAID with 1.4TB disk space on a RAID 0+1 device (/dev/sda). Both are in a heartbeat setup and run nfs and samba with high read/write access from the clients. The hosts are connected via a network crosslink for heartbeat and drbd traffic, plus an additional serial crosslink for heartbeat.

When one host (let's say 'boston') is drbd primary and the other is 'unconnected', everything is ok. Standalone, the server has worked without problems (no kernel warnings/errors, good performance) for many weeks.

If I start drbd on the second host ('newyork') to initialize the sync, everything looks fine. Newyork starts the sync:

May 25 13:45:04 newyork kernel: drbd: initialised. Version: 0.7.10 (api:77/proto:74)
May 25 13:45:04 newyork kernel: drbd: SVN Revision: 1743 build by phil at mescal, 2005-01-31 12:22:07
May 25 13:45:04 newyork kernel: drbd: registered as block device major 147
May 25 13:45:04 newyork kernel: drbd0: Creating state block
May 25 13:45:04 newyork kernel: klogd 1.4.1, ---------- state change ----------
May 25 13:45:04 newyork kernel: No module symbols loaded - kernel modules not enabled.
May 25 13:45:04 newyork kernel: drbd0: resync bitmap: bits=365863671 words=11433240
May 25 13:45:04 newyork kernel: drbd0: size = 1395 GB (1463454684 KB)
May 25 13:45:04 newyork kernel: drbd0: Assuming that all blocks are out of sync (aka FullSync)
May 25 13:45:25 newyork kernel: drbd0: 1463454684 KB now marked out-of-sync by on disk bit-map.
May 25 13:45:25 newyork kernel: drbd0: drbdsetup [2147]: cstate Unconfigured --> StandAlone
May 25 13:45:26 newyork kernel: drbd0: drbdsetup [2160]: cstate StandAlone --> Unconnected
May 25 13:45:26 newyork kernel: drbd0: drbd0_receiver [2161]: cstate Unconnected --> WFConnection
May 25 13:45:26 newyork kernel: drbd0: drbd0_receiver [2161]: cstate WFConnection --> WFReportParams
May 25 13:45:26 newyork kernel: drbd0: Handshake successful: DRBD Network Protocol version 74
May 25 13:45:26 newyork kernel: drbd0: Connection established.
May 25 13:45:26 newyork kernel: drbd0: I am(S): 0:00000001:00000001:00000001:00000001:00
May 25 13:45:26 newyork kernel: drbd0: Peer(P): 1:0000000d:00000004:00000005:00000004:10
May 25 13:45:26 newyork kernel: drbd0: drbd0_receiver [2161]: cstate WFReportParams --> WFBitMapT
May 25 13:45:26 newyork kernel: drbd0: Secondary/Unknown --> Secondary/Primary
May 25 13:45:27 newyork kernel: drbd0: drbd0_receiver [2161]: cstate WFBitMapT --> SyncTarget
May 25 13:45:27 newyork kernel: drbd0: Resync started as SyncTarget (need to sync 1463454684 KB [365863671 bits set]).

May 25 13:45:26 boston kernel: drbd0: drbd0_receiver [2049]: cstate WFConnection --> WFReportParams
May 25 13:45:26 boston kernel: drbd0: Handshake successful: DRBD Network Protocol version 74
May 25 13:45:26 boston kernel: drbd0: Connection established.
May 25 13:45:26 boston kernel: drbd0: I am(P): 1:0000000d:00000004:00000005:00000004:10
May 25 13:45:26 boston kernel: drbd0: Peer(S): 0:00000001:00000001:00000001:00000001:00
May 25 13:45:26 boston kernel: drbd0: drbd0_receiver [2049]: cstate WFReportParams --> WFBitMapS
May 25 13:45:27 boston kernel: drbd0: Primary/Unknown --> Primary/Secondary
May 25 13:45:27 boston kernel: drbd0: drbd0_receiver [2049]: cstate WFBitMapS --> SyncSource
May 25 13:45:27 boston kernel: drbd0: Resync started as SyncSource (need to sync 1463454684 KB [365863671 bits set]).
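
As a quick consistency check of the bitmap and size figures in the log above, here is a minimal Python sketch. It assumes (the log does not say so) that each bitmap bit tracks one 4 KB block and that the bitmap is counted in 32-bit words; the function names are made up for illustration and are not part of drbd.

```python
# Figures copied from the newyork log lines above.
BITS = 365863671        # "resync bitmap: bits=..."
WORDS = 11433240        # "resync bitmap: ... words=..."
SIZE_KB = 1463454684    # "size = 1395 GB (1463454684 KB)"

KB_PER_BIT = 4          # assumption: one bitmap bit per 4 KB block
BITS_PER_WORD = 32      # assumption: bitmap held in 32-bit words


def covered_kb(bits, kb_per_bit=KB_PER_BIT):
    """KB of device space tracked by the given number of bitmap bits."""
    return bits * kb_per_bit


def words_needed(bits, bits_per_word=BITS_PER_WORD):
    """Number of words needed to hold the bits, rounded up."""
    return (bits + bits_per_word - 1) // bits_per_word


def whole_gb(size_kb):
    """Device size in whole GB, with 1 GB taken as 1024 * 1024 KB."""
    return size_kb // (1024 * 1024)


if __name__ == "__main__":
    print(covered_kb(BITS) == SIZE_KB)    # bits * 4 KB vs. reported size
    print(words_needed(BITS) == WORDS)    # rounded-up word count vs. log
    print(whole_gb(SIZE_KB))              # compare with "1395 GB"
```

Under those two assumptions the three logged numbers agree with each other, so the full-sync size of 1463454684 KB is simply the whole device.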

Then, on Newyork only, 279 drbd messages like these occur:

May 25 15:25:08 newyork kernel: drbd0: [drbd0_worker/2148] sock_sendmsg time expired, ko = 4294967295
May 25 17:05:07 newyork kernel: drbd0: [drbd0_worker/2148] sock_sendmsg time expired, ko = 4294967295
May 25 18:03:18 newyork kernel: drbd0: [drbd0_worker/2148] sock_sendmsg time expired, ko = 4294967295
May 25 19:01:08 newyork kernel: drbd0: [drbd0_worker/2148] sock_sendmsg time expired, ko = 4294967295
May 25 19:38:13 newyork kernel: drbd0: [drbd0_worker/2148] sock_sendmsg time expired, ko = 4294967295
May 25 19:52:56 newyork kernel: drbd0: [drbd0_worker/2148] sock_sendmsg time expired, ko = 4294967295
May 25 19:53:05 newyork kernel: drbd0: [drbd0_worker/2148] sock_sendmsg time expired, ko = 4294967295
May 25 19:57:17 newyork kernel: drbd0: [drbd0_worker/2148] sock_sendmsg time expired, ko = 4294967295
...
...
May 26 03:03:54 newyork kernel: drbd0: [drbd0_worker/2148] sock_sendmsg time expired, ko = 4294967295
May 26 03:04:05 newyork kernel: drbd0: [drbd0_worker/2148] sock_sendmsg time expired, ko = 4294967295
May 26 03:04:08 newyork kernel: drbd0: [drbd0_worker/2148] sock_sendmsg time expired, ko = 4294967294
May 26 03:04:11 newyork kernel: drbd0: [drbd0_worker/2148] sock_sendmsg time expired, ko = 4294967293
May 26 03:04:14 newyork kernel: drbd0: [drbd0_worker/2148] sock_sendmsg time expired, ko = 4294967292

Later the sync finishes:

May 26 03:06:56 newyork kernel: drbd0: Resync done (total 48096 sec; paused 0 sec; 30424 K/sec)
May 26 03:06:56 newyork kernel: drbd0: drbd0_worker [2148]: cstate SyncTarget --> Connected
May 26 03:07:01 boston kernel: drbd0: Resync done (total 48095 sec; paused 0 sec; 30428 K/sec)
May 26 03:07:01 boston kernel: drbd0: drbd0_worker [2036]: cstate SyncSource --> Connected

Then, some hours later, the primary crashes with a kernel panic (I have no output from that). This is what Newyork logs:

May 26 18:50:09 newyork kernel: drbd0: PingAck did not arrive in time.
May 26 18:50:09 newyork kernel: drbd0: drbd0_asender [2171]: cstate Connected --> NetworkFailure
May 26 18:50:09 newyork kernel: drbd0: asender terminated
May 26 18:50:09 newyork kernel: drbd0: drbd0_receiver [2161]: cstate NetworkFailure --> BrokenPipe
May 26 18:50:09 newyork kernel: drbd0: short read receiving data block: read 568 expected 4096
May 26 18:50:09 newyork kernel: drbd0: error receiving Data, l: 4112!
May 26 18:50:09 newyork kernel: drbd0: worker terminated
May 26 18:50:09 newyork kernel: drbd0: drbd0_receiver [2161]: cstate BrokenPipe --> Unconnected
May 26 18:50:09 newyork kernel: drbd0: Connection lost.
May 26 18:50:09 newyork kernel: drbd0: drbd0_receiver [2161]: cstate Unconnected --> WFConnection
May 26 18:50:11 newyork kernel: drbd0: Secondary/Unknown --> Primary/Unknown
May 26 18:50:12 newyork kernel: ReiserFS: drbd0: found reiserfs format "3.6" with standard journal
May 26 18:50:36 newyork kernel: ReiserFS: drbd0: using ordered data mode
May 26 18:50:36 newyork kernel: ReiserFS: drbd0: journal params: device drbd0, size 8192, journal first block 18, max trans len 1024, max batch 900, max commit age 30, max trans age 30
May 26 18:50:36 newyork kernel: ReiserFS: drbd0: checking transaction log (drbd0)
May 26 18:50:37 newyork kernel: ReiserFS: drbd0: replayed 8 transactions in 1 seconds
May 26 18:50:37 newyork kernel: ReiserFS: drbd0: Using r5 hash to sort names

I then reset Boston remotely; it came up again and moved into the secondary role after resyncing from Newyork. A few hours later Newyork (still the primary) crashed and Boston became master again.
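
The resync figures and the ko values logged above can be cross-checked with a few lines of Python. This is only a sketch: the 1 Gbit/s link speed is an assumption (the crosslink speed is not stated, only that the NICs are IntelPro1000), the function names are made up for illustration, and reading the ko values as a wrapped 32-bit number is an observation about the digits, not a statement about how drbd maintains that counter.

```python
TOTAL_KB = 1463454684        # amount to sync, from "Resync started" lines
NEWYORK_SECONDS = 48096      # "Resync done (total 48096 sec; ...)"

# Assumption: the crosslink runs at 1 Gbit/s; 1 KB taken as 1024 bytes.
ASSUMED_LINK_KB_PER_S = 10**9 / 8 / 1024


def average_rate(total_kb, seconds):
    """Average resync throughput in KB per second."""
    return total_kb / seconds


def as_signed32(value):
    """Reinterpret an unsigned 32-bit number as a signed one."""
    return value - 2**32 if value >= 2**31 else value


if __name__ == "__main__":
    rate = average_rate(TOTAL_KB, NEWYORK_SECONDS)
    print(f"average rate: {rate:.0f} KB/s")          # log says 30424 K/sec
    print(f"resync time:  {NEWYORK_SECONDS / 3600:.1f} h")
    print(f"share of assumed link: {rate / ASSUMED_LINK_KB_PER_S:.0%}")
    for ko in (4294967295, 4294967294, 4294967293, 4294967292):
        print(ko, "->", as_signed32(ko))
```

The computed average lands within a few KB/s of the rates both hosts report, and the ko values correspond to -1, -2, -3 and -4 when read as signed 32-bit numbers.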

Here is my drbd.conf:

resource data1 {
  protocol C;
  startup {
    wfc-timeout 120;
    degr-wfc-timeout 120;
  }
  disk {
    on-io-error detach;
  }
  net {
  }
  syncer {
    rate 700000K;
    group 1;
    al-extents 1009;
  }
  on boston {
    device /dev/drbd0;
    disk /dev/sda1;
    address 192.168.0.2:7788; # crosslink
    meta-disk internal;
  }
  on newyork {
    device /dev/drbd0;
    disk /dev/sda1;
    address 192.168.0.1:7788; # crosslink
    meta-disk internal;
  }
}

Does anyone have an idea? I'm not sure whether the syncer section is ok, or whether it is a good idea to use an internal meta-disk on a 1.4 TB device.

Thomas