[DRBD-user] kernel panic on primary server

Thomas Böhme T.Boehme at mc-wetter.de
Fri Jun 3 19:41:42 CEST 2005



Hello,

We have the following setup: Debian Sarge with a vanilla 2.6.11.10 kernel and drbd 0.7.10.

Two hosts, each with 1 GB RAM, a P4 2.80 GHz, 2x Intel Pro/1000 onboard and a QLogic Fibre Channel controller. Each host is connected to its own Fibre Channel RAID with 1.4 TB of disk space on a RAID 0+1 device (/dev/sda). Both are in a heartbeat setup and run NFS and Samba with heavy read/write access from the clients. The hosts are connected via a network crosslink for heartbeat and drbd traffic, plus an additional serial crosslink for heartbeat.

When one host (let's say 'boston') is drbd primary and the other is 'unconnected', everything is OK. Standalone, the server has been working without problems (no kernel warnings/errors, good performance) for many weeks.

If I start drbd on the second host ('newyork') to initialize the sync, everything looks fine. Newyork starts the sync:

May 25 13:45:04 newyork kernel: drbd: initialised. Version: 0.7.10 (api:77/proto:74)
May 25 13:45:04 newyork kernel: drbd: SVN Revision: 1743 build by phil at mescal, 2005-01-31 12:22:07
May 25 13:45:04 newyork kernel: drbd: registered as block device major 147
May 25 13:45:04 newyork kernel: drbd0: Creating state block
May 25 13:45:04 newyork kernel: klogd 1.4.1, ---------- state change ----------
May 25 13:45:04 newyork kernel: No module symbols loaded - kernel modules not enabled.
May 25 13:45:04 newyork kernel: drbd0: resync bitmap: bits=365863671 words=11433240
May 25 13:45:04 newyork kernel: drbd0: size = 1395 GB (1463454684 KB)
May 25 13:45:04 newyork kernel: drbd0: Assuming that all blocks are out of sync (aka FullSync)
May 25 13:45:25 newyork kernel: drbd0: 1463454684 KB now marked out-of-sync by on disk bit-map.
May 25 13:45:25 newyork kernel: drbd0: drbdsetup [2147]: cstate Unconfigured --> StandAlone
May 25 13:45:26 newyork kernel: drbd0: drbdsetup [2160]: cstate StandAlone --> Unconnected
May 25 13:45:26 newyork kernel: drbd0: drbd0_receiver [2161]: cstate Unconnected --> WFConnection
May 25 13:45:26 newyork kernel: drbd0: drbd0_receiver [2161]: cstate WFConnection --> WFReportParams
May 25 13:45:26 newyork kernel: drbd0: Handshake successful: DRBD Network Protocol version 74
May 25 13:45:26 newyork kernel: drbd0: Connection established.
May 25 13:45:26 newyork kernel: drbd0: I am(S): 0:00000001:00000001:00000001:00000001:00
May 25 13:45:26 newyork kernel: drbd0: Peer(P): 1:0000000d:00000004:00000005:00000004:10
May 25 13:45:26 newyork kernel: drbd0: drbd0_receiver [2161]: cstate WFReportParams --> WFBitMapT
May 25 13:45:26 newyork kernel: drbd0: Secondary/Unknown --> Secondary/Primary
May 25 13:45:27 newyork kernel: drbd0: drbd0_receiver [2161]: cstate WFBitMapT --> SyncTarget
May 25 13:45:27 newyork kernel: drbd0: Resync started as SyncTarget (need to sync 1463454684 KB [365863671 bits set]).

May 25 13:45:26 boston kernel: drbd0: drbd0_receiver [2049]: cstate WFConnection --> WFReportParams
May 25 13:45:26 boston kernel: drbd0: Handshake successful: DRBD Network Protocol version 74
May 25 13:45:26 boston kernel: drbd0: Connection established.
May 25 13:45:26 boston kernel: drbd0: I am(P): 1:0000000d:00000004:00000005:00000004:10
May 25 13:45:26 boston kernel: drbd0: Peer(S): 0:00000001:00000001:00000001:00000001:00
May 25 13:45:26 boston kernel: drbd0: drbd0_receiver [2049]: cstate WFReportParams --> WFBitMapS
May 25 13:45:27 boston kernel: drbd0: Primary/Unknown --> Primary/Secondary
May 25 13:45:27 boston kernel: drbd0: drbd0_receiver [2049]: cstate WFBitMapS --> SyncSource
May 25 13:45:27 boston kernel: drbd0: Resync started as SyncSource (need to sync 1463454684 KB [365863671 bits set]).
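
The bitmap figures in the newyork log can be cross-checked against the device size. The sketch below is only a sanity check: the 4 KB-per-bit granularity and the 32-bit word size are assumptions chosen because they reproduce the logged numbers, not values taken from the drbd source.

```python
# Values copied from the newyork kernel log above.
SIZE_KB = 1463454684       # "size = 1395 GB (1463454684 KB)"
BITS_LOGGED = 365863671    # "resync bitmap: bits=365863671"
WORDS_LOGGED = 11433240    # "words=11433240"

# Assumptions (not stated in the log): one bit per 4 KB, 32-bit words.
KB_PER_BIT = 4
BITS_PER_WORD = 32

bits = SIZE_KB // KB_PER_BIT
words = (bits + BITS_PER_WORD - 1) // BITS_PER_WORD  # round up to whole words
size_gb = SIZE_KB // (1024 * 1024)                   # round down, as the log appears to
bitmap_mib = words * (BITS_PER_WORD // 8) / 2**20    # size of the bitmap itself

print("bits match log: ", bits == BITS_LOGGED)
print("words match log:", words == WORDS_LOGGED)
print("size in GB:     ", size_gb)
print(f"bitmap size under these assumptions: {bitmap_mib:.1f} MiB")
```

Under these assumptions all three logged values (bits, words, 1395 GB) come out exactly, so the bitmap for the 1.4 TB device would be a few tens of MiB.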

Then, on Newyork only, 279 drbd messages like these occur:

May 25 15:25:08 newyork kernel: drbd0: [drbd0_worker/2148] sock_sendmsg time expired, ko = 4294967295
May 25 17:05:07 newyork kernel: drbd0: [drbd0_worker/2148] sock_sendmsg time expired, ko = 4294967295
May 25 18:03:18 newyork kernel: drbd0: [drbd0_worker/2148] sock_sendmsg time expired, ko = 4294967295
May 25 19:01:08 newyork kernel: drbd0: [drbd0_worker/2148] sock_sendmsg time expired, ko = 4294967295
May 25 19:38:13 newyork kernel: drbd0: [drbd0_worker/2148] sock_sendmsg time expired, ko = 4294967295
May 25 19:52:56 newyork kernel: drbd0: [drbd0_worker/2148] sock_sendmsg time expired, ko = 4294967295
May 25 19:53:05 newyork kernel: drbd0: [drbd0_worker/2148] sock_sendmsg time expired, ko = 4294967295
May 25 19:57:17 newyork kernel: drbd0: [drbd0_worker/2148] sock_sendmsg time expired, ko = 4294967295
...
...
May 26 03:03:54 newyork kernel: drbd0: [drbd0_worker/2148] sock_sendmsg time expired, ko = 4294967295
May 26 03:04:05 newyork kernel: drbd0: [drbd0_worker/2148] sock_sendmsg time expired, ko = 4294967295
May 26 03:04:08 newyork kernel: drbd0: [drbd0_worker/2148] sock_sendmsg time expired, ko = 4294967294
May 26 03:04:11 newyork kernel: drbd0: [drbd0_worker/2148] sock_sendmsg time expired, ko = 4294967293
May 26 03:04:14 newyork kernel: drbd0: [drbd0_worker/2148] sock_sendmsg time expired, ko = 4294967292
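
The odd-looking "ko" values are 2^32 - 1 and the numbers just below it. A possible reading, and only an assumption here, is that "ko" is an unsigned 32-bit counter that starts at 0 and is decremented on each timeout, so it wraps around instead of going negative. The sketch reproduces the last four logged values on that assumption; `decrement_u32` and `as_signed32` are illustrative helpers, not drbd functions.

```python
MASK32 = 0xFFFFFFFF  # all ones in 32 bits, i.e. 2**32 - 1

def decrement_u32(value):
    """Decrement with unsigned 32-bit wrap-around (as C unsigned arithmetic does)."""
    return (value - 1) & MASK32

def as_signed32(value):
    """Reinterpret an unsigned 32-bit value as a signed one."""
    return value - 2**32 if value >= 2**31 else value

# Last four "ko" values from the newyork log.
logged = [4294967295, 4294967294, 4294967293, 4294967292]

# Assumption: the counter starts at 0 and is decremented once per timeout.
counter = 0
simulated = []
for _ in logged:
    counter = decrement_u32(counter)
    simulated.append(counter)

print(simulated == logged)
print([as_signed32(v) for v in logged])
```

Read as signed numbers, the logged values are simply -1, -2, -3 and -4.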


Later the sync finishes:

May 26 03:06:56 newyork kernel: drbd0: Resync done (total 48096 sec; paused 0 sec; 30424 K/sec)
May 26 03:06:56 newyork kernel: drbd0: drbd0_worker [2148]: cstate SyncTarget --> Connected

May 26 03:07:01 boston kernel: drbd0: Resync done (total 48095 sec; paused 0 sec; 30428 K/sec)
May 26 03:07:01 boston kernel: drbd0: drbd0_worker [2036]: cstate SyncSource --> Connected
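
The reported resync rates can be checked roughly against the device size and the elapsed time. This is a plain average (size divided by seconds); how drbd itself computes the figure it prints is not known here, so a small difference from the logged value is to be expected.

```python
SIZE_KB = 1463454684  # amount to sync, from "need to sync 1463454684 KB"

# (elapsed seconds, rate in K/sec) from the "Resync done" lines.
runs = {
    "newyork": (48096, 30424),
    "boston": (48095, 30428),
}

def average_rate(size_kb, seconds):
    """Plain average throughput in KB per second, rounded down."""
    return size_kb // seconds

for host, (seconds, logged_rate) in runs.items():
    rate = average_rate(SIZE_KB, seconds)
    hours = seconds / 3600
    print(f"{host}: {rate} KB/s computed, {logged_rate} K/sec logged, {hours:.1f} h")
```

The computed averages land within a few KB/s of the logged ones, and the duration of a little over 13 hours matches the timestamps (13:45 to 03:06).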



Then, after some hours, the primary crashes with a kernel panic (I have no output from the panic itself). Newyork logs:

May 26 18:50:09 newyork kernel: drbd0: PingAck did not arrive in time.
May 26 18:50:09 newyork kernel: drbd0: drbd0_asender [2171]: cstate Connected --> NetworkFailure
May 26 18:50:09 newyork kernel: drbd0: asender terminated
May 26 18:50:09 newyork kernel: drbd0: drbd0_receiver [2161]: cstate NetworkFailure --> BrokenPipe
May 26 18:50:09 newyork kernel: drbd0: short read receiving data block: read 568 expected 4096
May 26 18:50:09 newyork kernel: drbd0: error receiving Data, l: 4112!
May 26 18:50:09 newyork kernel: drbd0: worker terminated
May 26 18:50:09 newyork kernel: drbd0: drbd0_receiver [2161]: cstate BrokenPipe --> Unconnected
May 26 18:50:09 newyork kernel: drbd0: Connection lost.
May 26 18:50:09 newyork kernel: drbd0: drbd0_receiver [2161]: cstate Unconnected --> WFConnection
May 26 18:50:11 newyork kernel: drbd0: Secondary/Unknown --> Primary/Unknown
May 26 18:50:12 newyork kernel: ReiserFS: drbd0: found reiserfs format "3.6" with standard journal
May 26 18:50:36 newyork kernel: ReiserFS: drbd0: using ordered data mode
May 26 18:50:36 newyork kernel: ReiserFS: drbd0: journal params: device drbd0, size 8192, journal first block 18, max trans len 1024, max batch 900, max commit age 30, max trans age 30
May 26 18:50:36 newyork kernel: ReiserFS: drbd0: checking transaction log (drbd0)
May 26 18:50:37 newyork kernel: ReiserFS: drbd0: replayed 8 transactions in 1 seconds
May 26 18:50:37 newyork kernel: ReiserFS: drbd0: Using r5 hash to sort names


I then reset Boston remotely; it came up again and moved into the secondary role after resyncing from Newyork.


A few hours later Newyork (still the primary) crashed and Boston became primary again.



Here is my drbd.conf:

resource data1 {
  protocol C;

  startup {
    wfc-timeout  120;
    degr-wfc-timeout 120;
  }

  disk {
    on-io-error   detach;
  }

  net {
  }

  syncer {
    rate 700000K;
    group 1;
    al-extents 1009;
  }

  on boston {
    device     /dev/drbd0;
    disk       /dev/sda1;
    address    192.168.0.2:7788;	# crosslink
    meta-disk  internal;
  }

  on newyork {
    device    /dev/drbd0;
    disk      /dev/sda1;
    address   192.168.0.1:7788;	# crosslink
    meta-disk internal;
  }
}
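
For the syncer section, the configured rate can be compared with what the crosslink could carry at best. The sketch below rests on assumptions: that "rate 700000K" means 700000 KB/s, that the crosslink runs at 1 Gbit/s (the hosts have Intel Pro/1000 interfaces), and that protocol overhead is ignored. It is a back-of-the-envelope comparison, not a tuning recommendation.

```python
CONFIGURED_RATE_KB = 700_000  # "rate 700000K" in the syncer section (assumed KB/s)
OBSERVED_RATE_KB = 30_428     # from boston's "Resync done" log line
LINK_MBIT = 1000              # assumed speed of the crosslink in Mbit/s

# Theoretical ceiling of the link in KB/s, ignoring all protocol overhead.
link_kb_per_s = LINK_MBIT * 1_000_000 / 8 / 1024

configured_vs_link = CONFIGURED_RATE_KB / link_kb_per_s
observed_vs_link = OBSERVED_RATE_KB / link_kb_per_s

print(f"link ceiling:       {link_kb_per_s:.0f} KB/s")
print(f"configured / link:  {configured_vs_link:.2f}")
print(f"observed / link:    {observed_vs_link:.2f}")
```

On these assumptions the configured rate is several times more than the link can carry, so it does not act as a limit at all, while the observed resync used roughly a quarter of the link.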


Does anyone have an idea? I'm not sure whether the syncer section is OK, or whether it is a good idea to use an internal meta-disk on a 1.4 TB device.


Thomas


