[Drbd-dev] freeze under postmark io bench

Laurent Denel ldenel at gmail.com
Thu Mar 23 09:08:21 CET 2006


Hi,

Quick answer for now; I'll try to spend more time on this later today.

> what do you mean by "crashed the host"?

The host is in a server room and I don't have quick access to it
yet; I'll configure the iLO remote management later today. So
far, the crash looks like this:
- ssh console frozen
- no more ssh access (timeout)
- ping still ok
- "Unknown" status for the host seen on the other node
- automated reboot by HP's BIOS after 15 min.

> any output anywhere?

Nothing in the syslog files, as usual when the kernel is having a hard time ;)
But here is the drbd output from before the crash (it may at least help to validate my conf):

Mar 22 19:03:42 bwinf5401 kernel: drbd: initialised. Version: 0.7.17 (api:77/proto:74)
Mar 22 19:03:42 bwinf5401 kernel: drbd: SVN Revision: 2111 build by root@wingate03, 2006-03-20 14:26:07
Mar 22 19:03:42 bwinf5401 kernel: drbd: registered as block device major 147
Mar 22 19:04:54 bwinf5401 kernel: drbd0: resync bitmap: bits=18578276 words=580572
Mar 22 19:04:54 bwinf5401 kernel: drbd0: size = 70 GB (74313104 KB)
Mar 22 19:04:55 bwinf5401 kernel: drbd0: 0 KB marked out-of-sync by on disk bit-map.
Mar 22 19:04:55 bwinf5401 kernel: drbd0: Found 6 transactions (324 active extents) in activity log.
Mar 22 19:04:55 bwinf5401 kernel: drbd0: drbdsetup [1850]: cstate Unconfigured --> StandAlone
Mar 22 19:04:55 bwinf5401 kernel: drbd0: drbdsetup [1853]: cstate StandAlone --> Unconnected
Mar 22 19:04:55 bwinf5401 kernel: drbd0: drbd0_receiver [1854]: cstate Unconnected --> WFConnection
Mar 22 19:06:29 bwinf5401 kernel: bcm5700: eth1 NIC Link is Down
Mar 22 19:06:40 bwinf5401 kernel: bcm5700: eth1 NIC Link is Up, 1000 Mbps full duplex
Mar 22 19:07:55 bwinf5401 kernel: bcm5700: eth1 NIC Link is Down
Mar 22 19:07:57 bwinf5401 kernel: bcm5700: eth1 NIC Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON
Mar 22 19:09:23 bwinf5401 kernel: drbd0: drbd0_receiver [1854]: cstate WFConnection --> WFReportParams
Mar 22 19:09:23 bwinf5401 kernel: drbd0: Handshake successful: DRBD Network Protocol version 74
Mar 22 19:09:23 bwinf5401 kernel: drbd0: Connection established.
Mar 22 19:09:23 bwinf5401 kernel: drbd0: I am(S): 1:00000002:00000001:0000000e:00000005:00
Mar 22 19:09:23 bwinf5401 kernel: drbd0: Peer(S): 1:00000002:00000001:0000000e:00000005:00
Mar 22 19:09:23 bwinf5401 kernel: drbd0: drbd0_receiver [1854]: cstate WFReportParams --> Connected
Mar 22 19:09:23 bwinf5401 kernel: drbd0: Secondary/Unknown --> Secondary/Secondary
Mar 22 19:09:53 bwinf5401 kernel: drbd0: drbdsetup [2129]: cstate Connected --> Unconnected
Mar 22 19:09:53 bwinf5401 kernel: drbd0: drbd0_receiver [1854]: cstate Unconnected --> BrokenPipe
Mar 22 19:09:53 bwinf5401 kernel: drbd0: short read expecting header on sock: r=-512
Mar 22 19:09:53 bwinf5401 kernel: drbd0: worker terminated
Mar 22 19:09:53 bwinf5401 kernel: drbd0: asender terminated
Mar 22 19:09:53 bwinf5401 kernel: drbd0: drbd0_receiver [1854]: cstate BrokenPipe --> StandAlone
Mar 22 19:09:53 bwinf5401 kernel: drbd0: Connection lost.
Mar 22 19:09:53 bwinf5401 kernel: drbd0: receiver terminated
Mar 22 19:09:53 bwinf5401 kernel: drbd0: drbdsetup [2129]: cstate StandAlone --> StandAlone
Mar 22 19:09:53 bwinf5401 kernel: drbd0: drbdsetup [2129]: cstate StandAlone --> Unconfigured
Mar 22 19:09:53 bwinf5401 kernel: drbd0: worker terminated
Mar 22 19:10:05 bwinf5401 kernel: drbd0: resync bitmap: bits=18578276 words=580572
Mar 22 19:10:05 bwinf5401 kernel: drbd0: size = 70 GB (74313104 KB)
Mar 22 19:10:06 bwinf5401 kernel: drbd0: 0 KB marked out-of-sync by on disk bit-map.
Mar 22 19:10:06 bwinf5401 kernel: drbd0: Found 6 transactions (324 active extents) in activity log.
Mar 22 19:10:06 bwinf5401 kernel: drbd0: drbdsetup [2194]: cstate Unconfigured --> StandAlone
Mar 22 19:10:06 bwinf5401 kernel: drbd0: drbdsetup [2197]: cstate StandAlone --> Unconnected
Mar 22 19:10:06 bwinf5401 kernel: drbd0: drbd0_receiver [2198]: cstate Unconnected --> WFConnection
Mar 22 19:10:06 bwinf5401 kernel: drbd0: drbd0_receiver [2198]: cstate WFConnection --> WFReportParams
Mar 22 19:10:06 bwinf5401 kernel: drbd0: Handshake successful: DRBD Network Protocol version 74
Mar 22 19:10:06 bwinf5401 kernel: drbd0: Connection established.
Mar 22 19:10:06 bwinf5401 kernel: drbd0: I am(S): 1:00000002:00000001:0000000e:00000005:01
Mar 22 19:10:06 bwinf5401 kernel: drbd0: Peer(S): 1:00000002:00000001:0000000e:00000005:01
Mar 22 19:10:06 bwinf5401 kernel: drbd0: drbd0_receiver [2198]: cstate WFReportParams --> Connected
Mar 22 19:10:06 bwinf5401 kernel: drbd0: Secondary/Unknown --> Secondary/Secondary
Mar 22 19:10:31 bwinf5401 kernel: drbd0: Secondary/Secondary --> Primary/Secondary
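
To make the connection-state history in the log above easier to scan, here is a minimal sketch that pulls out only the "cstate A --> B" transitions. It is not part of DRBD: the function name cstate_transitions and the regular expression are assumptions based purely on the message format visible in this log, and other drbd versions may print something different.

```python
import re

# Matches e.g. "drbd0: drbdsetup [1850]: cstate Unconfigured --> StandAlone".
# The pattern is inferred from the log lines in this mail only.
CSTATE_RE = re.compile(r"(drbd\d+): (\S+) \[(\d+)\]: cstate (\w+) --> (\w+)")


def cstate_transitions(log_text):
    """Return (device, process, pid, old_state, new_state) tuples in log order."""
    # Collapse all whitespace first, so entries that the mail client wrapped
    # onto a second line are matched as well.
    flat = " ".join(log_text.split())
    return [
        (dev, proc, int(pid), old, new)
        for dev, proc, pid, old, new in CSTATE_RE.findall(flat)
    ]


if __name__ == "__main__":
    sample = """Mar 22 19:04:55 bwinf5401 kernel: drbd0: drbdsetup [1850]: cstate
Unconfigured --> StandAlone
Mar 22 19:04:55 bwinf5401 kernel: drbd0: drbd0_receiver [1854]: cstate
Unconnected --> WFConnection"""
    for transition in cstate_transitions(sample):
        print(transition)
```

Run over the whole log, this should show which process (drbdsetup or drbd0_receiver) drove each state change, which may help when checking that the teardown at 19:09:53 was requested by drbdsetup rather than caused by the link flaps on eth1.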


> any reaction to ping, keyboard, numlock-led-toggle, sysrq keys?
> nmi-watchdog enabled? does it trigger?
> are you comfortable to recompile the kernel with the kernel debugger enabled,
> and go into the kernel debugger when it "crashes" again, and see what it
> thinks it does, where it thinks it is stuck?

No problem recompiling the kernel with the debugger enabled.
This will be my first time using it, but I'm looking forward to learning...

LD

