Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
I have experienced a failure for DRBD to reconnect after a system forced power off/on due to a hung host. DRBD 0.7.21 Hearbeat 2.1.2 DRBD log output Sep 20 17:58:06 sinfids3a1 drbd: Stopping all DRBD resources Sep 20 17:58:06 sinfids3a1 drbd: ioctl(,SET_STATE,) failed: The host - on it's way down. At this point the filesystem was still active as Heartbeat didn't stop successfully. Is there a way for drbd to handle this? Sep 20 17:58:06 sinfids3a1 drbd: Device or resource busy Sep 20 17:58:06 sinfids3a1 drbd: Someone has opened the device for RW access! Sep 20 17:58:06 sinfids3a1 drbd: Command '/sbin/drbdsetup /dev/drbd0 down' terminated with exit code 20 Sep 20 17:58:06 sinfids3a1 drbd: drbdsetup exited with code 20 Sep 20 17:58:06 sinfids3a1 drbd: ERROR: Module drbd is in use Sep 20 17:58:06 sinfids3a1 drbd: . Sep 20 17:58:06 sinfids3a1 rc: Stopping drbd: succeeded The host - on it's way up. Sep 21 10:11:49 sinfids3a1 drbd: [ Sep 21 10:11:49 sinfids3a1 drbd: d0 Sep 21 10:11:49 sinfids3a1 kernel: drbd0: Adjusting my ra_pages to backing device's (32 -> 1024) Sep 21 10:11:49 sinfids3a1 kernel: klogd 1.4.1, ---------- state change ---------- Sep 21 10:11:49 sinfids3a1 kernel: drbd0: resync bitmap: bits=43350764 words=677356 Sep 21 10:11:49 sinfids3a1 kernel: drbd0: size = 165 GB (173403056 KB) Sep 21 10:11:49 sinfids3a1 kernel: mptbase: ioc0: IOCStatus(0x0002): Busy Sep 21 10:11:49 sinfids3a1 kernel: mptbase: ioc1: IOCStatus(0x0002): Busy Sep 21 10:11:49 sinfids3a1 kernel: mptbase: ioc0: IOCStatus(0x0002): Busy Sep 21 10:11:49 sinfids3a1 kernel: mptbase: ioc1: IOCStatus(0x0002): Busy Sep 21 10:11:51 sinfids3a1 kernel: drbd0: 180 MB marked out-of-sync by on disk bit-map. Sep 21 10:11:51 sinfids3a1 kernel: drbd0: Found 6 transactions (324 active extents) in activity log. Sep 21 10:11:51 sinfids3a1 kernel: drbd0: Marked additional 983 MB as out-of-sync based on AL. Sep 21 10:11:51 sinfids3a1 kernel: drbd0: drbdsetup [5217]: cstate Unconfigured --> StandAlone Sep 21 10:11:51 sinfids3a1 drbd: s0 Sep 21 10:11:51 sinfids3a1 drbd: n0 Sep 21 10:11:51 sinfids3a1 kernel: drbd0: drbdsetup [5318]: cstate StandAlone --> Unconnected Sep 21 10:11:51 sinfids3a1 kernel: drbd0: drbd0_receiver [5319]: cstate Unconnected --> WFConnection Sep 21 10:11:51 sinfids3a1 drbd: ] Sep 21 10:11:51 sinfids3a1 drbd: . Sep 21 10:11:51 sinfids3a1 drbd: WARN: stdin/stdout is not a TTY; using /dev/console Sep 21 10:11:51 sinfids3a1 kernel: drbd0: using degr_wfc_timeout=10 seconds Sep 21 10:11:53 sinfids3a1 kernel: drbd0: drbd0_receiver [5319]: cstate WFConnection --> WFReportParams Sep 21 10:11:53 sinfids3a1 kernel: drbd0: Handshake successful: DRBD Network Protocol version 74 Sep 21 10:11:53 sinfids3a1 kernel: drbd0: Connection established. Sep 21 10:11:53 sinfids3a1 kernel: drbd0: I am(S): 1:00000002:00000001:000000c5:00000033:10 Sep 21 10:11:53 sinfids3a1 kernel: drbd0: Peer(P): 1:00000002:00000001:000000c3:00000034:10 Sep 21 10:11:53 sinfids3a1 kernel: drbd0: drbd0_receiver [5319]: cstate WFReportParams --> WFBitMapS Sep 21 10:11:53 sinfids3a1 kernel: drbd0: sock_sendmsg returned -32 Sep 21 10:11:53 sinfids3a1 kernel: drbd0: drbd0_receiver [5319]: cstate WFBitMapS --> BrokenPipe Sep 21 10:11:53 sinfids3a1 kernel: drbd0: short sent ReportBitMap size=4096 sent=0 Sep 21 10:11:53 sinfids3a1 kernel: drbd0: Secondary/Unknown --> Secondary/Primary It has detected it's partner but then breaks the connection. This has resulted in serious data loss. Is there a way to prevent this? Sep 21 10:11:53 sinfids3a1 kernel: drbd0: meta connection shut down by peer. Sep 21 10:11:53 sinfids3a1 kernel: drbd0: asender terminated Sep 21 10:11:53 sinfids3a1 kernel: drbd0: sock was shut down by peer Sep 21 10:11:53 sinfids3a1 kernel: drbd0: drbd0_receiver [5319]: cstate BrokenPipe --> BrokenPipe Sep 21 10:11:53 sinfids3a1 kernel: drbd0: short read expecting header on sock: r=0 Sep 21 10:11:53 sinfids3a1 kernel: drbd0: worker terminated Sep 21 10:11:53 sinfids3a1 kernel: drbd0: drbd0_receiver [5319]: cstate BrokenPipe --> Unconnected Sep 21 10:11:53 sinfids3a1 kernel: drbd0: Connection lost. Sep 21 10:11:53 sinfids3a1 kernel: drbd0: drbd0_receiver [5319]: cstate Unconnected --> WFConnection Sep 21 10:12:01 sinfids3a1 rc: Starting drbd: succeeded The scenario: sinfids3a1 running as master host sinfids3a1 hung reboot sinfids3a1 sinfids3a2 running as master updated filesystem (nagios config) powered off/on sinfids3a1 [.....] failed system over to sinfids3a1 updates to the filesystem were LOST !!! Sep 20 17:56:50 sinfids3a1 rc: Stopping cmanic: succeeded Sep 20 17:56:50 sinfids3a1 heartbeat: Stopping High-Availability services: Sep 20 17:56:50 sinfids3a1 heartbeat: [5398]: info: killing /usr/lib64/heartbeat/mgmtd -v process group 5533 with signal 15 Sep 20 17:56:50 sinfids3a1 mgmtd: [5533]: info: mgmtd is shutting down Sep 20 17:56:50 sinfids3a1 heartbeat: [5398]: info: killing /usr/lib64/heartbeat/crmd process group 5532 with signal 15 Sep 20 17:56:50 sinfids3a1 crmd: [5532]: info: crm_shutdown: Requesting shutdown Sep 20 17:56:50 sinfids3a1 crmd: [5532]: info: do_shutdown_req: Sending shutdown request to DC: <null> Sep 20 17:56:50 sinfids3a1 crmd: [5532]: info: do_shutdown_req: Processing shutdown locally Sep 20 17:56:50 sinfids3a1 crmd: [5532]: ERROR: send_ha_message: Not connected to Heartbeat Sep 20 17:56:50 sinfids3a1 crmd: [5532]: WARN: crm_log_message_adv: #========= HA[outbound] message start ==========# Sep 20 17:56:50 sinfids3a1 crmd: [5532]: WARN: MSG: Dumping message with 10 fields Sep 20 17:56:50 sinfids3a1 crmd: [5532]: WARN: MSG[0] : [origin=do_election_vote] Sep 20 17:56:50 sinfids3a1 crmd: [5532]: WARN: MSG[1] : [t=crmd] Sep 20 17:56:50 sinfids3a1 crmd: [5532]: WARN: MSG[2] : [version=1.0.9] Sep 20 17:56:50 sinfids3a1 crmd: [5532]: WARN: MSG[3] : [subt=request] Sep 20 17:56:50 sinfids3a1 crmd: [5532]: WARN: MSG[4] : [reference=vote-crmd-1190282210-138] Sep 20 17:56:50 sinfids3a1 crmd: [5532]: WARN: MSG[5] : [crm_task=vote] Sep 20 17:56:50 sinfids3a1 crmd: [5532]: WARN: MSG[6] : [crm_sys_to=crmd] Sep 20 17:56:50 sinfids3a1 crmd: [5532]: WARN: MSG[7] : [crm_sys_from=crmd] Sep 20 17:56:50 sinfids3a1 crmd: [5532]: WARN: MSG[8] : [election-owner=3e940b7a-0e41-4df4-9279-71205d5038e8] Sep 20 17:56:50 sinfids3a1 crmd: [5532]: WARN: MSG[9] : [election-id=15] Sep 20 17:56:50 sinfids3a1 crmd: [5532]: WARN: send_msg_via_ha: Sending broadcast HA message (ref=vote-crmd-1190282210-138) to crmd@<all> failed. Sep 20 17:56:50 sinfids3a1 crmd: [5532]: info: handle_shutdown_request: Creating shutdown request for sinfids3a1 Sep 20 17:56:50 sinfids3a1 tengine: [29019]: info: extract_event: Aborting on shutdown attribute for 3e940b7a-0e41-4df4-9279-71205d5038e8 Sep 20 17:57:26 sinfids3a1 sinfids3a_apache[6867]: [6959]: INFO: 17:57:26 URL:http://0.0.0.0:80/server-status [2765/2765] -> "-" [1] Sep 20 17:57:27 sinfids3a1 lrmd: [5529]: WARN: G_SIG_dispatch: Dispatch function for SIGCHLD was delayed 1000 ms (> 100 ms) before being called (GSource: 0x51bdc8) Sep 20 17:57:27 sinfids3a1 lrmd: [5529]: info: G_SIG_dispatch: started at 455233814 should have started at 455233714 Sep 20 17:57:31 sinfids3a1 tengine: [29019]: info: process_graph_event: Action resource_sinfids3a_fids_proxy_monitor_360000 initiated by a different transitioner Sep 20 17:57:31 sinfids3a1 tengine: [29019]: WARN: update_failcount: Updating failcount for resource_sinfids3a_fids_proxy on 8cc209ea-9be8-48a7-a1a1-3fdc7b7a64d4 after failed monitor: rc=7 Sep 20 17:57:32 sinfids3a1 tengine: [29019]: info: extract_event: Aborting on transient_attributes changes for 8cc209ea-9be8-48a7-a1a1-3fdc7b7a64d4 Sep 20 17:57:32 sinfids3a1 tengine: [29019]: info: te_update_diff: Aborting on transient_attributes deletions Sep 20 17:57:33 sinfids3a1 tengine: [29019]: info: extract_event: Aborting on transient_attributes changes for 8cc209ea-9be8-48a7-a1a1-3fdc7b7a64d4 Sep 20 17:57:33 sinfids3a1 tengine: [29019]: info: te_update_diff: Aborting on transient_attributes deletions Sep 20 17:57:50 sinfids3a1 rc: Stopping heartbeat: failed Sep 20 17:58:06 sinfids3a1 drbd: Stopping all DRBD resources Sep 20 17:58:06 sinfids3a1 drbd: ioctl(,SET_STATE,) failed: Sep 20 17:58:06 sinfids3a1 drbd: Device or resource busy Sep 20 17:58:06 sinfids3a1 drbd: Someone has opened the device for RW access! Sep 20 17:58:06 sinfids3a1 drbd: Command '/sbin/drbdsetup /dev/drbd0 down' terminated with exit code 20 Sep 20 17:58:06 sinfids3a1 drbd: drbdsetup exited with code 20 Sep 20 17:58:06 sinfids3a1 drbd: ERROR: Module drbd is in use Sep 20 17:58:06 sinfids3a1 drbd: . Sep 20 17:58:06 sinfids3a1 rc: Stopping drbd: succeeded Sep 21 10:11:49 sinfids3a1 drbd: [ Sep 21 10:11:49 sinfids3a1 drbd: d0 Sep 21 10:11:49 sinfids3a1 kernel: drbd0: Adjusting my ra_pages to backing device's (32 -> 1024) Sep 21 10:11:49 sinfids3a1 kernel: klogd 1.4.1, ---------- state change ---------- Sep 21 10:11:49 sinfids3a1 kernel: drbd0: resync bitmap: bits=43350764 words=677356 Sep 21 10:11:49 sinfids3a1 kernel: drbd0: size = 165 GB (173403056 KB) Sep 21 10:11:49 sinfids3a1 kernel: mptbase: ioc0: IOCStatus(0x0002): Busy Sep 21 10:11:49 sinfids3a1 kernel: mptbase: ioc1: IOCStatus(0x0002): Busy Sep 21 10:11:49 sinfids3a1 kernel: mptbase: ioc0: IOCStatus(0x0002): Busy Sep 21 10:11:49 sinfids3a1 kernel: mptbase: ioc1: IOCStatus(0x0002): Busy Sep 21 10:11:51 sinfids3a1 kernel: drbd0: 180 MB marked out-of-sync by on disk bit-map. Sep 21 10:11:51 sinfids3a1 kernel: drbd0: Found 6 transactions (324 active extents) in activity log. Sep 21 10:11:51 sinfids3a1 kernel: drbd0: Marked additional 983 MB as out-of-sync based on AL. Sep 21 10:11:51 sinfids3a1 kernel: drbd0: drbdsetup [5217]: cstate Unconfigured --> StandAlone Sep 21 10:11:51 sinfids3a1 drbd: s0 Sep 21 10:11:51 sinfids3a1 drbd: n0 Sep 21 10:11:51 sinfids3a1 kernel: drbd0: drbdsetup [5318]: cstate StandAlone --> Unconnected Sep 21 10:11:51 sinfids3a1 kernel: drbd0: drbd0_receiver [5319]: cstate Unconnected --> WFConnection Sep 21 10:11:51 sinfids3a1 drbd: ] Sep 21 10:11:51 sinfids3a1 drbd: . Sep 21 10:11:51 sinfids3a1 drbd: WARN: stdin/stdout is not a TTY; using /dev/console Sep 21 10:11:51 sinfids3a1 kernel: drbd0: using degr_wfc_timeout=10 seconds Sep 21 10:11:53 sinfids3a1 kernel: drbd0: drbd0_receiver [5319]: cstate WFConnection --> WFReportParams Sep 21 10:11:53 sinfids3a1 kernel: drbd0: Handshake successful: DRBD Network Protocol version 74 Sep 21 10:11:53 sinfids3a1 kernel: drbd0: Connection established. Sep 21 10:11:53 sinfids3a1 kernel: drbd0: I am(S): 1:00000002:00000001:000000c5:00000033:10 Sep 21 10:11:53 sinfids3a1 kernel: drbd0: Peer(P): 1:00000002:00000001:000000c3:00000034:10 Sep 21 10:11:53 sinfids3a1 kernel: drbd0: drbd0_receiver [5319]: cstate WFReportParams --> WFBitMapS Sep 21 10:11:53 sinfids3a1 kernel: drbd0: sock_sendmsg returned -32 Sep 21 10:11:53 sinfids3a1 kernel: drbd0: drbd0_receiver [5319]: cstate WFBitMapS --> BrokenPipe Sep 21 10:11:53 sinfids3a1 kernel: drbd0: short sent ReportBitMap size=4096 sent=0 Sep 21 10:11:53 sinfids3a1 kernel: drbd0: Secondary/Unknown --> Secondary/Primary Sep 21 10:11:53 sinfids3a1 kernel: drbd0: meta connection shut down by peer. Sep 21 10:11:53 sinfids3a1 kernel: drbd0: asender terminated Sep 21 10:11:53 sinfids3a1 kernel: drbd0: sock was shut down by peer Sep 21 10:11:53 sinfids3a1 kernel: drbd0: drbd0_receiver [5319]: cstate BrokenPipe --> BrokenPipe Sep 21 10:11:53 sinfids3a1 kernel: drbd0: short read expecting header on sock: r=0 Sep 21 10:11:53 sinfids3a1 kernel: drbd0: worker terminated Sep 21 10:11:53 sinfids3a1 kernel: drbd0: drbd0_receiver [5319]: cstate BrokenPipe --> Unconnected Sep 21 10:11:53 sinfids3a1 kernel: drbd0: Connection lost. Sep 21 10:11:53 sinfids3a1 kernel: drbd0: drbd0_receiver [5319]: cstate Unconnected --> WFConnection Sep 21 10:12:01 sinfids3a1 rc: Starting drbd: succeeded Sep 21 10:22:55 sinfids3a1 kernel: drbd0: Secondary/Unknown --> Primary/Unknown Sep 21 10:22:57 sinfids3a1 sinfids3a_Filesystem[7631]: [7643]: INFO: Running start for /dev/drbd0 on /export Sep 21 10:10:57 sinfids3a2 kernel: bcm5700: eth1 NIC Link is Down Sep 21 10:11:00 sinfids3a2 kernel: bcm5700: eth1 NIC Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON Sep 21 10:11:53 sinfids3a2 kernel: drbd0: drbd0_receiver [5318]: cstate WFConnection --> WFReportParams Sep 21 10:11:53 sinfids3a2 kernel: drbd0: Handshake successful: DRBD Network Protocol version 74 Sep 21 10:11:53 sinfids3a2 kernel: drbd0: Connection established. Sep 21 10:11:53 sinfids3a2 kernel: drbd0: I am(P): 1:00000002:00000001:000000c3:00000034:10 Sep 21 10:11:53 sinfids3a2 kernel: drbd0: Peer(S): 1:00000002:00000001:000000c5:00000033:10 Sep 21 10:11:53 sinfids3a2 kernel: drbd0: Current Primary shall become sync TARGET! Aborting to prevent data corruption. Sep 21 10:11:53 sinfids3a2 kernel: drbd0: drbd0_receiver [5318]: cstate WFReportParams --> StandAlone Sep 21 10:11:53 sinfids3a2 kernel: drbd0: error receiving ReportParams, l: 72! Sep 21 10:11:53 sinfids3a2 kernel: drbd0: asender terminated Sep 21 10:11:53 sinfids3a2 kernel: drbd0: worker terminated Sep 21 10:11:53 sinfids3a2 kernel: drbd0: drbd0_receiver [5318]: cstate StandAlone --> StandAlone Sep 21 10:11:53 sinfids3a2 kernel: drbd0: Connection lost. Sep 21 10:11:53 sinfids3a2 kernel: drbd0: receiver terminated -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.linbit.com/pipermail/drbd-user/attachments/20070924/3af355ed/attachment.htm>