Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hello, Debian Lenny AMD64, drbd 8.3.7 (active/passiv, 4 TB disk space), heartbeat 2.1.3, Samba, Apache Nagios told me this morning that our cluster had some problems. At the time I looked at the primary it had a load of ~100 and I couldn't even stop the services or reboot properly. Not even kill -9 the processes. In Cacti I see that the CPU load was constantly increasing since 23:00 yesterday, at that time a script (durep) started which creates disk usage reports. There was nothing drbd related written to the logfiles on the primary, but dmesg output had some drbd lines. It seems that the primary already had problems with sshd several hours before the first drbd kernel message appears. The first drbd message on primay is [9043241.932135] block drbd0: sock was shut down by peer then after some more messages and a partial sync drbd decided that a full sync was necessary. [9043258.995403] block drbd0: Writing the whole bitmap, full sync required after drbd_sync_handshake. Any hints what happend (I guess network problems, but can't find any reason) and why a full sync was started? At the time this happend the data hasn't change much. Below are the dmesg and the log messages I have. TIA! Ralf primary [9027583.425291] INFO: task sshd:1839 blocked for more than 120 seconds. [9027583.425321] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [9027583.425366] sshd D 0000000000000000 0 1839 3139 0x00000004 [9027583.425400] ffff88043ad90710 0000000000000082 0000000000000000 ffff88042a83f810 [9027583.425452] 0000000000000000 000000000000f9e0 ffff88014f9effd8 0000000000015780 [9027583.425504] 0000000000015780 ffff88042a83f810 ffff88042a83fb08 000000005fda4000 [9027583.425555] Call Trace: [9027583.425582] [<ffffffff81152b56>] ? cap_inode_permission+0x0/0x3 [9027583.425613] [<ffffffff812fb049>] ? __mutex_lock_common+0x122/0x192 [9027583.425643] [<ffffffff812fb171>] ? mutex_lock+0x1a/0x31 [9027583.425672] [<ffffffff810f60e0>] ? do_lookup+0x80/0x15d [9027583.425699] [<ffffffff810f68de>] ? __link_path_walk+0x323/0x6f5 [9027583.425728] [<ffffffff810f6ede>] ? path_walk+0x66/0xc9 [9027583.425756] [<ffffffff810f8348>] ? do_path_lookup+0x20/0x77 [9027583.425784] [<ffffffff810f84d4>] ? do_filp_open+0xe5/0x94b [9027583.425813] [<ffffffff81101905>] ? alloc_fd+0x67/0x10c [9027583.425841] [<ffffffff810ecddf>] ? do_sys_open+0x55/0xfc [9027583.425869] [<ffffffff81010b42>] ? system_call_fastpath+0x16/0x1b .... serveral more messages about sshd snipped [9028541.580257] INFO: task sshd:1839 blocked for more than 120 seconds. [9028541.580287] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [9028541.580332] sshd D 0000000000000000 0 1839 3139 0x00000004 [9028541.580367] ffff88043ad90710 0000000000000082 0000000000000000 ffff88042a83f810 [9028541.580418] 0000000000000000 000000000000f9e0 ffff88014f9effd8 0000000000015780 [9028541.580470] 0000000000015780 ffff88042a83f810 ffff88042a83fb08 000000005fda4000 [9028541.580521] Call Trace: [9028541.580548] [<ffffffff81152b56>] ? cap_inode_permission+0x0/0x3 [9028541.580579] [<ffffffff812fb049>] ? __mutex_lock_common+0x122/0x192 [9028541.580608] [<ffffffff812fb171>] ? mutex_lock+0[<ffffffff810f84d4>] ? do_filp_open+0xe5/0x94b [9028661.349649] [<ffffffff81101905>] ? alloc_fd+0x67/0x10c [9028661.349677] [<ffffffff810ecddf>] ? do_sys_open+0x55/0xfc [9028661.349705] [<ffffffff81010b42>] ? system_call_fastpath+0x16/0x1b [9043241.932135] block drbd0: sock was shut down by peer [9043241.932166] block drbd0: peer( Secondary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown ) [9043241.932220] block drbd0: short read expecting header on sock: r=0 [9043241.932261] block drbd0: asender terminated [9043241.932287] block drbd0: Terminating drbd0_asender [9043242.187599] block drbd0: Creating new current UUID [9043242.234463] block drbd0: Connection closed [9043242.234490] block drbd0: conn( BrokenPipe -> Unconnected ) [9043242.234519] block drbd0: receiver terminated [9043242.234543] block drbd0: Restarting drbd0_receiver [9043242.234569] block drbd0: receiver (re)started [9043242.234595] block drbd0: conn( Unconnected -> WFConnection ) [9043242.379233] block drbd0: Handshake successful: Agreed network protocol version 91 [9043242.384214] block drbd0: conn( WFConnection -> WFReportParams ) [9043242.384249] block drbd0: Starting asender thread (from drbd0_receiver [3363]) [9043242.431143] block drbd0: data-integrity-alg: <not-used> [9043242.431178] block drbd0: drbd_sync_handshake: [9043242.431205] block drbd0: self B999F56D32D037D7:F18AA9D6D66ED05D:EAA757EA0F199216:3ADE912D60DA56CE bits:0 flags:0 [9043242.431255] block drbd0: peer F18AA9D6D66ED05C:0000000000000000:EAA757EA0F199217:3ADE912D60DA56CE bits:0 flags:0 [9043242.431305] block drbd0: uuid_compare()=1 by rule 70 [9043242.431333] block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> UpToDate ) [9043254.779842] block drbd0: conn( WFBitMapS -> SyncSource ) pdsk( UpToDate -> Inconsistent ) [9043254.779898] block drbd0: Began resync as SyncSource (will sync 0 KB [0 bits set]). [9043255.334230] block drbd0: Resync done (total 1 sec; paused 0 sec; 0 K/sec) [9043255.334263] block drbd0: conn( SyncSource -> Conterminated [9043256.751514] block drbd0: Terminating drbd0_asender [9043256.751581] block drbd0: Connection closed [9043256.751608] block drbd0: conn( BrokenPipe -> Unconnected ) [9043256.751636] block drbd0: receiver terminated [9043256.751661] block drbd0: Restarting drbd0_receiver [9043256.751686] block drbd0: receiver (re)started [9043256.751713] block drbd0: conn( Unconnected -> WFConnection ) [9043258.923317] block drbd0: Handshake successful: Agreed network protocol version 91 [9043258.923364] block drbd0: conn( WFConnection -> WFReportParams ) [9043258.923399] block drbd0: Starting asender thread (from drbd0_receiver [3363]) [9043258.995190] block drbd0: data-integrity-alg: <not-used> [9043258.995224] block drbd0: drbd_sync_handshake: [9043258.995250] block drbd0: self 7EBCD7B5EC664DBD:B999F56D32D037D7:DE1044FC36B6EE2E:F18AA9D6D66ED05D bits:0 flags:0 [9043258.995300] block drbd0: peer DE1044FC36B6EE2E:0000000000000000:EAA757EA0F199217:3ADE912D60DA56CE bits:0 flags:0 [9043258.995349] block drbd0: uuid_compare()=2 by rule 80 [9043258.995375] block drbd0: Becoming sync source due to disk states. [9043258.995403] block drbd0: Writing the whole bitmap, full sync required after drbd_sync_handshake. [9043260.646413] block drbd0: 4469 GB (1171599360 bits) marked out-of-sync by on disk bit-map. [9043260.656048] block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> Inconsistent ) [9043265.549199] block drbd0: conn( WFBitMapS -> SyncSource ) [9043265.549234] block drbd0: Began resync as SyncSource (will sync 4686397440 KB [1171599360 bits set]). secondary [10320197.564697] block drbd0: PingAck did not arrive in time. [10320197.564736] block drbd0: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown ) [10320197.564797] block drbd0: asender terminated [10320197.564807] block drbd0: short read expecting header on sock: r=-512 [10320197.564850] block drbd0: Terminating drbd0_asender [10320197.594067] block drbd0: Connection closed [10320197.594099] block drbd0: conn( NetworkFailure -> Unconnected ) [10320197.594134] block drbd0: receiver terminated [10320197.594161] block drbd0: Restarting drbd0_receiver [10320197.594189] block drbd0: receiver (re)started [10320197.594219] block drbd0: conn( Unconnected -> WFConnection ) [10320198.790359] block drbd0: Handshake successful: Agreed network protocol version 91 [10320198.790411] block drbd0: conn( WFConnection -> WFReportParams ) [10320198.790564] block drbd0: Starting asender thread (from drbd0_receiver [3820]) [10320198.872554] block drbd0: data-integrity-alg: <not-used> [10320198.872600] block drbd0: drbd_sync_handshake: [10320198.872630] block drbd0: self F18AA9D6D66ED05C:0000000000000000:EAA757EA0F199217:3ADE912D60DA56CE bits:0 flags:0 [10320198.872684] block drbd0: peer B999F56D32D037D7:F18AA9D6D66ED05D:EAA757EA0F199216:3ADE912D60DA56CE bits:0 flags:0 [10320198.872737] block drbd0: uuid_compare()=-1 by rule 50 [10320198.872778] block drbd0: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate ) [10320211.090254] block drbd0: conn( WFBitMapT -> WFSyncUUID ) [10320211.221243] block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0 [10320211.223272] block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0 exit code 0 (0x0) [10320211.223333] block drbd0: conn( WFSyncUUID -> SyncTarget ) disk( UpToDate -> Inconsistent ) [10320211.223388] block drbd0: Began resync as SyncTarget (will sync 0 KB [0 bits set]). [10320211.721368] block drbd0: PingAck did not arrive in time. [10320211.721403] block drbd0: peer( Primary -> Unknown ) conn( SyncTarget -> NetworkFailure ) pdsk( UpToDate -> DUnknown ) [10320211.721465] block drbd0: asender terminated [10320211.721491] block drbd0: Terminating drbd0_asender [10320211.737116] block drbd0: Connection closed [10320211.737149] block drbd0: conn( NetworkFailure -> Unconnected ) [10320211.737183] block drbd0: receiver terminated [10320211.737210] block drbd0: Restarting drbd0_receiver [10320211.737238] block drbd0: receiver (re)started [10320211.737267] block drbd0: conn( Unconnected -> WFConnection ) [10320215.334423] block drbd0: Handshake successful: Agreed network protocol version 91 [10320215.334475] block drbd0: conn( WFConnection -> WFReportParams ) [10320215.334628] block drbd0: Starting asender thread (from drbd0_receiver [3820]) [10320215.436581] block drbd0: data-integrity-alg: <not-used> [10320215.436630] block drbd0: drbd_sync_handshake: [10320215.436659] block drbd0: self DE1044FC36B6EE2E:0000000000000000:EAA757EA0F199217:3ADE912D60DA56CE bits:0 flags:0 [10320215.436714] block drbd0: peer 7EBCD7B5EC664DBD:B999F56D32D037D7:DE1044FC36B6EE2E:F18AA9D6D66ED05D bits:0 flags:0 [10320215.436763] block drbd0: uuid_compare()=-2 by rule 60 [10320215.436789] block drbd0: Becoming sync target due to disk states. [10320215.436816] block drbd0: Writing the whole bitmap, full sync required after drbd_sync_handshake. [10320217.023525] block drbd0: 4469 GB (1171599360 bits) marked out-of-sync by on disk bit-map. [10320217.055032] block drbd0: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate ) [10320221.972625] block drbd0: conn( WFBitMapT -> WFSyncUUID ) [10320221.990506] block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0 [10320221.992053] block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0 exit code 0 (0x0) [10320221.992102] block drbd0: conn( WFSyncUUID -> SyncTarget ) [10320221.992131] block drbd0: Began resync as SyncTarget (will sync 4686397440 KB [1171599360 bits set]). Jun 20 07:55:19 MySecondary kernel: [10320197.564736] block drbd0: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown ) Jun 20 07:55:19 MySecondary kernel: [10320197.564797] block drbd0: asender terminated Jun 20 07:55:19 MySecondary kernel: [10320197.564850] block drbd0: Terminating drbd0_asender Jun 20 07:55:19 MySecondary kernel: [10320197.594067] block drbd0: Connection closed Jun 20 07:55:19 MySecondary kernel: [10320197.594099] block drbd0: conn( NetworkFailure -> Unconnected ) Jun 20 07:55:19 MySecondary kernel: [10320197.594134] block drbd0: receiver terminated Jun 20 07:55:19 MySecondary kernel: [10320197.594161] block drbd0: Restarting drbd0_receiver Jun 20 07:55:19 MySecondary kernel: [10320197.594189] block drbd0: receiver (re)started Jun 20 07:55:19 MySecondary kernel: [10320197.594219] block drbd0: conn( Unconnected -> WFConnection ) Jun 20 07:55:21 MySecondary kernel: [10320198.790359] block drbd0: Handshake successful: Agreed network protocol version 91 Jun 20 07:55:21 MySecondary kernel: [10320198.790411] block drbd0: conn( WFConnection -> WFReportParams ) Jun 20 07:55:21 MySecondary kernel: [10320198.790564] block drbd0: Starting asender thread (from drbd0_receiver [3820]) Jun 20 07:55:21 MySecondary kernel: [10320198.872554] block drbd0: data-integrity-alg: <not-used> Jun 20 07:55:21 MySecondary kernel: [10320198.872600] block drbd0: drbd_sync_handshake: Jun 20 07:55:21 MySecondary kernel: [10320198.872630] block drbd0: self F18AA9D6D66ED05C:0000000000000000:EAA757EA0F199217:3ADE912D60DA56CE bits:0 flags:0 Jun 20 07:55:21 MySecondary kernel: [10320198.872684] block drbd0: peer B999F56D32D037D7:F18AA9D6D66ED05D:EAA757EA0F199216:3ADE912D60DA56CE bits:0 flags:0 Jun 20 07:55:21 MySecondary kernel: [10320198.872737] block drbd0: uuid_compare()=-1 by rule 50 Jun 20 07:55:21 MySecondary kernel: [10320198.872778] block drbd0: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate ) Jun 20 07:55:33 MySecondary kernel: [10320211.090254] block drbd0: conn( WFBitMapT -> WFSyncUUID ) Jun 20 07:55:33 MySecondary kernel: [10320211.221243] block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0 Jun 20 07:55:33 MySecondary kernel: [10320211.223272] block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0 exit code 0 (0x0) Jun 20 07:55:33 MySecondary kernel: [10320211.223333] block drbd0: conn( WFSyncUUID -> SyncTarget ) disk( UpToDate -> Inconsistent ) Jun 20 07:55:33 MySecondary kernel: [10320211.223388] block drbd0: Began resync as SyncTarget (will sync 0 KB [0 bits set]). Jun 20 07:55:33 MySecondary kernel: [10320211.721403] block drbd0: peer( Primary -> Unknown ) conn( SyncTarget -> NetworkFailure ) pdsk( UpToDate -> DUnknown ) Jun 20 07:55:33 MySecondary kernel: [10320211.721465] block drbd0: asender terminated Jun 20 07:55:33 MySecondary kernel: [10320211.721491] block drbd0: Terminating drbd0_asender Jun 20 07:55:33 MySecondary kernel: [10320211.737116] block drbd0: Connection closed Jun 20 07:55:33 MySecondary kernel: [10320211.737149] block drbd0: conn( NetworkFailure -> Unconnected ) Jun 20 07:55:33 MySecondary kernel: [10320211.737183] block drbd0: receiver terminated Jun 20 07:55:33 MySecondary kernel: [10320211.737210] block drbd0: Restarting drbd0_receiver Jun 20 07:55:33 MySecondary kernel: [10320211.737238] block drbd0: receiver (re)started Jun 20 07:55:33 MySecondary kernel: [10320211.737267] block drbd0: conn( Unconnected -> WFConnection ) Jun 20 07:55:37 MySecondary kernel: [10320215.334423] block drbd0: Handshake successful: Agreed network protocol version 91 Jun 20 07:55:37 MySecondary kernel: [10320215.334475] block drbd0: conn( WFConnection -> WFReportParams ) Jun 20 07:55:37 MySecondary kernel: [10320215.334628] block drbd0: Starting asender thread (from drbd0_receiver [3820]) Jun 20 07:55:37 MySecondary kernel: [10320215.436581] block drbd0: data-integrity-alg: <not-used> Jun 20 07:55:37 MySecondary kernel: [10320215.436630] block drbd0: drbd_sync_handshake: Jun 20 07:55:37 MySecondary kernel: [10320215.436659] block drbd0: self DE1044FC36B6EE2E:0000000000000000:EAA757EA0F199217:3ADE912D60DA56CE bits:0 flags:0 Jun 20 07:55:37 MySecondary kernel: [10320215.436714] block drbd0: peer 7EBCD7B5EC664DBD:B999F56D32D037D7:DE1044FC36B6EE2E:F18AA9D6D66ED05D bits:0 flags:0 Jun 20 07:55:37 MySecondary kernel: [10320215.436763] block drbd0: uuid_compare()=-2 by rule 60 Jun 20 07:55:37 MySecondary kernel: [10320215.436789] block drbd0: Becoming sync target due to disk states. Jun 20 07:55:37 MySecondary kernel: [10320215.436816] block drbd0: Writing the whole bitmap, full sync required after drbd_sync_handshake. Jun 20 07:55:39 MySecondary kernel: [10320217.023525] block drbd0: 4469 GB (1171599360 bits) marked out-of-sync by on disk bit-map. Jun 20 07:55:39 MySecondary kernel: [10320217.055032] block drbd0: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate ) Jun 20 07:55:44 MySecondary kernel: [10320221.972625] block drbd0: conn( WFBitMapT -> WFSyncUUID ) Jun 20 07:55:44 MySecondary kernel: [10320221.990506] block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0 Jun 20 07:55:44 MySecondary kernel: [10320221.992053] block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0 exit code 0 (0x0) Jun 20 07:55:44 MySecondary kernel: [10320221.992102] block drbd0: conn( WFSyncUUID -> SyncTarget ) Jun 20 07:55:44 MySecondary kernel: [10320221.992131] block drbd0: Began resync as SyncTarget (will sync 4686397440 KB [1171599360 bits set]).