Here is some more log output from node B (slave):

<CABLE-PULL>
Jan 14 10:47:51 ha2 kernel: e1000e: eth0 NIC Link is Down
Jan 14 10:47:57 ha2 kernel: block drbd0: PingAck did not arrive in time.
Jan 14 10:47:57 ha2 kernel: block drbd0: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Jan 14 10:47:57 ha2 kernel: block drbd0: asender terminated
Jan 14 10:47:57 ha2 kernel: block drbd0: Terminating asender thread
Jan 14 10:47:57 ha2 kernel: block drbd0: short read expecting header on sock: r=-512
Jan 14 10:47:57 ha2 kernel: block drbd0: Connection closed
Jan 14 10:47:57 ha2 kernel: block drbd0: conn( NetworkFailure -> Unconnected )
Jan 14 10:47:57 ha2 kernel: block drbd0: receiver terminated
Jan 14 10:47:57 ha2 kernel: block drbd0: Restarting receiver thread
Jan 14 10:47:57 ha2 kernel: block drbd0: receiver (re)started
Jan 14 10:47:57 ha2 kernel: block drbd0: conn( Unconnected -> WFConnection )
Jan 14 10:48:02 ha2 ntpd[2576]: synchronized to LOCAL(0), stratum 10
Jan 14 10:48:20 ha2 heartbeat: [2729]: WARN: node 10.0.1.11: is dead
Jan 14 10:48:20 ha2 heartbeat: [2729]: WARN: node ha1: is dead
Jan 14 10:48:20 ha2 heartbeat: [2729]: WARN: No STONITH device configured.
Jan 14 10:48:20 ha2 heartbeat: [2729]: WARN: Shared disks are not protected.
Jan 14 10:48:20 ha2 ipfail: [2826]: info: Status update: Node 10.0.1.11 now has status dead
Jan 14 10:48:20 ha2 heartbeat: [2729]: info: Resources being acquired from ha1.
Jan 14 10:48:20 ha2 heartbeat: [2729]: info: Link 10.0.1.11:10.0.1.11 dead.
Jan 14 10:48:20 ha2 heartbeat: [2729]: info: Link ha1:eth0 dead.
Jan 14 10:48:20 ha2 harc[3063]: [3078]: info: Running /etc/ha.d/rc.d/status status
Jan 14 10:48:20 ha2 heartbeat: [3064]: info: No local resources [/usr/share/heartbeat/ResourceManager listkeys ha2] to acquire.
Jan 14 10:48:20 ha2 heartbeat: [3064]: info: Writing type [resource] message to FIFO
Jan 14 10:48:20 ha2 heartbeat: [3064]: info: FIFO message [type resource] written rc=79
Jan 14 10:48:20 ha2 heartbeat: [2729]: info: AnnounceTakeover(local 1, foreign 1, reason 'T_RESOURCES(us)' (1))
Jan 14 10:48:20 ha2 heartbeat: [2729]: info: Managed req_our_resources process 3064 exited with return code 0.
Jan 14 10:48:20 ha2 heartbeat: [2729]: info: AnnounceTakeover(local 1, foreign 1, reason 'req_our_resources' (1))
Jan 14 10:48:20 ha2 heartbeat: [2729]: info: Managed status process 3063 exited with return code 0.
Jan 14 10:48:20 ha2 harc[3102]: [3108]: info: Running /etc/ha.d/rc.d/status status
Jan 14 10:48:20 ha2 mach_down[3114]: [3135]: info: Taking over resource group IPaddr::10.0.1.221
Jan 14 10:48:21 ha2 ResourceManager[3136]: [3147]: info: Acquiring resource group: ha1 IPaddr::10.0.1.221 drbddisk::drbd0 Filesystem::/dev/drbd0::/opt/trustdex::ext3::defaults SendStatusMail TrustDEX watchdog
Jan 14 10:48:21 ha2 IPaddr[3159]: [3190]: INFO: Resource is stopped
Jan 14 10:48:21 ha2 ResourceManager[3136]: [3204]: info: Running /etc/ha.d/resource.d/IPaddr 10.0.1.221 start
Jan 14 10:48:21 ha2 IPaddr[3223]: [3254]: INFO: Using calculated nic for 10.0.1.221: eth0
Jan 14 10:48:21 ha2 IPaddr[3223]: [3259]: INFO: Using calculated netmask for 10.0.1.221: 255.0.0.0
Jan 14 10:48:21 ha2 IPaddr[3223]: [3281]: INFO: eval ifconfig eth0:0 10.0.1.221 netmask 255.0.0.0 broadcast 10.255.255.255
Jan 14 10:48:21 ha2 IPaddr[3206]: [3300]: INFO: Success
Jan 14 10:48:21 ha2 ResourceManager[3136]: [3330]: info: Running /etc/ha.d/resource.d/drbddisk drbd0 start
Jan 14 10:48:21 ha2 kernel: block drbd0: role( Secondary -> Primary )
...
Jan 14 10:48:24 ha2 ipfail: [2826]: info: We are dead. :<
Jan 14 10:48:24 ha2 ipfail: [2826]: info: Asking other side for ping node count.
Jan 14 10:48:24 ha2 ipfail: [2826]: info: Link Status update: Link ha1/eth0 now has status dead
...

<CABLE-IN>
Jan 14 10:50:07 ha2 kernel: e1000e: eth0 NIC Link is Up 100 Mbps Full Duplex, Flow Control: RX/TX
...
Jan 14 10:50:08 ha2 kernel: block drbd0: Split-Brain detected, 2 primaries, automatically solved. Sync from this node
Jan 14 10:50:08 ha2 kernel: block drbd0: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> UpToDate )
Jan 14 10:50:08 ha2 heartbeat: [2729]: CRIT: Cluster node ha1 returning after partition.
Jan 14 10:50:08 ha2 heartbeat: [2729]: info: For information on cluster partitions, See URL: http://linux-ha.org/SplitBrain
Jan 14 10:50:08 ha2 heartbeat: [2729]: WARN: Deadtime value may be too small.
Jan 14 10:50:08 ha2 heartbeat: [2729]: info: See FAQ for information on tuning deadtime.
Jan 14 10:50:08 ha2 heartbeat: [2729]: info: URL: http://linux-ha.org/FAQ#heavy_load
Jan 14 10:50:08 ha2 heartbeat: [2729]: info: Link ha1:eth0 up.
Jan 14 10:50:08 ha2 heartbeat: [2729]: WARN: Late heartbeat: Node ha1: interval 138000 ms
Jan 14 10:50:08 ha2 heartbeat: [2729]: info: Status update for node ha1: status active
Jan 14 10:50:08 ha2 ipfail: [2826]: info: Link Status update: Link ha1/eth0 now has status up
Jan 14 10:50:08 ha2 ipfail: [2826]: info: Status update: Node ha1 now has status active
Jan 14 10:50:08 ha2 harc[4478]: [4484]: info: Running /etc/ha.d/rc.d/status status
Jan 14 10:50:08 ha2 heartbeat: [2729]: info: Link 10.0.1.11:10.0.1.11 up.
Jan 14 10:50:08 ha2 heartbeat: [2729]: WARN: Late heartbeat: Node 10.0.1.11: interval 138000 ms
Jan 14 10:50:08 ha2 heartbeat: [2729]: info: Status update for node 10.0.1.11: status ping
Jan 14 10:50:08 ha2 ipfail: [2826]: info: Link Status update: Link 10.0.1.11/10.0.1.11 now has status up
Jan 14 10:50:08 ha2 ipfail: [2826]: info: Status update: Node 10.0.1.11 now has status ping
Jan 14 10:50:08 ha2 ipfail: [2826]: info: A ping node just came up.
Jan 14 10:50:08 ha2 heartbeat: [2729]: info: Managed status process 4478 exited with return code 0.
Jan 14 10:50:10 ha2 ipfail: [2826]: info: Asking other side for ping node count.
Jan 14 10:50:10 ha2 heartbeat: [2729]: info: hb_giveup_resources(): current status: active
Jan 14 10:50:10 ha2 heartbeat: [2729]: info: Heartbeat shutdown in progress. (2729)
Jan 14 10:50:10 ha2 heartbeat: [4494]: info: Giving up all HA resources.
...
Jan 14 10:50:19 ha2 kernel: block drbd0: peer( Primary -> Unknown ) conn( WFBitMapS -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Jan 14 10:50:19 ha2 kernel: block drbd0: asender terminated
Jan 14 10:50:19 ha2 kernel: block drbd0: Terminating asender thread
Jan 14 10:50:19 ha2 kernel: block drbd0: short read expecting header on sock: r=-512
Jan 14 10:50:20 ha2 kernel: block drbd0: short sent ReportBitMap size=4096 sent=1596
Jan 14 10:50:20 ha2 kernel: block drbd0: Connection closed
Jan 14 10:50:20 ha2 kernel: block drbd0: conn( NetworkFailure -> Unconnected )
Jan 14 10:50:20 ha2 kernel: block drbd0: receiver terminated
Jan 14 10:50:20 ha2 kernel: block drbd0: Restarting receiver thread
Jan 14 10:50:20 ha2 kernel: block drbd0: receiver (re)started
Jan 14 10:50:20 ha2 kernel: block drbd0: conn( Unconnected -> WFConnection )
Jan 14 10:50:26 ha2 ResourceManager[4507]: [4772]: info: Running /etc/ha.d/resource.d/SendStatusMail stop
Jan 14 10:50:31 ha2 ResourceManager[4507]: [4825]: info: Running /etc/ha.d/resource.d/Filesystem /dev/drbd0 /opt/trustdex ext3 defaults stop
Jan 14 10:50:31 ha2 Filesystem[4838]: [4868]: INFO: Running stop for /dev/drbd0 on /opt/trustdex
Jan 14 10:50:31 ha2 Filesystem[4838]: [4878]: INFO: Trying to unmount /opt/trustdex
Jan 14 10:50:31 ha2 Filesystem[4838]: [4881]: INFO: unmounted /opt/trustdex successfully
Jan 14 10:50:31 ha2 Filesystem[4827]: [4887]: INFO: Success
Jan 14 10:50:31 ha2 ResourceManager[4507]: [4902]: info: Running /etc/ha.d/resource.d/drbddisk drbd0 stop
Jan 14 10:50:31 ha2 kernel: block drbd0: role( Primary -> Secondary )
Jan 14 10:50:31 ha2 ResourceManager[4507]: [4923]: info: Running /etc/ha.d/resource.d/IPaddr 10.0.1.221 stop
Jan 14 10:50:31 ha2 IPaddr[4942]: [4957]: INFO: ifconfig eth0:0 down
Jan 14 10:50:31 ha2 IPaddr[4925]: [4960]: INFO: Success
Jan 14 10:50:31 ha2 heartbeat: [4494]: info: All HA resources relinquished.
...
Jan 14 10:50:34 ha2 heartbeat: [2729]: info: Restarting heartbeat
...
<~2 MINUTES nothing interesting happens>
...
Jan 14 10:52:05 ha2 heartbeat: [5074]: WARN: node ha1: is dead
Jan 14 10:52:05 ha2 heartbeat: [5074]: info: Comm_now_up(): updating status to active
Jan 14 10:52:05 ha2 heartbeat: [5074]: info: Local status now set to: 'active'
Jan 14 10:52:05 ha2 heartbeat: [5074]: info: Starting child client "/usr/lib/heartbeat/ipfail" (498,496)
Jan 14 10:52:05 ha2 heartbeat: [5074]: WARN: No STONITH device configured.
Jan 14 10:52:05 ha2 heartbeat: [5074]: WARN: Shared disks are not protected.
Jan 14 10:52:05 ha2 heartbeat: [5074]: info: Resources being acquired from ha1.
Jan 14 10:52:05 ha2 heartbeat: [5292]: info: Starting "/usr/lib/heartbeat/ipfail" as uid 498 gid 496 (pid 5292)
Jan 14 10:52:05 ha2 heartbeat: [5294]: info: No local resources [/usr/share/heartbeat/ResourceManager listkeys ha2] to acquire.
Jan 14 10:52:05 ha2 heartbeat: [5294]: info: Writing type [resource] message to FIFO
Jan 14 10:52:05 ha2 heartbeat: [5294]: info: FIFO message [type resource] written rc=79
Jan 14 10:52:05 ha2 heartbeat: [5074]: info: AnnounceTakeover(local 0, foreign 1, reason 'T_RESOURCES' (0))
Jan 14 10:52:05 ha2 heartbeat: [5074]: info: AnnounceTakeover(local 1, foreign 1, reason 'T_RESOURCES(us)' (0))
Jan 14 10:52:05 ha2 heartbeat: [5074]: info: Initial resource acquisition complete (T_RESOURCES(us))
Jan 14 10:52:05 ha2 heartbeat: [5074]: info: STATE 1 => 3
Jan 14 10:52:05 ha2 heartbeat: [5074]: info: Managed req_our_resources process 5294 exited with return code 0.
Jan 14 10:52:05 ha2 heartbeat: [5074]: info: AnnounceTakeover(local 1, foreign 1, reason 'req_our_resources' (1))
Jan 14 10:52:05 ha2 harc[5293]: [5312]: info: Running /etc/ha.d/rc.d/status status
...
Jan 14 10:52:05 ha2 kernel: block drbd0: role( Secondary -> Primary )

--
View this message in context: http://old.nabble.com/failing-slave-node---WFConnection-PRIMARY-unkown-tp30670274p30670353.html
Sent from the DRBD - User mailing list archive at Nabble.com.