[DRBD-user] Re: We had a problem with drdb 05/12/2007.-

Lars Ellenberg lars.ellenberg at linbit.com
Wed Dec 5 23:35:53 CET 2007

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


On Wed, Dec 05, 2007 at 03:09:34PM -0400, Marco Antonio Diaz-Valdes Verdugo wrote:
> Dear sirs: 
>
> We have a Linux production cluster. The software involved is  Slackware 
> 10.1.0 with kernel 2.4.29 operation system, 0.7.15 drbd version and version 
> 1.2.3.cvs.20050927 Heartbeat. In the last days, we had a drdb strange 
> behavior. 

Dear Marco,

With drbd 0.7 (even with the most recent versions),
the decision about which node has the better data
can easily get confused when independent events happen on the
servers, e.g. several connection loss events in a row,
a global cluster crash, several reboots while the replication network
is broken, admins making both nodes primary independently in any order
in an attempt to "fix" things, and so on.
This usually means that you end up with diverging data sets on both nodes.

You are better off using drbd 8, which overcomes this problem:
it reliably detects such situations, and then refuses
to do any sync until you decide what went wrong and choose one side
to be the "bad" one, discarding the changes that happened there
while the "good" one did its independent modifications.

You'd need to upgrade your kernel to Linux 2.6 to be able to
use drbd 8, so you'd probably need to upgrade the whole system.

> During the re-synchronization process between Primary server (known as 
> Alcamari) and Secondary server (known as Apu), both components of cluster  
> (known as Recua), there was a data exchange in a wrong way (secondary to 
> primary instead of primary to secondary). The result was a loss of 
> information (files, data bases and directories)

and you chose not to mention what happened before that resync.
for a resync to happen, there has to have been a failure and recovery first.
and I am sure yours was not a single failure, but a multiple failure
scenario, where the servers probably did not see each other for an
extended period of time, and probably one was active without seeing the
other, then the other was active without seeing the first.

also, you chose "primary" and "secondary" to name your servers.
we usually use these terms to describe the
 _current role_ of drbd on any server.

> This was the first time that happened to us.
> 
> Are there evidences of similar situations in this drdb software version?

yes and no, see explanations above; there are multiple error scenarios
where drbd 0.7 might end up "confused".

but according to your logs (thanks for their completeness, and for having
the timestamps in sync!) your problem is not caused by drbd,
but by operator error and heartbeat misconfiguration.
let me explain:

> =========================================================================================
> Nov  8 23:34:20 alcamari heartbeat[788]: info: Heartbeat shutdown in progress. (788)
> Nov  8 23:34:20 alcamari heartbeat[15633]: info: Giving up all HA resources.
> Nov  8 23:34:20 alcamari heartbeat: info: Releasing resource group: alcamari 192.168.53.44 drbddisk Filesystem::/dev/drbd0::/home::reiserfs mysql gateway
> Nov  8 23:34:20 alcamari heartbeat: info: Running /usr/local/etc/ha.d/resource.d/gateway  stop
> Nov  8 23:34:20 alcamari heartbeat: info: Running /usr/local/etc/ha.d/resource.d/mysql  stop
> Nov  8 23:34:23 alcamari heartbeat: info: Running /usr/local/etc/ha.d/resource.d/Filesystem /dev/drbd0 /home reiserfs stop
> Nov  8 23:34:29 alcamari heartbeat: info: Running /usr/local/etc/ha.d/resource.d/drbddisk  stop
> Nov  8 23:34:29 alcamari kernel: drbd0: Primary/Unknown --> Secondary/Unknown

you had lost your replication link long before that.
see: the other node is marked as "Unknown" here,
so this is already a degraded cluster.
what happened before that?
when did the nodes lose the replication link?
and why?

> Nov  8 23:34:32 alcamari heartbeat: info: Running /usr/local/etc/ha.d/resource.d/IPaddr 192.168.53.44 stop
> Nov  8 23:34:32 alcamari heartbeat: info: /sbin/route -n del -host 192.168.53.44
> Nov  8 23:34:32 alcamari heartbeat: info: /sbin/ifconfig eth0:0 down
> Nov  8 23:34:32 alcamari heartbeat: info: IP Address 192.168.53.44 released
> Nov  8 23:34:32 alcamari heartbeat[15633]: info: All HA resources relinquished.
> Nov  8 23:34:33 alcamari heartbeat[788]: info: killing HBFIFO process 798 with signal 15
> Nov  8 23:34:33 alcamari heartbeat[788]: info: killing HBWRITE process 799 with signal 15
> Nov  8 23:34:33 alcamari heartbeat[788]: info: killing HBREAD process 800 with signal 15
> Nov  8 23:34:33 alcamari heartbeat[788]: info: killing HBWRITE process 801 with signal 15
> Nov  8 23:34:33 alcamari heartbeat[788]: info: killing HBREAD process 802 with signal 15
> Nov  8 23:34:33 alcamari heartbeat[788]: info: Core process 802 exited. 5 remaining
> Nov  8 23:34:33 alcamari heartbeat[788]: info: Core process 801 exited. 4 remaining
> Nov  8 23:34:33 alcamari heartbeat[788]: info: Core process 800 exited. 3 remaining
> Nov  8 23:34:33 alcamari heartbeat[788]: info: Core process 798 exited. 2 remaining
> Nov  8 23:34:33 alcamari heartbeat[788]: info: Core process 799 exited. 1 remaining
> Nov  8 23:34:33 alcamari heartbeat[788]: info: Heartbeat shutdown complete.
> Nov  8 23:34:52 alcamari kernel: drbd0: drbdsetup [15823]: cstate StandAlone --> Unconnected
> Nov  8 23:34:52 alcamari kernel: drbd0: drbdsetup [15823]: cstate Unconnected --> StandAlone
> Nov  8 23:34:52 alcamari kernel: drbd0: drbdsetup [15823]: cstate StandAlone --> Unconfigured
> Nov  8 23:34:52 alcamari kernel: drbd0: worker terminated
> Nov  8 23:34:52 alcamari kernel: drbd: module cleanup done.
> Nov  8 23:34:53 alcamari sshd[15818]: Accepted password for apptiaxa from 192.168.53.41 port 35634 ssh2

so you are now trying to restart drbd completely in an attempt to "fix"
something.

knowing that alcamari had been primary before,
and assuming your resource name is "r0", you should now have said
	drbdadm attach r0
	drbdadm primary r0

and only then have said
	drbdadm syncer r0
	drbdadm connect r0
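
(a quick sanity check in between does not hurt either, e.g. something like

	cat /proc/drbd    # expect the local role to be Primary, peer Unknown (not connected yet)

before issuing the connect.)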

in which case you probably would have ended up with logs saying
"current Primary shall become sync target,
aborting to prevent data corruption".

but not being aware of that, you did not,
so you get this:

> Nov  8 23:35:01 alcamari kernel: drbd: initialised. Version: 0.7.15 (api:77/proto:74)
> Nov  8 23:35:01 alcamari kernel: drbd: SVN Revision: 2020 build by root at alcamari.movistar.com.pe, 2006-02-06 15:53:10
> Nov  8 23:35:01 alcamari kernel: drbd: registered as block device major 147
> Nov  8 23:35:01 alcamari kernel: drbd0: resync bitmap: bits=66765502 words=2086422
> Nov  8 23:35:01 alcamari kernel: drbd0: size = 254 GB (267062008 KB)
> Nov  8 23:35:03 alcamari kernel: drbd0: 8410 MB marked out-of-sync by on disk bit-map.
> Nov  8 23:35:03 alcamari kernel: drbd0: Found 6 transactions (324 active extents) in activity log.
> Nov  8 23:35:03 alcamari kernel: drbd0: drbdsetup [15857]: cstate Unconfigured --> StandAlone
> Nov  8 23:35:03 alcamari kernel: drbd0: drbdsetup [15870]: cstate StandAlone --> Unconnected
> Nov  8 23:35:03 alcamari kernel: drbd0: drbd0_receiver [15871]: cstate Unconnected --> WFConnection
> Nov  8 23:35:07 alcamari kernel: drbd0: drbd0_receiver [15871]: cstate WFConnection --> WFReportParams
> Nov  8 23:35:07 alcamari kernel: drbd0: Handshake successful: DRBD Network Protocol version 74
> Nov  8 23:35:07 alcamari kernel: drbd0: Connection established.
> Nov  8 23:35:07 alcamari kernel: drbd0: I am(S): 1:00000002:00000008:00000035:0000000a:00
> Nov  8 23:35:07 alcamari kernel: drbd0: Peer(S): 1:00000003:00000004:00000033:0000000a:00
> Nov  8 23:35:07 alcamari kernel: drbd0: drbd0_receiver [15871]: cstate WFReportParams --> WFBitMapT
> Nov  8 23:35:07 alcamari kernel: drbd0: Secondary/Unknown --> Secondary/Secondary
> Nov  8 23:35:07 alcamari kernel: drbd0: drbd0_receiver [15871]: cstate WFBitMapT --> SyncTarget
> Nov  8 23:35:07 alcamari kernel: drbd0: Resync started as SyncTarget (need to sync 8613708 KB [2153427 bits set]).

and here you got the sync of 8GB worth of changes in the "wrong" direction.

from looking at those generation counters,
I deduce that you had already seen that "aborting to prevent data corruption"
log message before, and apparently tried to "fix" something by
erroneously using the brute force options ("--human" and/or
"--do-what-I-say") on "Apu".

if _you_ use brute force to override DRBD's idea of data integrity,
then don't blame drbd later for making "wrong" decisions.

> Nov  8 23:35:35 alcamari sshd[15885]: Accepted password for apptiaxa from 192.168.53.41 port 35635 ssh2
> Nov  8 23:35:43 alcamari heartbeat[15951]: info: **************************
> Nov  8 23:35:43 alcamari heartbeat[15951]: info: Configuration validated. Starting heartbeat 1.2.3.cvs.20050927
> Nov  8 23:35:43 alcamari heartbeat[15952]: info: heartbeat: version 1.2.3.cvs.20050927
> Nov  8 23:35:44 alcamari heartbeat[15952]: info: Heartbeat generation: 31
> Nov  8 23:35:44 alcamari heartbeat[15952]: info: ucast: write socket priority set to IPTOS_LOWDELAY on eth2
> Nov  8 23:35:44 alcamari heartbeat[15952]: info: ucast: bound send socket to device: eth2
> Nov  8 23:35:44 alcamari heartbeat[15952]: info: ucast: bound receive socket to device: eth2
> Nov  8 23:35:44 alcamari heartbeat[15952]: info: ucast: started on port 694 interface eth2 to 172.16.0.2
> Nov  8 23:35:44 alcamari heartbeat[15952]: info: ping heartbeat started.
> Nov  8 23:35:44 alcamari heartbeat[15955]: info: pid 15955 locked in memory.
> Nov  8 23:35:44 alcamari heartbeat[15952]: info: pid 15952 locked in memory.
> Nov  8 23:35:44 alcamari heartbeat[15952]: info: Local status now set to: 'up'
> Nov  8 23:35:45 alcamari heartbeat[15956]: info: pid 15956 locked in memory.
> Nov  8 23:35:45 alcamari heartbeat[15959]: info: pid 15959 locked in memory.
> Nov  8 23:35:45 alcamari heartbeat[15958]: info: pid 15958 locked in memory.
> Nov  8 23:35:45 alcamari heartbeat[15957]: info: pid 15957 locked in memory.
> Nov  8 23:36:14 alcamari heartbeat[15952]: info: Resources being acquired from apu.
> Nov  8 23:36:14 alcamari heartbeat[15952]: info: Local status now set to: 'active'
> Nov  8 23:36:14 alcamari heartbeat[15969]: info: Local Resource acquisition completed.
> Nov  8 23:36:14 alcamari heartbeat[15952]: info: Initial resource acquisition complete (T_RESOURCES(us))
> Nov  8 23:36:19 alcamari heartbeat: info: Running /usr/local/etc/ha.d/rc.d/status status
> Nov  8 23:36:20 alcamari heartbeat: info: /usr/local/lib/heartbeat/mach_down: nice_failback: foreign resources acquired
> Nov  8 23:36:20 alcamari heartbeat[15952]: info: mach_down takeover complete.
> Nov  8 23:36:20 alcamari heartbeat: info: mach_down takeover complete for node apu.
> Nov  8 23:36:20 alcamari heartbeat: info: Running /usr/local/etc/ha.d/rc.d/status status
> Nov  8 23:36:20 alcamari heartbeat: info: Running /usr/local/etc/ha.d/rc.d/ip-request-resp ip-request-resp
> Nov  8 23:36:20 alcamari heartbeat: received ip-request-resp 192.168.53.44 OK yes
> Nov  8 23:36:20 alcamari heartbeat: info: Acquiring resource group: alcamari 192.168.53.44 drbddisk Filesystem::/dev/drbd0::/home::reiserfs mysql gateway
> Nov  8 23:36:20 alcamari heartbeat: info: Running /usr/local/etc/ha.d/resource.d/IPaddr 192.168.53.44 start
> Nov  8 23:36:22 alcamari heartbeat: CRIT: Giving up resources due to failure of 192.168.53.44

would you mind explaining why configuring a simple IP address fails?
do you have serious network problems, hardware or software,
or is this a broken configuration?
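
to narrow it down, you could run the resource agent by hand (the same
invocation heartbeat uses, taken from your logs) and see what it
complains about:

	/usr/local/etc/ha.d/resource.d/IPaddr 192.168.53.44 start
	/usr/local/etc/ha.d/resource.d/IPaddr 192.168.53.44 status
	ifconfig -a    # is 192.168.53.44 perhaps already configured elsewhere, e.g. still up on apu?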

> Nov  8 23:36:22 alcamari heartbeat: info: Releasing resource group: alcamari 192.168.53.44 drbddisk Filesystem::/dev/drbd0::/home::reiserfs mysql gateway
> Nov  8 23:36:22 alcamari heartbeat: info: Running /usr/local/etc/ha.d/resource.d/gateway  stop
> Nov  8 23:36:22 alcamari heartbeat: info: Running /usr/local/etc/ha.d/resource.d/mysql  stop
> Nov  8 23:36:22 alcamari heartbeat: info: Running /usr/local/etc/ha.d/resource.d/Filesystem /dev/drbd0 /home reiserfs stop
> Nov  8 23:36:22 alcamari heartbeat: info: Running /usr/local/etc/ha.d/resource.d/drbddisk  stop
> Nov  8 23:36:22 alcamari heartbeat: info: Running /usr/local/etc/ha.d/resource.d/IPaddr 192.168.53.44 stop
> Nov  8 23:36:24 alcamari heartbeat[15952]: info: Local Resource acquisition completed. (none)
> Nov  8 23:36:24 alcamari heartbeat[15952]: info: local resource transition completed.
> Nov  8 23:37:26 alcamari heartbeat[15952]: info: Heartbeat shutdown in progress. (15952)

did you just stop heartbeat again?

note that to me the rest of this log
looks like some admin was already in panic mode,
or some "automatic" recovery script was wreaking havoc:
there is no attempt to understand the problem,
only repeated attempts to solve it by restarting heartbeat.
but maybe it is just a serious heartbeat misconfiguration.

anyway, this has nothing more to do with drbd from here on,
so I'll skip parts of the log now.

> Nov  8 23:42:40 alcamari kernel: drbd0: drbd0_asender [15882]: cstate SyncTarget --> NetworkFailure
> Nov  8 23:42:40 alcamari kernel: drbd0: asender terminated
> Nov  8 23:42:40 alcamari kernel: drbd0: sock was shut down by peer
> Nov  8 23:42:40 alcamari kernel: drbd0: drbd0_receiver [15871]: cstate NetworkFailure --> BrokenPipe
> Nov  8 23:42:40 alcamari kernel: drbd0: worker terminated
> Nov  8 23:42:40 alcamari kernel: drbd0: drbd0_receiver [15871]: cstate BrokenPipe --> Unconnected
> Nov  8 23:42:40 alcamari kernel: drbd0: Connection lost.
> Nov  8 23:42:40 alcamari kernel: drbd0: drbd0_receiver [15871]: cstate Unconnected --> WFConnection

here is yet another disconnect/reconnect event,
so either someone just did something "interesting" on the other node,
or you need to fix your network reliability first.
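
a few obvious things to check on the replication link (eth2 /
172.16.0.x according to your logs) would be, for example:

	ping -c 3 172.16.0.2   # from alcamari: is apu's replication address reachable at all?
	ifconfig eth2          # look at the RX/TX error and dropped counters
	dmesg | grep -i eth2   # any link up/down flapping reported by the driver?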

> Nov  8 23:42:48 alcamari kernel: drbd0: drbd0_receiver [15871]: cstate WFConnection --> WFReportParams
> Nov  8 23:42:48 alcamari kernel: drbd0: Handshake successful: DRBD Network Protocol version 74
> Nov  8 23:42:48 alcamari kernel: drbd0: Connection established.
> Nov  8 23:42:48 alcamari kernel: drbd0: I am(S): 0:00000003:00000004:00000033:0000000a:01
> Nov  8 23:42:48 alcamari kernel: drbd0: Peer(S): 1:00000003:00000004:00000033:0000000a:00
> Nov  8 23:42:48 alcamari kernel: drbd0: drbd0_receiver [15871]: cstate WFReportParams --> WFBitMapT
> Nov  8 23:42:48 alcamari kernel: drbd0: Secondary/Unknown --> Secondary/Secondary
> Nov  8 23:42:48 alcamari kernel: drbd0: drbd0_receiver [15871]: cstate WFBitMapT --> SyncTarget
> Nov  8 23:42:48 alcamari kernel: drbd0: Resync started as SyncTarget (need to sync 5507560 KB [1376890 bits set]).
> Nov  8 23:42:57 alcamari kernel: drbd0: sock was shut down by peer
> Nov  8 23:42:57 alcamari kernel: drbd0: drbd0_receiver [15871]: cstate SyncTarget --> BrokenPipe

and again,

> Nov  8 23:42:57 alcamari kernel: drbd0: asender terminated
> Nov  8 23:42:57 alcamari kernel: drbd0: worker terminated
> Nov  8 23:42:57 alcamari kernel: drbd0: drbd0_receiver [15871]: cstate BrokenPipe --> Unconnected
> Nov  8 23:42:57 alcamari kernel: drbd0: Connection lost.
> Nov  8 23:42:57 alcamari kernel: drbd0: drbd0_receiver [15871]: cstate Unconnected --> WFConnection
> Nov  8 23:43:09 alcamari init: Switching to runlevel: 6

and now someone apparently decided that a reboot would fix whatever
problem they had, again without understanding the problem first.

<snip>
> Nov  8 23:45:14 alcamari logger: /etc/rc.d/rc.inet1:  /sbin/ifconfig lo 127.0.0.1
> Nov  8 23:45:14 alcamari logger: /etc/rc.d/rc.inet1:  /sbin/route add -net 127.0.0.0 netmask 255.0.0.0 lo
> Nov  8 23:45:14 alcamari logger: /etc/rc.d/rc.inet1:  /sbin/ifconfig eth0 192.168.53.48 broadcast 192.168.53.255 netmask 255.255.255.0
> Nov  8 23:45:15 alcamari logger: /etc/rc.d/rc.inet1:  /sbin/ifconfig eth2 172.16.0.1 broadcast  netmask 
> Nov  8 23:45:15 alcamari logger: /etc/rc.d/rc.inet1:  /sbin/route add default gw 192.168.53.1 metric 1
> Nov  8 23:45:15 alcamari logger: /etc/rc.d/rc.hotplug start (entering script)
> Nov  8 23:45:15 alcamari udevsend[150]: udevd daemon started
> Nov  8 23:45:16 alcamari logger: /etc/rc.d/rc.hotplug start (exiting script)
> Nov  8 23:45:16 alcamari sshd[579]: Server listening on 0.0.0.0 port 22.
> Nov  8 23:45:17 alcamari kernel: tg3: eth0: Link is up at 100 Mbps, half duplex.
> Nov  8 23:45:17 alcamari kernel: tg3: eth0: Flow control is off for TX and off for RX.
> Nov  8 23:45:18 alcamari kernel: tg3: eth2: Link is up at 1000 Mbps, full duplex.
> Nov  8 23:45:18 alcamari kernel: tg3: eth2: Flow control is on for TX and on for RX.

btw, did you recently upgrade your hardware with a new ethernet card?


> Nov  8 23:46:12 alcamari kernel: drbd0: I am(S): 0:00000003:00000004:00000033:0000000a:00
> Nov  8 23:46:12 alcamari kernel: drbd0: Peer(S): 1:00000003:00000004:00000033:0000000a:00

as we already know, alcamari is Inconsistent/SyncTarget by now;
no need to look into this further, the "wrong" decision happened
at 23:35:07, as commented above.

> Nov  8 23:46:45 alcamari kernel: drbd0: Secondary/Secondary --> Primary/Secondary

you are going online on a sync target here.
well, you can, but that is not a good idea.

I guess here you are about to recognize that your data set just jumped
backwards in time.
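
a quick look at /proc/drbd before promoting would have shown the device
still in SyncTarget with inconsistent local data. a drbddisk-style
wrapper could refuse to promote in that case; a rough sketch only (field
names as in the 0.7 /proc/drbd output, adjust to what yours shows):

	# refuse to become primary while we are still a sync target / inconsistent
	if grep -qE 'cs:SyncTarget|ld:Inconsistent' /proc/drbd ; then
		echo "not promoting: local data is not up to date" >&2
		exit 1
	fi
	drbdadm primary r0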

> Nov  8 23:46:45 alcamari heartbeat: info: Running /usr/local/etc/ha.d/resource.d/Filesystem /dev/drbd0 /home reiserfs start
> Nov  8 23:46:49 alcamari heartbeat: info: Running /usr/local/etc/ha.d/resource.d/mysql  start
> Nov  8 23:46:49 alcamari heartbeat: info: Running /usr/local/etc/ha.d/resource.d/gateway  start
> Nov  8 23:46:54 alcamari heartbeat[765]: info: Local Resource acquisition completed. (none)
> Nov  8 23:46:54 alcamari heartbeat[765]: info: local resource transition completed.
> Nov  8 23:48:38 alcamari heartbeat[765]: info: Link apu:eth2 up.
> Nov  8 23:48:38 alcamari heartbeat[765]: info: Status update for node apu: status up
> Nov  8 23:48:38 alcamari heartbeat[765]: info: Status update for node apu: status active
> Nov  8 23:48:38 alcamari heartbeat[765]: info: remote resource transition completed.
> Nov  8 23:48:38 alcamari heartbeat: info: Running /usr/local/etc/ha.d/rc.d/status status
> Nov  8 23:48:38 alcamari heartbeat: info: Running /usr/local/etc/ha.d/rc.d/status status
> Nov  8 23:51:24 alcamari sshd[1272]: Failed password for apptiaxa from 192.168.53.41 port 35648 ssh2
> Nov  8 23:51:28 alcamari sshd[1275]: Accepted password for apptiaxa from 192.168.53.41 port 35649 ssh2

right, there is someone logging in, trying to figure out what is going on.

> Nov  8 23:51:30 alcamari sshd[1272]: Accepted password for apptiaxa from 192.168.53.41 port 35648 ssh2
> Nov  8 23:52:04 alcamari sshd[1322]: Accepted password for apptiaxa from 192.168.53.41 port 35651 ssh2
> Nov  8 23:56:13 alcamari sshd[1400]: Accepted password for apptiaxa from 192.168.53.41 port 35655 ssh2
> Nov  8 23:59:26 alcamari sshd[1461]: Accepted password for apptiaxa from 199.186.32.220 port 38365 ssh2

and called for help.

> Nov  9 00:00:47 alcamari sshd[1569]: Accepted password for root from 192.168.53.41 port 35670 ssh2
> Nov  9 00:01:57 alcamari heartbeat[765]: info: Heartbeat shutdown in progress. (765)
> Nov  9 00:01:57 alcamari heartbeat[1638]: info: Giving up all HA resources.

and the decision is to stop everything and investigate. good.
but too late.

...

> Nov  9 00:02:21 alcamari kernel: drbd: module cleanup done.

> Nov  9 00:07:08 alcamari sshd[1990]: Accepted password for apptiaxa from 192.168.53.41 port 35673 ssh2
> Nov  9 00:11:29 alcamari heartbeat[2196]: info: **************************
> Nov  9 00:11:29 alcamari heartbeat[2196]: info: Configuration validated. Starting heartbeat 1.2.3.cvs.20050927

no, a heartbeat restart won't magically fix it this time either.

> Nov  9 00:12:00 alcamari heartbeat: CRIT: Giving up resources due to failure of 192.168.53.44

but again your IP address assignment is failing?


some reboots later:

> Nov  9 01:20:54 alcamari sshd[3138]: Invalid user perrito from 199.186.32.220
> Nov  9 01:20:54 alcamari sshd[3138]: Failed none for invalid user perrito from 199.186.32.220 port 58868 ssh2
> Nov  9 01:31:38 alcamari sshd[3152]: Bad protocol version identification 'Big-Brother-Monitor-1.9e' from 172.16.0.2
> Nov  9 01:36:37 alcamari sshd[3160]: Bad protocol version identification 'Big-Brother-Monitor-1.9e' from 172.16.0.2
> Nov  9 01:41:38 alcamari sshd[3165]: Bad protocol version identification 'Big-Brother-Monitor-1.9e' from 172.16.0.2
> Nov  9 01:46:38 alcamari sshd[3167]: Bad protocol version identification 'Big-Brother-Monitor-1.9e' from 172.16.0.2
> Nov  9 01:47:11 alcamari sshd[3169]: Accepted password for apptiaxa from 192.168.53.41 port 36007 ssh2
> Nov  9 01:49:01 alcamari exiting on signal 15
> Nov  9 01:49:35 alcamari syslogd 1.4.1: restart.
> Nov  9 01:49:36 alcamari kernel: klogd 1.4.1, log source = /proc/kmsg started.
> Nov  9 01:51:37 alcamari sshd[3564]: Bad protocol version identification 'Big-Brother-Monitor-1.9e' from 172.16.0.2
> Nov  9 01:56:38 alcamari sshd[3828]: Bad protocol version identification 'Big-Brother-Monitor-1.9e' from 172.16.0.2

yes, your big-brother monitoring apparently kicks in again,
trying to check ssh availability.
did you have a site outage, or a bigger network problem there?


> Apu logs
> =============================================================================================================

good, timestamps are in sync between apu and alcamari.

> Nov  8 23:34:33 apu kernel: drbd0: drbdsetup [20425]: cstate WFConnection --> Unconnected
> Nov  8 23:34:33 apu kernel: drbd0: worker terminated
> Nov  8 23:34:33 apu kernel: drbd0: drbd0_receiver [22534]: cstate Unconnected --> StandAlone
> Nov  8 23:34:33 apu kernel: drbd0: Connection lost.
> Nov  8 23:34:33 apu kernel: drbd0: drbd0_receiver [22534]: cstate StandAlone --> StandAlone
> Nov  8 23:34:33 apu kernel: drbd0: receiver terminated
> Nov  8 23:34:33 apu kernel: drbd0: drbdsetup [20425]: cstate StandAlone --> StandAlone
> Nov  8 23:34:33 apu kernel: drbd0: drbdsetup [20425]: cstate StandAlone --> Unconfigured
> Nov  8 23:34:33 apu kernel: drbd0: worker terminated
> Nov  8 23:34:33 apu kernel: drbd: module cleanup done.

yes, this node did not have a replication connection anymore,
as we already know.

> Nov  8 23:35:05 apu kernel: drbd: initialised. Version: 0.7.15 (api:77/proto:74)
> Nov  8 23:35:05 apu kernel: drbd: SVN Revision: 2020 build by root at apu.movistar.com.pe, 2006-02-07 11:26:08
> Nov  8 23:35:05 apu kernel: drbd: registered as block device major 147
> Nov  8 23:35:05 apu kernel: drbd0: resync bitmap: bits=66765502 words=2086422
> Nov  8 23:35:05 apu kernel: drbd0: size = 254 GB (267062008 KB)
> Nov  8 23:35:07 apu kernel: drbd0: 93 MB marked out-of-sync by on disk bit-map.
> Nov  8 23:35:07 apu kernel: drbd0: Found 6 transactions (324 active extents) in activity log.
> Nov  8 23:35:07 apu kernel: drbd0: drbdsetup [20438]: cstate Unconfigured --> StandAlone
> Nov  8 23:35:07 apu kernel: drbd0: drbdsetup [20451]: cstate StandAlone --> Unconnected
> Nov  8 23:35:07 apu kernel: drbd0: drbd0_receiver [20452]: cstate Unconnected --> WFConnection
> Nov  8 23:35:07 apu kernel: drbd0: drbd0_receiver [20452]: cstate WFConnection --> WFReportParams
> Nov  8 23:35:07 apu kernel: drbd0: Handshake successful: DRBD Network Protocol version 74
> Nov  8 23:35:07 apu kernel: drbd0: Connection established.
> Nov  8 23:35:07 apu kernel: drbd0: I am(S): 1:00000003:00000004:00000033:0000000a:00
> Nov  8 23:35:07 apu kernel: drbd0: Peer(S): 1:00000002:00000008:00000035:0000000a:00

and here they are connected,
and deciding "wrongly", because you had first used brute force on this side,
overriding drbd's view of the world, as already mentioned above.


> Nov  8 23:35:07 apu kernel: drbd0: drbd0_receiver [20452]: cstate WFReportParams --> WFBitMapS
> Nov  8 23:35:08 apu kernel: drbd0: Secondary/Unknown --> Secondary/Secondary
> Nov  8 23:35:08 apu kernel: drbd0: drbd0_receiver [20452]: cstate WFBitMapS --> SyncSource
> Nov  8 23:35:08 apu kernel: drbd0: Resync started as SyncSource (need to sync 8613708 KB [2153427 bits set]).
> Nov  8 23:42:40 apu kernel: drbd0: drbdsetup [20734]: cstate SyncSource --> Unconnected

ah, someone decided to "drbdadm down" it here,
or to "drbd stop" or whatever. but too late, harm is done already.

> Nov  9 00:02:03 apu heartbeat: info: Acquiring resource group: apu 192.168.53.44 drbddisk Filesystem::/dev/drbd0::/home::reiserfs mysql gateway

OUCH.
comparing that line with this line from alcamari:

> Nov  8 23:36:20 alcamari heartbeat: info: Acquiring resource group: alcamari 192.168.53.44 drbddisk Filesystem::/dev/drbd0::/home::reiserfs mysql gateway

so you have differing haresources files on the two nodes,
each one naming itself as the home node of the one resource group you have?
this is probably the source of all your problems:
a serious heartbeat misconfiguration.

the haresources file _has_ to be _identical_ on both nodes.
read up on that at linux-ha.org.
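
i.e., assuming alcamari is meant to be the preferred node, the haresources
file on _both_ alcamari and apu should contain the very same line, along
the lines of (taken from your alcamari log):

	alcamari 192.168.53.44 drbddisk Filesystem::/dev/drbd0::/home::reiserfs mysql gateway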



hope that helps,

cheers.

-- 
: commercial DRBD/HA support and consulting: sales at linbit.com :
: Lars Ellenberg                            Tel +43-1-8178292-0  :
: LINBIT Information Technologies GmbH      Fax +43-1-8178292-82 :
: Vivenotgasse 48, A-1120 Vienna/Europe    http://www.linbit.com :
__
please use the "List-Reply" function of your email client.


