[DRBD-user] automatic recovery: two successive drbd_sync_handshake calls with different results

Thierry Haillot thierry.haillot at gmail.com
Wed May 26 10:35:57 CEST 2010



2010/5/25 Maros Timko <timkom at gmail.com>:
>>
>> Hello,
>>
>> I have two nodes in a Heartbeat/DRBD configuration
>> (Node1, hostname "bleu", is primary; Node2, hostname "rocamadour",
>> is secondary).
>>
>> I want to test the automatic recovery.
>> My configuration is:
>>                after-sb-0pri discard-least-changes;
>>                after-sb-1pri discard-secondary;
>>                after-sb-2pri call-pri-lost-after-sb;
>
> What does your pri-lost-after-sb handler do? Reboot the node? Which one?

 pri-lost-after-sb "echo DRBD primary lost >~/.drbdStatus;
     drbdadm secondary mysql; drbdadm outdate mysql;
     ifconfig eth0 down; sync; reboot -f";

N1 reboots.
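
For reference, in DRBD 8.3 these options live in two different
sections of drbd.conf: the after-sb-* policies go in the "net"
section, the pri-lost-after-sb handler in the "handlers" section.
Roughly (rest of the resource definition omitted):

resource mysql {
  net {
    after-sb-0pri discard-least-changes;
    after-sb-1pri discard-secondary;
    after-sb-2pri call-pri-lost-after-sb;
  }
  handlers {
    pri-lost-after-sb "... command string as above ...";
  }
}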

>
>>
>> I disconnect the network cable from N2.
>> N1 and N2 both become primary.
>> I write some data on N1, and MORE data on N2.
>>
>> I reconnect N2's network cable.
>> N1 immediately reboots.
>> And after N1 boots, data is synchronised from N1 to N2 (wrong:
>> there are more changes on N2).
>>
>> Just before N1 reboots:
>> May 25 16:38:26 rocamadour kernel: [ 1785.500483] block drbd1:
>> drbd_sync_handshake:
>> May 25 16:38:26 rocamadour kernel: [ 1785.500490] block drbd1: self
>> 3E097AF02C8F65F1:106921FD4287CF02:D202FB3C36D5F660:59248EB25AF9FA1B
>> bits:170465 flags:0
>> May 25 16:38:26 rocamadour kernel: [ 1785.500496] block drbd1: peer
>> 7BF8153619974429:106921FD4287CF03:D202FB3C36D5F660:59248EB25AF9FA1B
>> bits:227 flags:0
>> May 25 16:38:26 rocamadour kernel: [ 1785.500502] block drbd1:
>> uuid_compare()=100 by rule 90
>> May 25 16:38:26 rocamadour kernel: [ 1785.500507] block drbd1:
>> Split-Brain detected, 2 primaries, automatically solved. Sync from
>> this node
>
> We have 2 primaries here
>
>>
>> After N1 has rebooted:
>> May 25 16:38:26 rocamadour kernel: [ 1785.500483] block drbd1:
>> drbd_sync_handshake:
>> May 25 16:38:58 rocamadour kernel: [ 1817.216740] block drbd1: self
>> 3E097AF02C8F65F0:106921FD4287CF02:D202FB3C36D5F660:59248EB25AF9FA1B
>> bits:170475 flags:0
>> May 25 16:38:58 rocamadour kernel: [ 1817.216746] block drbd1: peer
>> 7BF8153619974428:106921FD4287CF03:D202FB3C36D5F660:59248EB25AF9FA1B
>> bits:263168 flags:2
>> May 25 16:38:58 rocamadour kernel: [ 1817.216751] block drbd1:
>> uuid_compare()=100 by rule 90
>> May 25 16:38:58 rocamadour kernel: [ 1817.216756] block drbd1:
>> Split-Brain detected, 0 primaries, automatically solved. Sync from
>> peer node
>>
>
> We have 0 primaries here. You mentioned a reboot of one node. Who
> demoted the other one?

At the same time, Heartbeat on N2 gives up all HA resources: it
unmounts the filesystem and demotes DRBD ("drbd1: role( Primary ->
Secondary )").
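
In case it helps, this is how I follow the DRBD state on both nodes
during the test (standard DRBD 8.3 commands):

  drbdadm role mysql     # resource role, e.g. Primary/Unknown while split
  drbdadm cstate mysql   # connection state, e.g. StandAlone or WFConnection
  cat /proc/drbd         # full status, including the out-of-sync count (oos:)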

 --- log on N2 ---
May 25 16:38:26 rocamadour kernel: [ 1785.500483] block drbd1:
drbd_sync_handshake:
May 25 16:38:26 rocamadour kernel: [ 1785.500490] block drbd1: self
3E097AF02C8F65F1:106921FD4287CF02:D202FB3C36D5F660:59248EB25AF9FA1B
bits:170465 flags:0
May 25 16:38:26 rocamadour kernel: [ 1785.500496] block drbd1: peer
7BF8153619974429:106921FD4287CF03:D202FB3C36D5F660:59248EB25AF9FA1B
bits:227 flags:0
May 25 16:38:26 rocamadour kernel: [ 1785.500502] block drbd1:
uuid_compare()=100 by rule 90
May 25 16:38:26 rocamadour kernel: [ 1785.500507] block drbd1:
Split-Brain detected, 2 primaries, automatically solved. Sync from
this node
May 25 16:38:26 rocamadour kernel: [ 1785.500516] block drbd1: peer(
Unknown -> Primary ) conn( WFReportParams -> WFBitMapS ) pdsk(
Outdated -> UpToDate )
May 25 16:38:27 rocamadour harc[13960]: info: Running
/etc/ha.d/rc.d/status status
May 25 16:38:27 rocamadour heartbeat: [1523]: info: Link
192.168.1.119:192.168.1.119 up.
May 25 16:38:27 rocamadour heartbeat: [1523]: WARN: Late heartbeat:
Node 192.168.1.119: interval 141210 ms
May 25 16:38:27 rocamadour ipfail: [1766]: info: Link Status update:
Link 192.168.1.119/192.168.1.119 now has status up
May 25 16:38:27 rocamadour heartbeat: [1523]: info: Status update for
node 192.168.1.119: status ping
May 25 16:38:27 rocamadour ipfail: [1766]: info: Status update: Node
192.168.1.119 now has status ping
May 25 16:38:27 rocamadour heartbeat: [1523]: info: Managed status
process 13960 exited with return code 0.

May 25 16:38:27 rocamadour ipfail: [1766]: info: A ping node just came up.
May 25 16:38:27 rocamadour heartbeat: [1523]: info: all clients are now paused
May 25 16:38:28 rocamadour heartbeat: [1523]: info:
hb_giveup_resources(): current status: active
May 25 16:38:28 rocamadour heartbeat: [1523]: info: Heartbeat shutdown
in progress. (1523)
May 25 16:38:28 rocamadour heartbeat: [13997]: info: Giving up all HA resources.
May 25 16:38:28 rocamadour ResourceManager[14013]: info: Releasing
resource group: bleu IPaddr::192.168.1.228/24 drbddisk::mysql
Filesystem::/dev/drbd1::/mnt/drbd::ext3::defaults mysql mon
(...)
May 25 16:38:45 rocamadour Filesystem[14305]: INFO: Running stop for
/dev/drbd1 on /mnt/drbd
May 25 16:38:45 rocamadour Filesystem[14305]: INFO: Trying to unmount /mnt/drbd
May 25 16:38:45 rocamadour Filesystem[14305]: INFO: unmounted
/mnt/drbd successfully
May 25 16:38:45 rocamadour Filesystem[14289]: INFO:  Success
May 25 16:38:45 rocamadour ResourceManager[14013]: info: Running
/etc/ha.d/resource.d/drbddisk mysql stop
May 25 16:38:45 rocamadour kernel: [ 1804.477879] block drbd1: role(
Primary -> Secondary )
May 25 16:38:45 rocamadour ResourceManager[14013]: info: Running
/etc/ha.d/resource.d/IPaddr 192.168.1.228/24 stop
May 25 16:38:46 rocamadour IPaddr[14456]: INFO: ifconfig eth0:0 down
May 25 16:38:46 rocamadour IPaddr[14428]: INFO:  Success
May 25 16:38:46 rocamadour heartbeat: [13997]: info: All HA resources
relinquished.
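
For now I can recover by hand with the usual manual split-brain
resolution from the DRBD 8.3 user's guide, picking the sync source
myself. Assuming N2 ("rocamadour") holds the good data:

  # on N1 (bleu), the node whose changes are to be discarded:
  drbdadm secondary mysql
  drbdadm -- --discard-my-data connect mysql

  # on N2 (rocamadour), the survivor, if it stayed StandAlone:
  drbdadm connect mysql

But I would still like to understand why the automatic recovery picks
N1 as the sync source on the second handshake.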


Thanks.

----------
Thierry


