[DRBD-user] Verify resync going split-brain

Dan Barker dbarker at visioncomm.net
Wed May 26 20:11:19 CEST 2010


I reported this issue a month ago and didn't hear anything back on the
list. It has happened again. The latest occurrence is top-posted here; the
April 28 report is below it. The configuration files in the April 28
report are still valid, except that there are now a few more LUNs and
resources.

Issue: I do a verify, and OOS (out-of-sync) sectors are discovered.
I do a drbdadm disconnect r3
I do a drbdadm connect r3
The resource goes split-brain rather than resyncing the 8K of OOS data.

Normal split-brain recovery works fine, but I thought that wouldn't be
necessary when following the doc at:
http://www.drbd.org/users-guide-emb/s-use-online-verify.html.

Commands issued on Node A (and its logs):

drbdadm verify r3

  [1523317.582034] block drbd4: conn( Connected -> VerifyS )
  [1523317.582037] block drbd4: Starting Online Verify from sector 0
  [1523801.408745] block drbd4: Out of sync: start=24763592, size=8 (sectors)
  [1526013.538472] block drbd4: Out of sync: start=137810736, size=8 (sectors)
  [1527419.556042] block drbd4: Online verify  done (total 4101 sec; paused 0 sec; 25564 K/sec)
  [1527419.556065] block drbd4: Online verify found 2 4k block out of sync!
  [1527419.556065] block drbd4: conn( VerifyS -> Connected )
  [1527419.556195] block drbd4: Writing the whole bitmap, due to failed kmalloc
  [1527419.556217] block drbd4: helper command: /sbin/drbdadm out-of-sync minor-4
  [1527419.560506] block drbd4: helper command: /sbin/drbdadm out-of-sync minor-4 exit code 0 (0x0)
  [1527419.575527] block drbd4: 8 KB (2 bits) marked out-of-sync by on disk bit-map.
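
As a sanity check on the figures above, the two out-of-sync extents can be
totalled directly from the log lines. This is only a sketch: it assumes
512-byte sectors, and the function name is made up for illustration.

```python
import re

SECTOR_BYTES = 512  # assumption: the "(sectors)" in the log are 512-byte sectors


def oos_summary(log_text):
    """Return (extent_count, total_KiB) for the 'Out of sync' lines in a log."""
    sizes = [
        int(m.group(1))
        for m in re.finditer(r"Out of sync: start=\d+, size=(\d+)", log_text)
    ]
    return len(sizes), sum(sizes) * SECTOR_BYTES // 1024


log = """
block drbd4: Out of sync: start=24763592, size=8 (sectors)
block drbd4: Out of sync: start=137810736, size=8 (sectors)
"""
print(oos_summary(log))
```

Two extents of 8 sectors each come to 8 KiB, which agrees with the
"8 KB (2 bits) marked out-of-sync" line in the log.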

drbdadm disconnect r3

  [1527865.002618] block drbd4: peer( Primary -> Unknown ) conn( Connected -> Disconnecting ) pdsk( UpToDate -> DUnknown )
  [1527865.002645] block drbd4: Creating new current UUID
  [1527865.002696] block drbd4: short read expecting header on sock: r=-512
  [1527865.002841] block drbd4: asender terminated
  [1527865.002848] block drbd4: Terminating asender thread
  [1527865.003239] block drbd4: Connection closed
  [1527865.003338] block drbd4: conn( Disconnecting -> StandAlone )
  [1527865.003372] block drbd4: receiver terminated
  [1527865.003375] block drbd4: Terminating receiver thread

drbdadm connect r3

  [1527871.486995] block drbd4: conn( StandAlone -> Unconnected )
  [1527871.498482] block drbd4: Starting receiver thread (from drbd4_worker [15413])
  [1527871.498529] block drbd4: receiver (re)started
  [1527871.498534] block drbd4: conn( Unconnected -> WFConnection )
  [1527871.599937] block drbd4: Handshake successful: Agreed network protocol version 91
  [1527871.599946] block drbd4: conn( WFConnection -> WFReportParams )
  [1527871.599968] block drbd4: Starting asender thread (from drbd4_receiver [13042])
  [1527871.600018] block drbd4: data-integrity-alg: <not-used>
  [1527871.600827] block drbd4: drbd_sync_handshake:
  [1527871.600835] block drbd4: self 54863BBA014906E1:10B6A7002CAF39CD:A345B84315709F38:3849A70BBB2CFDCB bits:12 flags:0
  [1527871.600840] block drbd4: peer 2CD31480994ED3D9:10B6A7002CAF39CD:A345B84315709F39:3849A70BBB2CFDCB bits:2 flags:0
  [1527871.600844] block drbd4: uuid_compare()=100 by rule 90
  [1527871.600847] block drbd4: Split-Brain detected, dropping connection!
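
One way to read the self/peer lines above: the first (current) UUIDs differ
while the second (bitmap) UUIDs are identical, which suggests both nodes
started a new data generation from the same ancestor after the disconnect.
The sketch below is a simplified reading of those two lines, not DRBD's
actual uuid_compare() implementation; the field order (current, bitmap, two
history UUIDs) and the masking of the lowest bit are assumptions, and the
function names are hypothetical.

```python
def parse_uuids(field):
    """Split 'CURRENT:BITMAP:HISTORY1:HISTORY2' into four integers."""
    return [int(part, 16) for part in field.split(":")]


def looks_like_split_brain(self_field, peer_field):
    # Assumption: the lowest bit of each UUID is a flag, so mask it off
    # before comparing.
    mask = ~1
    s_cur, s_bm, *_ = parse_uuids(self_field)
    p_cur, p_bm, *_ = parse_uuids(peer_field)
    diverged = (s_cur & mask) != (p_cur & mask)
    common_ancestor = (s_bm & mask) == (p_bm & mask) and (s_bm & mask) != 0
    return diverged and common_ancestor


node_a = "54863BBA014906E1:10B6A7002CAF39CD:A345B84315709F38:3849A70BBB2CFDCB"
node_b = "2CD31480994ED3D9:10B6A7002CAF39CD:A345B84315709F39:3849A70BBB2CFDCB"
print(looks_like_split_brain(node_a, node_b))
```

With the UUIDs from this log the check reports a divergence, which matches
the "Creating new current UUID" message that appears on both nodes at
disconnect time.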

Logs at Node B:

  [51070.850655] block drbd4: conn( Connected -> VerifyT )
  [51070.860592] block drbd4: Online Verify start sector: 0
  [51554.622312] block drbd4: Out of sync: start=24763592, size=8 (sectors)
  [53766.446901] block drbd4: Out of sync: start=137810736, size=8 (sectors)
  [55172.272044] block drbd4: Online verify  done (total 4101 sec; paused 0 sec; 25564 K/sec)
  [55172.272044] block drbd4: Online verify found 2 4k block out of sync!
  [55172.272044] block drbd4: conn( VerifyT -> Connected )
  [55172.272044] block drbd4: Writing the whole bitmap, due to failed kmalloc
  [55172.272044] block drbd4: helper command: /sbin/drbdadm out-of-sync minor-4
  [55172.289073] block drbd4: helper command: /sbin/drbdadm out-of-sync minor-4 exit code 0 (0x0)
  [55172.305203] block drbd4: 8 KB (2 bits) marked out-of-sync by on disk bit-map.
  [55617.661601] block drbd4: peer( Primary -> Unknown ) conn( Connected -> TearDown ) pdsk( UpToDate -> DUnknown )
  [55617.661601] block drbd4: Creating new current UUID
  [55617.661601] block drbd4: meta connection shut down by peer.
  [55617.661601] block drbd4: asender terminated
  [55617.661601] block drbd4: Terminating asender thread
  [55617.661601] block drbd4: Connection closed
  [55617.661601] block drbd4: conn( TearDown -> Unconnected )
  [55617.661601] block drbd4: receiver terminated
  [55617.661601] block drbd4: Restarting receiver thread
  [55617.661601] block drbd4: receiver (re)started
  [55617.661601] block drbd4: conn( Unconnected -> WFConnection )
  [55624.264341] block drbd4: Handshake successful: Agreed network protocol version 91
  [55624.264371] block drbd4: conn( WFConnection -> WFReportParams )
  [55624.264445] block drbd4: Starting asender thread (from drbd4_receiver [3082])
  [55624.265771] block drbd4: data-integrity-alg: <not-used>
  [55624.266496] block drbd4: drbd_sync_handshake:
  [55624.266537] block drbd4: self 2CD31480994ED3D9:10B6A7002CAF39CD:A345B84315709F39:3849A70BBB2CFDCB bits:2 flags:0
  [55624.266580] block drbd4: peer 54863BBA014906E1:10B6A7002CAF39CD:A345B84315709F38:3849A70BBB2CFDCB bits:12 flags:0
  [55624.266672] block drbd4: uuid_compare()=100 by rule 90
  [55624.266747] block drbd4: Split-Brain detected, dropping connection!

I performed split-brain recovery in the conventional way, but why was it
necessary?

Dan Barker
Atlanta

-----Original Message-----
From: drbd-user-bounces at lists.linbit.com
[mailto:drbd-user-bounces at lists.linbit.com] On Behalf Of Dan Barker
Sent: Wednesday, April 28, 2010 9:15 AM
To: drbd-user at lists.linbit.com
Subject: [DRBD-user] Verify resync going split-brain

I read the doc
(http://www.drbd.org/users-guide-emb/s-use-online-verify.html), and it
says to resync after a verify failure via:

drbdadm disconnect
drbdadm connect

I did a verify on my main node (DRBD0, below) and it found some OOS blocks.
Since the disconnect causes the node to become unavailable, I did the
disconnect/reconnect on the other node. This resulted in a split-brain, so
I inferred that the disconnect/connect must be issued on the same node
that requested the verify.

The next time, I did the verify on the second node (DRBD1) and the
disconnect/connect on the same node. Again, split brain.

What am I doing wrong?

Dan Barker

Configuration files below, dmesg output inline.

DRBD0 (172.30.0.40) is Primary, iSCSITarget Active for I/O
DRBD1 (172.30.0.41) is Primary, iSCSITarget Active (idle)

DRBD1 verify all
  block drbd3: conn( Connected -> VerifyS )
  block drbd3: Starting Online Verify from sector 0
  block drbd1: conn( Connected -> VerifyS )
  block drbd1: Starting Online Verify from sector 0
  block drbd2: conn( Connected -> VerifyS )
  block drbd2: Starting Online Verify from sector 0
  block drbd2: Online verify  done (total 10375 sec; paused 0 sec; 25264 K/sec)
  block drbd2: conn( VerifyS -> Connected )
  block drbd3: Out of sync: start=722699032, size=8 (sectors)
  block drbd3: Out of sync: start=722709312, size=8 (sectors)
  block drbd3: Out of sync: start=751133504, size=8 (sectors)
  block drbd1: Online verify  done (total 16418 sec; paused 0 sec; 25544 K/sec)
  block drbd1: conn( VerifyS -> Connected )
  block drbd3: Online verify  done (total 16880 sec; paused 0 sec; 24844 K/sec)
  block drbd3: Online verify found 3 4k block out of sync!
  block drbd3: conn( VerifyS -> Connected )
  block drbd3: Writing the whole bitmap, due to failed kmalloc
  block drbd3: helper command: /sbin/drbdadm out-of-sync minor-3
  block drbd3: helper command: /sbin/drbdadm out-of-sync minor-3 exit code 0 (0x0)
  block drbd3: 12 KB (3 bits) marked out-of-sync by on disk bit-map.

DRBD1 iscsitarget stop
  iscsi_trgt: Removing all connections, sessions and targets

DRBD1 disconnect r2
  block drbd3: peer( Primary -> Unknown ) conn( Connected -> Disconnecting ) pdsk( UpToDate -> DUnknown )
  block drbd3: Creating new current UUID
  block drbd3: meta connection shut down by peer.
  block drbd3: asender terminated
  block drbd3: Terminating asender thread
  block drbd3: Connection closed
  block drbd3: conn( Disconnecting -> StandAlone )
  block drbd3: receiver terminated
  block drbd3: Terminating receiver thread

DRBD1 connect r2
  block drbd3: conn( StandAlone -> Unconnected )
  block drbd3: Starting receiver thread (from drbd3_worker [2936])
  block drbd3: receiver (re)started
  block drbd3: conn( Unconnected -> WFConnection )
  block drbd3: Handshake successful: Agreed network protocol version 91
  block drbd3: conn( WFConnection -> WFReportParams )
  block drbd3: Starting asender thread (from drbd3_receiver [6831])
  block drbd3: data-integrity-alg: <not-used>
  block drbd3: drbd_sync_handshake:
  block drbd3: self 0AA964F1A2B09569:5582778FF39681F5:956F8322C0AB90FB:4C3D617FEA1E0097 bits:3 flags:0
  block drbd3: peer B32E410E11613899:5582778FF39681F5:956F8322C0AB90FA:4C3D617FEA1E0097 bits:193 flags:0
  block drbd3: uuid_compare()=100 by rule 90
  block drbd3: Split-Brain detected, dropping connection!
  block drbd3: helper command: /sbin/drbdadm split-brain minor-3
  block drbd3: helper command: /sbin/drbdadm split-brain minor-3 exit code 0 (0x0)
  block drbd3: conn( WFReportParams -> Disconnecting )
  block drbd3: error receiving ReportState, l: 4!
  block drbd3: asender terminated
  block drbd3: Terminating asender thread
  block drbd3: Connection closed
  block drbd3: conn( Disconnecting -> StandAlone )
  block drbd3: receiver terminated
  block drbd3: Terminating receiver thread

DRBD1 secondary r2
  block drbd3: role( Primary -> Secondary )

DRBD1 -- --discard-my-data connect r2
  block drbd3: conn( StandAlone -> Unconnected )
  block drbd3: Starting receiver thread (from drbd3_worker [2936])
  block drbd3: receiver (re)started
  block drbd3: conn( Unconnected -> WFConnection )

DRBD0 connect r2
  block drbd3: Handshake successful: Agreed network protocol version 91
  block drbd3: conn( WFConnection -> WFReportParams )
  block drbd3: Starting asender thread (from drbd3_receiver [6854])
  block drbd3: data-integrity-alg: <not-used>
  block drbd3: drbd_sync_handshake:
  block drbd3: self 0AA964F1A2B09568:5582778FF39681F5:956F8322C0AB90FB:4C3D617FEA1E0097 bits:3 flags:0
  block drbd3: peer B32E410E11613899:5582778FF39681F5:956F8322C0AB90FA:4C3D617FEA1E0097 bits:1283 flags:0
  block drbd3: uuid_compare()=100 by rule 90
  block drbd3: Split-Brain detected, manually solved. Sync from peer node
  block drbd3: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
  block drbd3: conn( WFBitMapT -> WFSyncUUID )
  block drbd3: helper command: /sbin/drbdadm before-resync-target minor-3
  block drbd3: helper command: /sbin/drbdadm before-resync-target minor-3 exit code 0 (0x0)
  block drbd3: conn( WFSyncUUID -> SyncTarget ) disk( UpToDate -> Inconsistent )
  block drbd3: Began resync as SyncTarget (will sync 5132 KB [1283 bits set]).

  block drbd3: Resync done (total 3 sec; paused 0 sec; 1708 K/sec)
  block drbd3: 6 % had equal check sums, eliminated: 352K; transferred 4780K total 5132K
  block drbd3: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate )
  block drbd3: helper command: /sbin/drbdadm after-resync-target minor-3
  block drbd3: helper command: /sbin/drbdadm after-resync-target minor-3 exit code 0 (0x0)
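
The resync figures in the log are internally consistent, which can be
checked with a few lines of arithmetic. This sketch assumes one bitmap bit
covers 4 KiB (as the "4k block" verify messages suggest) and that the
logged percentage is truncated to a whole number.

```python
BIT_KIB = 4  # assumption: one bitmap bit covers 4 KiB

bits_set = 1283        # "1283 bits set"
eliminated_kib = 352   # "eliminated: 352K"
transferred_kib = 4780 # "transferred 4780K"

total_kib = bits_set * BIT_KIB
# Truncating division reproduces the whole-number percentage in the log.
equal_percent = eliminated_kib * 100 // total_kib

print(total_kib, eliminated_kib + transferred_kib, equal_percent)
```

Both totals come to 5132 KiB, and 352 KiB of 5132 KiB truncates to 6 %.
Note that the recovery resynced 1283 bits rather than the 3 found by the
verify, because the bitmaps of both nodes had grown in the meantime.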

DRBD1 primary r2
  block drbd3: role( Secondary -> Primary )

DRBD1 iscsitarget start
  iSCSI Enterprise Target Software - version 1.4.20
  iscsi_trgt: Registered io type fileio
  iscsi_trgt: Registered io type blockio
  iscsi_trgt: Registered io type nullio

All is back to normal.
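
For reference, the manual recovery performed above can be written down as
an ordered plan. The sketch below only builds and prints the command
strings and executes nothing. It assumes each "DRBD1 ..." / "DRBD0 ..."
heading in the transcript is shorthand for a drbdadm invocation on that
node; the function name is hypothetical, and the iSCSI target stop/start
steps are left out because the transcript does not show the exact command.

```python
def split_brain_recovery_plan(resource, victim, survivor):
    """Return (node, command) pairs mirroring the transcript; nothing is run."""
    return [
        (victim, f"drbdadm secondary {resource}"),
        (victim, f"drbdadm -- --discard-my-data connect {resource}"),
        (survivor, f"drbdadm connect {resource}"),
        # In the transcript this was issued after the resync had finished.
        (victim, f"drbdadm primary {resource}"),
    ]


plan = split_brain_recovery_plan("r2", victim="DRBD1", survivor="DRBD0")
for node, cmd in plan:
    print(f"{node}: {cmd}")
```

The node passed as victim is the one whose changes are discarded, so the
choice of arguments matters.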

CONFIGURATION FILES:

/etc/iet/ietd.conf
Target iqn.2010-03.com.visioncomm.Storage00:Storage00
Lun 0 Path=/dev/drbd1,Type=blockio,ScsiSN=SPIDSK-090311-00
Lun 1 Path=/dev/drbd2,Type=blockio,ScsiSN=SPIDSK-090312-00
Lun 2 Path=/dev/drbd3,Type=blockio,ScsiSN=SPIDSK-090319-00



/etc/drbd.d/global_common.conf
global { usage-count yes; }
common { protocol C;
  handlers {
    pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
    pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
    local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
  }
  startup { }
  disk { }
  net { allow-two-primaries; }
  syncer {
    csums-alg md5;
    rate 25M;
    verify-alg md5;
  }
}

/etc/drbd.d/r0.res
resource r0 {
  startup { become-primary-on both; }
  device    /dev/drbd1;
  disk      /dev/sdb;
  meta-disk internal;
  on Storage00 {
    address   172.30.0.40:7789;
  }
  on Storage01 {
    address   172.30.0.41:7789;
  }
}

/etc/drbd.d/r1.res
resource r1 {
  startup { become-primary-on both; }
  device    /dev/drbd2;
  disk      /dev/sdc;
  meta-disk internal;
  on Storage00 {
    address   172.30.0.40:7790;
  }
  on Storage01 {
    address   172.30.0.41:7790;
  }
}

/etc/drbd.d/r2.res
resource r2 {
  startup { become-primary-on both; }
  device    /dev/drbd3;
  disk      /dev/sdd;
  meta-disk internal;
  on Storage00 {
    address   172.30.0.40:7791;
  }
  on Storage01 {
    address   172.30.0.41:7791;
  }
}
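
The three resource files differ only in name, minor number, backing disk,
and port, so the pattern can be captured in a small table. The sketch below
renders text in the same layout as the files above; it is an illustration
of that pattern, not a tool that was used here, and the function name is
hypothetical.

```python
RESOURCES = [  # (name, minor, backing disk, TCP port) as in the files above
    ("r0", 1, "/dev/sdb", 7789),
    ("r1", 2, "/dev/sdc", 7790),
    ("r2", 3, "/dev/sdd", 7791),
]
HOSTS = [("Storage00", "172.30.0.40"), ("Storage01", "172.30.0.41")]


def render_resource(name, minor, disk, port):
    """Render one resource stanza in the layout used by the .res files."""
    lines = [
        f"resource {name} {{",
        "  startup { become-primary-on both; }",
        f"  device    /dev/drbd{minor};",
        f"  disk      {disk};",
        "  meta-disk internal;",
    ]
    for host, ip in HOSTS:
        lines += [f"  on {host} {{", f"    address   {ip}:{port};", "  }"]
    lines.append("}")
    return "\n".join(lines)


print(render_resource(*RESOURCES[2]))
```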

