Hi,
Problem:
The connection state stays at cs:WFBitMapS/cs:WFBitMapT after a change of
network, and the nodes do not sync. (More detailed description below.)
Environment:
Packages:
drbd82-8.2.6-1.el5.centos
kmod-drbd82-8.2.6-2
OS:
Linux node01 2.6.18-92.el5 #1 SMP Tue Apr 29 13:16:15 EDT 2008
x86_64 x86_64 x86_64 GNU/Linux
/proc/drbd:
Primary:
--------------------%<--------------------
version: 8.2.6 (api:88/proto:86-88)
GIT-hash: 3e69822d3bb4920a8c1bfdf7d647169eba7d2eb4 build by
buildsvn at c5-x8664-build, 2008-10-03 11:30:17
0: cs:WFBitMapS st:Secondary/Secondary ds:UpToDate/Outdated C r---
ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 oos:380
--------------------%<--------------------
Secondary:
--------------------%<--------------------
version: 8.2.6 (api:88/proto:86-88)
GIT-hash: 3e69822d3bb4920a8c1bfdf7d647169eba7d2eb4 build by
buildsvn at c5-x8664-build, 2008-10-03 11:30:17
0: cs:WFBitMapT st:Secondary/Secondary ds:Outdated/UpToDate C r---
ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 oos:72
--------------------%<--------------------
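For anyone who wants to compare the two status lines programmatically, here is a minimal Python sketch that splits an 8.2-style status line into its cs/st/ds fields. parse_status and in_bitmap_exchange are hypothetical helper names, not part of DRBD, and the regular expression assumes nothing beyond the line layout shown above.

```python
import re

# Status lines copied from the /proc/drbd output above.
NODE01 = "0: cs:WFBitMapS st:Secondary/Secondary ds:UpToDate/Outdated C r---"
NODE02 = "0: cs:WFBitMapT st:Secondary/Secondary ds:Outdated/UpToDate C r---"

# Matches "<minor>: cs:<state> st:<self>/<peer> ds:<self>/<peer> ...".
STATUS_RE = re.compile(
    r"^\s*(?P<minor>\d+):\s+cs:(?P<cs>\S+)"
    r"\s+st:(?P<st_self>[^/\s]+)/(?P<st_peer>\S+)"
    r"\s+ds:(?P<ds_self>[^/\s]+)/(?P<ds_peer>\S+)"
)


def parse_status(line):
    """Hypothetical helper: return the fields of one status line as a dict."""
    match = STATUS_RE.match(line)
    if match is None:
        raise ValueError(f"not an 8.2-style status line: {line!r}")
    return match.groupdict()


def in_bitmap_exchange(status):
    """True while the connection state is one of the two WFBitMap states."""
    return status["cs"] in ("WFBitMapS", "WFBitMapT")


for name, line in (("node01", NODE01), ("node02", NODE02)):
    status = parse_status(line)
    print(name, status["cs"], status["st_self"], status["ds_self"],
          in_bitmap_exchange(status))
```

Run against both lines, it reports that each node is Secondary and that both are still in a WFBitMap state, which matches the description of the problem.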
drbdadm get-gi all
Primary:
--------------------%<--------------------
A6DAADF5FB98B6D0:95905CADE9DCFB23:3CF9226F2C03C538:0000000000000004:1:1:0:1:0:1
--------------------%<--------------------
Secondary:
--------------------%<--------------------
CDCCE6A3D5BA3FFC:95905CADE9DCFB23:3CF9226F2C03C538:0000000000000004:1:0:0:1:0:0
--------------------%<--------------------
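To make the difference between the two generation identifier tuples easier to see, here is a small Python sketch that compares them field by field. The names given to the first four fields (current, bitmap and two history UUIDs) are an assumption based on how the DRBD user's guide describes this output; the trailing flag fields are kept as raw strings and not interpreted. split_gi and differing_fields are hypothetical helpers, not DRBD tools.

```python
# Generation identifiers copied from the `drbdadm get-gi all` output above.
GI_NODE01 = ("A6DAADF5FB98B6D0:95905CADE9DCFB23:3CF9226F2C03C538:"
             "0000000000000004:1:1:0:1:0:1")
GI_NODE02 = ("CDCCE6A3D5BA3FFC:95905CADE9DCFB23:3CF9226F2C03C538:"
             "0000000000000004:1:0:0:1:0:0")

# Assumed meaning of the first four colon-separated fields.
UUID_FIELDS = ("current", "bitmap", "history_young", "history_old")


def split_gi(gi):
    """Split one get-gi line into a dict of UUIDs and a list of raw flags."""
    parts = gi.strip().split(":")
    if len(parts) < len(UUID_FIELDS):
        raise ValueError(f"unexpected get-gi output: {gi!r}")
    uuids = dict(zip(UUID_FIELDS, parts[:4]))
    flags = parts[4:]  # left uninterpreted on purpose
    return uuids, flags


def differing_fields(gi_a, gi_b):
    """Return the names of the UUID fields that differ between two nodes."""
    uuids_a, _ = split_gi(gi_a)
    uuids_b, _ = split_gi(gi_b)
    return [name for name in UUID_FIELDS if uuids_a[name] != uuids_b[name]]


print(differing_fields(GI_NODE01, GI_NODE02))  # prints ['current']
```

Only the first field differs while the other three UUIDs are identical on both nodes. Under the field naming assumed above, that fits the split-brain message in the logs below.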
tail -n /var/log/messages | grep -i drbd
Primary:
--------------------%<--------------------
May 6 10:36:22 node01 kernel: drbd0: Split-Brain detected, dropping
connection!
May 6 10:36:22 node01 kernel: drbd0: self
A6DAADF5FB98B6D0:95905CADE9DCFB23:3CF9226F2C03C538:0000000000000004
May 6 10:36:22 node01 kernel: drbd0: peer
CDCCE6A3D5BA3FFC:95905CADE9DCFB23:3CF9226F2C03C538:0000000000000004
May 6 10:36:22 node01 kernel: drbd0: helper command: /sbin/drbdadm
split-brain
May 6 10:36:22 node01 kernel: drbd0: conn( WFReportParams ->
Disconnecting )
May 6 10:36:22 node01 kernel: drbd0: error receiving ReportState, l: 4!
May 6 10:36:22 node01 kernel: drbd0: asender terminated
May 6 10:36:22 node01 kernel: drbd0: Terminating asender thread
May 6 10:36:22 node01 kernel: drbd0: tl_clear()
May 6 10:36:22 node01 kernel: drbd0: Connection closed
May 6 10:36:22 node01 kernel: drbd0: conn( Disconnecting -> StandAlone )
May 6 10:36:22 node01 kernel: drbd0: receiver terminated
May 6 10:36:22 node01 kernel: drbd0: Terminating receiver thread
May 6 10:40:43 node01 kernel: drbd0: conn( StandAlone -> Unconnected )
May 6 10:40:43 node01 kernel: drbd0: Starting receiver thread (from
drbd0_worker [16858])
May 6 10:40:43 node01 kernel: drbd0: receiver (re)started
May 6 10:40:43 node01 kernel: drbd0: conn( Unconnected -> WFConnection )
May 6 10:40:53 node01 kernel: drbd0: Handshake successful: Agreed
network protocol version 88
May 6 10:40:53 node01 kernel: drbd0: conn( WFConnection ->
WFReportParams )
May 6 10:40:53 node01 kernel: drbd0: Starting asender thread (from
drbd0_receiver [17038])
May 6 10:40:53 node01 kernel: drbd0: data-integrity-alg: <not-used>
May 6 10:40:53 node01 kernel: drbd0: Split-Brain detected, manually
solved. Sync from this node
May 6 10:40:53 node01 kernel: drbd0: peer( Unknown -> Secondary ) conn(
WFReportParams -> WFBitMapS )
May 6 10:40:53 node01 kernel: drbd0: Writing meta data super block now.
May 6 10:41:05 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg
time expired, ko = 4294967295
May 6 10:41:11 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg
time expired, ko = 4294967294
May 6 10:41:17 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg
time expired, ko = 4294967293
May 6 10:41:23 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg
time expired, ko = 4294967292
May 6 10:41:29 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg
time expired, ko = 4294967291
May 6 10:41:35 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg
time expired, ko = 4294967290
May 6 10:41:41 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg
time expired, ko = 4294967289
May 6 10:41:47 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg
time expired, ko = 4294967288
May 6 10:41:53 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg
time expired, ko = 4294967287
May 6 10:41:59 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg
time expired, ko = 4294967286
May 6 10:42:05 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg
time expired, ko = 4294967285
May 6 10:42:11 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg
time expired, ko = 4294967284
May 6 10:42:17 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg
time expired, ko = 4294967283
May 6 10:42:23 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg
time expired, ko = 4294967282
May 6 10:42:29 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg
time expired, ko = 4294967281
May 6 10:42:35 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg
time expired, ko = 4294967280
May 6 10:42:41 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg
time expired, ko = 4294967279
May 6 10:42:47 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg
time expired, ko = 4294967278
May 6 10:42:53 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg
time expired, ko = 4294967277
May 6 10:42:59 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg
time expired, ko = 4294967276
May 6 10:43:05 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg
time expired, ko = 4294967275
May 6 10:43:11 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg
time expired, ko = 4294967274
May 6 10:43:17 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg
time expired, ko = 4294967273
May 6 10:43:23 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg
time expired, ko = 4294967272
May 6 10:43:29 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg
time expired, ko = 4294967271
May 6 10:43:35 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg
time expired, ko = 4294967270
May 6 10:43:41 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg
time expired, ko = 4294967269
May 6 10:43:47 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg
time expired, ko = 4294967268
May 6 10:43:53 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg
time expired, ko = 4294967267
May 6 10:43:59 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg
time expired, ko = 4294967266
May 6 10:44:05 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg
time expired, ko = 4294967265
May 6 10:44:11 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg
time expired, ko = 4294967264
May 6 10:44:17 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg
time expired, ko = 4294967263
May 6 10:44:23 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg
time expired, ko = 4294967262
May 6 10:44:29 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg
time expired, ko = 4294967261
May 6 10:44:35 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg
time expired, ko = 4294967260
May 6 10:44:41 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg
time expired, ko = 4294967259
May 6 10:44:47 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg
time expired, ko = 4294967258
May 6 10:44:53 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg
time expired, ko = 4294967257
May 6 10:44:59 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg
time expired, ko = 4294967256
May 6 10:45:05 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg
time expired, ko = 4294967255
May 6 10:45:11 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg
time expired, ko = 4294967254
May 6 10:45:17 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg
time expired, ko = 4294967253
May 6 10:45:23 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg
time expired, ko = 4294967252
May 6 10:45:29 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg
time expired, ko = 4294967251
May 6 10:45:35 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg
time expired, ko = 4294967250
May 6 10:45:41 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg
time expired, ko = 4294967249
May 6 10:45:47 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg
time expired, ko = 4294967248
May 6 10:45:53 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg
time expired, ko = 4294967247
--------------------%<--------------------
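The repeating sock_sendmsg messages can be summarised with a short Python sketch. It takes the first three of them (rejoined onto single lines), measures the gap between consecutive messages, and reinterprets the ko value as a signed 32-bit number. The helper names are hypothetical. One possible reading, not checked against the DRBD source, is that the counter started at 0 and wrapped around when it was decremented, which would fit a configuration that does not set ko-count; treat that as an assumption.

```python
import re

# First three sock_sendmsg messages from node01 above, rejoined onto one line each.
LOG = """\
May 6 10:41:05 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg time expired, ko = 4294967295
May 6 10:41:11 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg time expired, ko = 4294967294
May 6 10:41:17 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg time expired, ko = 4294967293
"""

LINE_RE = re.compile(
    r"(\d{2}:\d{2}:\d{2}) .*sock_sendmsg time expired, ko = (\d+)"
)


def to_seconds(hms):
    """Convert HH:MM:SS to seconds since midnight."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def parse_events(log):
    """Return (seconds_since_midnight, ko) for each matching log line."""
    events = []
    for line in log.splitlines():
        match = LINE_RE.search(line)
        if match:
            events.append((to_seconds(match.group(1)), int(match.group(2))))
    return events


def as_signed32(value):
    """Reinterpret an unsigned 32-bit value as a signed one."""
    return value - 2**32 if value >= 2**31 else value


events = parse_events(LOG)
intervals = [later[0] - earlier[0] for earlier, later in zip(events, events[1:])]
print(intervals)                               # prints [6, 6]
print([as_signed32(ko) for _, ko in events])   # prints [-1, -2, -3]
```

The send attempt times out every 6 seconds, and the counter starts at 2**32 - 1 and drops by one each time, so the worker keeps retrying without the connection ever being dropped.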
Secondary:
--------------------%<--------------------
May 6 10:36:22 node02 kernel: drbd0: Split-Brain detected, dropping
connection!
May 6 10:36:22 node02 kernel: drbd0: self
CDCCE6A3D5BA3FFC:95905CADE9DCFB23:3CF9226F2C03C538:0000000000000004
May 6 10:36:22 node02 kernel: drbd0: peer
A6DAADF5FB98B6D0:95905CADE9DCFB23:3CF9226F2C03C538:0000000000000004
May 6 10:36:22 node02 kernel: drbd0: helper command: /sbin/drbdadm
split-brain
May 6 10:36:22 node02 kernel: drbd0: conn( WFReportParams ->
Disconnecting )
May 6 10:36:22 node02 kernel: drbd0: error receiving ReportState, l: 4!
May 6 10:36:22 node02 kernel: drbd0: asender terminated
May 6 10:36:22 node02 kernel: drbd0: Terminating asender thread
May 6 10:36:22 node02 kernel: drbd0: tl_clear()
May 6 10:36:22 node02 kernel: drbd0: Connection closed
May 6 10:36:22 node02 kernel: drbd0: conn( Disconnecting -> StandAlone )
May 6 10:36:22 node02 kernel: drbd0: receiver terminated
May 6 10:36:22 node02 kernel: drbd0: Terminating receiver thread
May 6 10:40:53 node02 kernel: drbd0: conn( StandAlone -> Unconnected )
May 6 10:40:53 node02 kernel: drbd0: Starting receiver thread (from
drbd0_worker [13338])
May 6 10:40:53 node02 kernel: drbd0: receiver (re)started
May 6 10:40:53 node02 kernel: drbd0: conn( Unconnected -> WFConnection )
May 6 10:40:53 node02 kernel: drbd0: Handshake successful: Agreed
network protocol version 88
May 6 10:40:53 node02 kernel: drbd0: conn( WFConnection ->
WFReportParams )
May 6 10:40:53 node02 kernel: drbd0: Starting asender thread (from
drbd0_receiver [14123])
May 6 10:40:53 node02 kernel: drbd0: data-integrity-alg: <not-used>
May 6 10:40:53 node02 kernel: drbd0: Split-Brain detected, manually
solved. Sync from peer node
May 6 10:40:53 node02 kernel: drbd0: peer( Unknown -> Secondary ) conn(
WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
May 6 10:40:53 node02 kernel: drbd0: Writing meta data super block now.
--------------------%<--------------------
Configuration:
--------------------%<--------------------
# /usr/share/doc/drbd82/drbd.conf
#
common {
  syncer { rate 100M; }
  protocol C;
}
resource drbd0 {
  on node01 {
    device /dev/drbd0;
    disk /dev/datavg/drbdlv;
    meta-disk internal;
    address 192.168.1.1:8888;
  }
  on node02 {
    device /dev/drbd0;
    disk /dev/datavg/drbdlv;
    meta-disk internal;
    address 192.168.1.2:8888;
  }
}
--------------------%<--------------------
It was working properly while on a test network. Now that it is
connected to the production network, the nodes can see each other, but
syncing no longer works.
No firewalls are in the way.
I use a separate connection (over eth1) for the DRBD sync.
I tried to get rid of the split-brain by issuing the following (as can
be seen in the logs):
--------------------%<--------------------
drbdadm -- --discard-my-data connect drbd0
--------------------%<--------------------
How do I get out of this state and get the sync up and running again?
Thank you very much for your help!
Ger Apeldoorn