Hi,

Problem: the connection state stays at cs:WFBitMapT/cs:WFBitMapS after changes in the network, and the nodes do not sync. A more detailed description follows below.

Environment:

Packages: drbd82-8.2.6-1.el5.centos kmod-drbd82-8.2.6-2
OS: Linux node01 2.6.18-92.el5 #1 SMP Tue Apr 29 13:16:15 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux

/proc/drbd:

Primary:
--------------------%<--------------------
version: 8.2.6 (api:88/proto:86-88)
GIT-hash: 3e69822d3bb4920a8c1bfdf7d647169eba7d2eb4 build by buildsvn at c5-x8664-build, 2008-10-03 11:30:17
 0: cs:WFBitMapS st:Secondary/Secondary ds:UpToDate/Outdated C r---
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 oos:380
--------------------%<--------------------

Secondary:
--------------------%<--------------------
version: 8.2.6 (api:88/proto:86-88)
GIT-hash: 3e69822d3bb4920a8c1bfdf7d647169eba7d2eb4 build by buildsvn at c5-x8664-build, 2008-10-03 11:30:17
 0: cs:WFBitMapT st:Secondary/Secondary ds:Outdated/UpToDate C r---
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 oos:72
--------------------%<--------------------

drbdadm get-gi all

Primary:
--------------------%<--------------------
A6DAADF5FB98B6D0:95905CADE9DCFB23:3CF9226F2C03C538:0000000000000004:1:1:0:1:0:1
--------------------%<--------------------

Secondary:
--------------------%<--------------------
CDCCE6A3D5BA3FFC:95905CADE9DCFB23:3CF9226F2C03C538:0000000000000004:1:0:0:1:0:0
--------------------%<--------------------

tail -n /var/log/messages | grep -i drbd

Primary:
--------------------%<--------------------
May 6 10:36:22 node01 kernel: drbd0: Split-Brain detected, dropping connection!
May 6 10:36:22 node01 kernel: drbd0: self A6DAADF5FB98B6D0:95905CADE9DCFB23:3CF9226F2C03C538:0000000000000004
May 6 10:36:22 node01 kernel: drbd0: peer CDCCE6A3D5BA3FFC:95905CADE9DCFB23:3CF9226F2C03C538:0000000000000004
May 6 10:36:22 node01 kernel: drbd0: helper command: /sbin/drbdadm split-brain
May 6 10:36:22 node01 kernel: drbd0: conn( WFReportParams -> Disconnecting )
May 6 10:36:22 node01 kernel: drbd0: error receiving ReportState, l: 4!
May 6 10:36:22 node01 kernel: drbd0: asender terminated
May 6 10:36:22 node01 kernel: drbd0: Terminating asender thread
May 6 10:36:22 node01 kernel: drbd0: tl_clear()
May 6 10:36:22 node01 kernel: drbd0: Connection closed
May 6 10:36:22 node01 kernel: drbd0: conn( Disconnecting -> StandAlone )
May 6 10:36:22 node01 kernel: drbd0: receiver terminated
May 6 10:36:22 node01 kernel: drbd0: Terminating receiver thread
May 6 10:40:43 node01 kernel: drbd0: conn( StandAlone -> Unconnected )
May 6 10:40:43 node01 kernel: drbd0: Starting receiver thread (from drbd0_worker [16858])
May 6 10:40:43 node01 kernel: drbd0: receiver (re)started
May 6 10:40:43 node01 kernel: drbd0: conn( Unconnected -> WFConnection )
May 6 10:40:53 node01 kernel: drbd0: Handshake successful: Agreed network protocol version 88
May 6 10:40:53 node01 kernel: drbd0: conn( WFConnection -> WFReportParams )
May 6 10:40:53 node01 kernel: drbd0: Starting asender thread (from drbd0_receiver [17038])
May 6 10:40:53 node01 kernel: drbd0: data-integrity-alg: <not-used>
May 6 10:40:53 node01 kernel: drbd0: Split-Brain detected, manually solved. Sync from this node
May 6 10:40:53 node01 kernel: drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS )
May 6 10:40:53 node01 kernel: drbd0: Writing meta data super block now.
May 6 10:41:05 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg time expired, ko = 4294967295
May 6 10:41:11 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg time expired, ko = 4294967294
May 6 10:41:17 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg time expired, ko = 4294967293
May 6 10:41:23 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg time expired, ko = 4294967292
May 6 10:41:29 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg time expired, ko = 4294967291
May 6 10:41:35 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg time expired, ko = 4294967290
May 6 10:41:41 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg time expired, ko = 4294967289
May 6 10:41:47 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg time expired, ko = 4294967288
May 6 10:41:53 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg time expired, ko = 4294967287
May 6 10:41:59 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg time expired, ko = 4294967286
May 6 10:42:05 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg time expired, ko = 4294967285
May 6 10:42:11 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg time expired, ko = 4294967284
May 6 10:42:17 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg time expired, ko = 4294967283
May 6 10:42:23 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg time expired, ko = 4294967282
May 6 10:42:29 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg time expired, ko = 4294967281
May 6 10:42:35 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg time expired, ko = 4294967280
May 6 10:42:41 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg time expired, ko = 4294967279
May 6 10:42:47 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg time expired, ko = 4294967278
May 6 10:42:53 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg time expired, ko = 4294967277
May 6 10:42:59 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg time expired, ko = 4294967276
May 6 10:43:05 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg time expired, ko = 4294967275
May 6 10:43:11 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg time expired, ko = 4294967274
May 6 10:43:17 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg time expired, ko = 4294967273
May 6 10:43:23 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg time expired, ko = 4294967272
May 6 10:43:29 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg time expired, ko = 4294967271
May 6 10:43:35 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg time expired, ko = 4294967270
May 6 10:43:41 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg time expired, ko = 4294967269
May 6 10:43:47 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg time expired, ko = 4294967268
May 6 10:43:53 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg time expired, ko = 4294967267
May 6 10:43:59 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg time expired, ko = 4294967266
May 6 10:44:05 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg time expired, ko = 4294967265
May 6 10:44:11 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg time expired, ko = 4294967264
May 6 10:44:17 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg time expired, ko = 4294967263
May 6 10:44:23 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg time expired, ko = 4294967262
May 6 10:44:29 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg time expired, ko = 4294967261
May 6 10:44:35 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg time expired, ko = 4294967260
May 6 10:44:41 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg time expired, ko = 4294967259
May 6 10:44:47 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg time expired, ko = 4294967258
May 6 10:44:53 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg time expired, ko = 4294967257
May 6 10:44:59 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg time expired, ko = 4294967256
May 6 10:45:05 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg time expired, ko = 4294967255
May 6 10:45:11 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg time expired, ko = 4294967254
May 6 10:45:17 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg time expired, ko = 4294967253
May 6 10:45:23 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg time expired, ko = 4294967252
May 6 10:45:29 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg time expired, ko = 4294967251
May 6 10:45:35 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg time expired, ko = 4294967250
May 6 10:45:41 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg time expired, ko = 4294967249
May 6 10:45:47 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg time expired, ko = 4294967248
May 6 10:45:53 node01 kernel: drbd0: [drbd0_worker/16858] sock_sendmsg time expired, ko = 4294967247
--------------------%<--------------------

Secondary:
--------------------%<--------------------
May 6 10:36:22 node02 kernel: drbd0: Split-Brain detected, dropping connection!
May 6 10:36:22 node02 kernel: drbd0: self CDCCE6A3D5BA3FFC:95905CADE9DCFB23:3CF9226F2C03C538:0000000000000004
May 6 10:36:22 node02 kernel: drbd0: peer A6DAADF5FB98B6D0:95905CADE9DCFB23:3CF9226F2C03C538:0000000000000004
May 6 10:36:22 node02 kernel: drbd0: helper command: /sbin/drbdadm split-brain
May 6 10:36:22 node02 kernel: drbd0: conn( WFReportParams -> Disconnecting )
May 6 10:36:22 node02 kernel: drbd0: error receiving ReportState, l: 4!
May 6 10:36:22 node02 kernel: drbd0: asender terminated
May 6 10:36:22 node02 kernel: drbd0: Terminating asender thread
May 6 10:36:22 node02 kernel: drbd0: tl_clear()
May 6 10:36:22 node02 kernel: drbd0: Connection closed
May 6 10:36:22 node02 kernel: drbd0: conn( Disconnecting -> StandAlone )
May 6 10:36:22 node02 kernel: drbd0: receiver terminated
May 6 10:36:22 node02 kernel: drbd0: Terminating receiver thread
May 6 10:40:53 node02 kernel: drbd0: conn( StandAlone -> Unconnected )
May 6 10:40:53 node02 kernel: drbd0: Starting receiver thread (from drbd0_worker [13338])
May 6 10:40:53 node02 kernel: drbd0: receiver (re)started
May 6 10:40:53 node02 kernel: drbd0: conn( Unconnected -> WFConnection )
May 6 10:40:53 node02 kernel: drbd0: Handshake successful: Agreed network protocol version 88
May 6 10:40:53 node02 kernel: drbd0: conn( WFConnection -> WFReportParams )
May 6 10:40:53 node02 kernel: drbd0: Starting asender thread (from drbd0_receiver [14123])
May 6 10:40:53 node02 kernel: drbd0: data-integrity-alg: <not-used>
May 6 10:40:53 node02 kernel: drbd0: Split-Brain detected, manually solved. Sync from peer node
May 6 10:40:53 node02 kernel: drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
May 6 10:40:53 node02 kernel: drbd0: Writing meta data super block now.
--------------------%<--------------------

Configuration:
--------------------%<--------------------
# /usr/share/doc/drbd82/drbd.conf
#
common {
  syncer { rate 100M; }
  protocol C;
}

resource drbd0 {
  on node01 {
    device /dev/drbd0;
    disk /dev/datavg/drbdlv;
    meta-disk internal;
    address 192.168.1.1:8888;
  }
  on node02 {
    device /dev/drbd0;
    disk /dev/datavg/drbdlv;
    meta-disk internal;
    address 192.168.1.2:8888;
  }
}
--------------------%<--------------------

It was working properly whilst on a test network. Now that it is connected to the production network, the nodes can see each other, but syncing does not work anymore. No firewalls are in the way.
I use a separate connection (over eth1) for the DRBD sync. I tried to get rid of the split brain by issuing the following (as can be seen in the logs):

--------------------%<--------------------
drbdadm -- --discard-my-data connect drbd0
--------------------%<--------------------

How do I get out of this state and get the sync up and running again?

Thank you very much for your help!

Ger Apeldoorn
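For anyone reading the `get-gi` lines and the repeated `sock_sendmsg` messages above, the Python sketch below splits the posted values into fields and checks the arithmetic behind them. It assumes the first four colon-separated `get-gi` fields are the current, bitmap and two history UUIDs, and it leaves the remaining flag fields uninterpreted. `looks_like_split_brain` is a hypothetical, simplified heuristic (current UUIDs differ, neither current UUID is the peer's bitmap UUID, bitmap UUIDs match and are non-zero), not DRBD's full comparison logic. The `ko` helper assumes the value is printed from an unsigned 32-bit counter that started at 0.

```python
from datetime import datetime

# Assumption: the first four ':'-separated fields of `drbdadm get-gi` are the
# current, bitmap, history-1 and history-2 UUIDs; the rest are flags, which
# this sketch keeps as raw strings without interpreting them.
UUID_NAMES = ("current", "bitmap", "history1", "history2")


def parse_gi(line: str) -> dict:
    """Split one get-gi line into named UUID fields plus the raw flag fields."""
    fields = line.strip().split(":")
    gi = dict(zip(UUID_NAMES, fields[:4]))
    gi["flags"] = fields[4:]
    return gi


def compare_gi(a: str, b: str) -> dict:
    """Report, per UUID slot, whether both nodes hold the same value."""
    ga, gb = parse_gi(a), parse_gi(b)
    return {name: ga[name] == gb[name] for name in UUID_NAMES}


def looks_like_split_brain(a: str, b: str) -> bool:
    """Hypothetical simplified heuristic, not DRBD's full rule set:
    the current UUIDs differ, neither current UUID equals the peer's
    bitmap UUID, and both nodes share the same non-zero bitmap UUID."""
    ga, gb = parse_gi(a), parse_gi(b)
    return (
        ga["current"] != gb["current"]
        and ga["current"] != gb["bitmap"]
        and gb["current"] != ga["bitmap"]
        and ga["bitmap"] == gb["bitmap"]
        and int(ga["bitmap"], 16) != 0
    )


def decrements_from_zero(ko: int, bits: int = 32) -> int:
    """How many times an unsigned `bits`-wide counter must be decremented
    from 0 to show the value `ko` (assumes the counter started at 0)."""
    return (2**bits - ko) % 2**bits


def mean_interval(first: str, last: str, count: int) -> float:
    """Average number of seconds between `count` log entries stamped HH:MM:SS."""
    fmt = "%H:%M:%S"
    span = datetime.strptime(last, fmt) - datetime.strptime(first, fmt)
    return span.total_seconds() / (count - 1)


# Values copied from the post above.
PRIMARY = "A6DAADF5FB98B6D0:95905CADE9DCFB23:3CF9226F2C03C538:0000000000000004:1:1:0:1:0:1"
SECONDARY = "CDCCE6A3D5BA3FFC:95905CADE9DCFB23:3CF9226F2C03C538:0000000000000004:1:0:0:1:0:0"

if __name__ == "__main__":
    print(compare_gi(PRIMARY, SECONDARY))
    print(looks_like_split_brain(PRIMARY, SECONDARY))
    print(decrements_from_zero(4294967295), decrements_from_zero(4294967247))
    print(mean_interval("10:41:05", "10:45:53", 49))
```

With the values from this post, only the current UUIDs differ; the first and last `ko` values correspond to 1 and 49 decrements from zero, matching the 49 logged messages; and those messages arrive every 6 seconds. For comparison, the drbd.conf man page documents a default `timeout` of 60 tenths of a second (6 s) and a default `ko-count` of 0 (disabled), and the configuration above sets neither.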