[DRBD-user] Cluster toggle with no reasons

benjamin.linier at engie.com benjamin.linier at engie.com
Wed May 4 15:36:57 CEST 2016


Hello,

I have more than 10 sites in production, all with exactly the same standard master/slave (M/S) installation:

2 nodes (server1 and server2)
Debian 3.2.68-1+deb7u2 x86_64 GNU/Linux
DRBD 8.3.11 (api:88/proto:86-96)
Corosync 1.4.2-3

All sites are identical because we deploy them from an automated installation DVD built with SimpleCDD.

We have a serious problem at one site: occasionally the MASTER role switches from server1 to server2 for no apparent reason and then switches back to server1. Sometimes the system toggles 2 or 3 times before returning to its normal state.

The issue is not periodic. Sometimes it happens after 2 months of stability, and sometimes only 15 days after the previous occurrence.

The situation is critical because a toggle can corrupt data: MySQL tables end up marked as crashed, and our software stops.

Could you help me determine the possible root causes of this cluster instability?

I first suspected the LAN, but I ran some tests in the lab: when we inject errors on the LAN, the log shows something like "conn( WFConnection -> NetworkFailure )". That is not the case at the production site, so the LAN seems to be OK.
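
As a minimal sketch of that check (not a tool we run in production), the Python snippet below scans kernel log lines for DRBD connection-state changes and keeps only those that enter NetworkFailure. The function name find_network_failures and the sample list are illustrative assumptions; the message format is the one seen in our logs.

```python
import re

# Matches the "conn( A -> B )" part of a DRBD kernel message.
CONN_RE = re.compile(r"conn\( (\w+) -> (\w+) \)")

def find_network_failures(lines):
    """Return (line, old_state, new_state) for each transition into NetworkFailure."""
    hits = []
    for line in lines:
        m = CONN_RE.search(line)
        if m and m.group(2) == "NetworkFailure":
            hits.append((line.rstrip(), m.group(1), m.group(2)))
    return hits

# Illustrative input: one lab-style line and one line from the production log.
sample = [
    "block drbd0: conn( WFConnection -> NetworkFailure )",
    "May  1 08:15:14 server1 kernel: [3865117.842211] block drbd0: "
    "peer( Secondary -> Unknown ) conn( Connected -> Disconnecting ) "
    "pdsk( UpToDate -> DUnknown )",
]

for line, old, new in find_network_failures(sample):
    print(f"{old} -> {new}: {line}")
```

Run against the production logs, a check of this kind returns nothing, which is why the LAN does not look like the cause to me.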

Here are the production logs for server1 and server2:

Server1:
May  1 08:15:13 server1 kernel: [3865117.570629] block drbd0: role( Primary -> Secondary )
May  1 08:15:13 server1 kernel: [3865117.570661] block drbd0: bitmap WRITE of 0 pages took 0 jiffies
May  1 08:15:13 server1 kernel: [3865117.570671] block drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
May  1 08:15:14 server1 kernel: [3865117.842211] block drbd0: peer( Secondary -> Unknown ) conn( Connected -> Disconnecting ) pdsk( UpToDate -> DUnknown )
May  1 08:15:14 server1 kernel: [3865117.842384] block drbd0: asender terminated
May  1 08:15:14 server1 kernel: [3865117.842389] block drbd0: Terminating drbd0_asender
May  1 08:15:14 server1 kernel: [3865117.842486] block drbd0: Connection closed
May  1 08:15:14 server1 kernel: [3865117.842504] block drbd0: conn( Disconnecting -> StandAlone )
May  1 08:15:14 server1 kernel: [3865117.842587] block drbd0: receiver terminated
May  1 08:15:14 server1 kernel: [3865117.842591] block drbd0: Terminating drbd0_receiver
May  1 08:15:14 server1 kernel: [3865117.842595] block drbd0: disk( UpToDate -> Failed )
May  1 08:15:14 server1 kernel: [3865117.842633] block drbd0: disk( Failed -> Diskless )
May  1 08:15:14 server1 kernel: [3865117.842665] block drbd0: drbd_bm_resize called with capacity == 0
May  1 08:15:14 server1 kernel: [3865117.842674] block drbd0: worker terminated
May  1 08:15:14 server1 kernel: [3865117.842678] block drbd0: Terminating drbd0_worker
May  1 08:17:42 server1 kernel: [3865265.972144] block drbd0: Starting worker thread (from drbdsetup [57250])
May  1 08:17:42 server1 kernel: [3865265.972269] block drbd0: disk( Diskless -> Attaching )
May  1 08:17:42 server1 kernel: [3865265.975419] block drbd0: Found 4 transactions (192 active extents) in activity log.
May  1 08:17:42 server1 kernel: [3865265.975426] block drbd0: Method to ensure write ordering: flush
May  1 08:17:42 server1 kernel: [3865265.975434] block drbd0: drbd_bm_resize called with capacity == 3948344
May  1 08:17:42 server1 kernel: [3865265.975459] block drbd0: resync bitmap: bits=493543 words=7712 pages=16
May  1 08:17:42 server1 kernel: [3865265.975464] block drbd0: size = 1928 MB (1974172 KB)
May  1 08:17:42 server1 kernel: [3865265.975795] block drbd0: bitmap READ of 16 pages took 0 jiffies
May  1 08:17:42 server1 kernel: [3865265.975848] block drbd0: recounting of set bits took additional 0 jiffies
May  1 08:17:42 server1 kernel: [3865265.975853] block drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
May  1 08:17:42 server1 kernel: [3865265.975861] block drbd0: disk( Attaching -> UpToDate )
May  1 08:17:42 server1 kernel: [3865265.975866] block drbd0: attached to UUIDs 089C9C45FDE4ABBC:0000000000000000:62F362153A7923B6:62F262153A7923B6
May  1 08:17:42 server1 kernel: [3865265.990736] block drbd0: conn( StandAlone -> Unconnected )
May  1 08:17:42 server1 kernel: [3865265.990760] block drbd0: Starting receiver thread (from drbd0_worker [57251])
May  1 08:17:42 server1 kernel: [3865265.990899] block drbd0: receiver (re)started
May  1 08:17:42 server1 kernel: [3865265.990909] block drbd0: conn( Unconnected -> WFConnection )
May  1 08:17:42 server1 kernel: [3865266.489465] block drbd0: Handshake successful: Agreed network protocol version 96
May  1 08:17:42 server1 kernel: [3865266.489795] block drbd0: Peer authenticated using 20 bytes of 'sha1' HMAC
May  1 08:17:42 server1 kernel: [3865266.489808] block drbd0: conn( WFConnection -> WFReportParams )
May  1 08:17:42 server1 kernel: [3865266.489921] block drbd0: Starting asender thread (from drbd0_receiver [57283])
May  1 08:17:42 server1 kernel: [3865266.490137] block drbd0: data-integrity-alg: <not-used>
May  1 08:17:42 server1 kernel: [3865266.490167] block drbd0: drbd_sync_handshake:
May  1 08:17:42 server1 kernel: [3865266.490173] block drbd0: self 089C9C45FDE4ABBC:0000000000000000:62F362153A7923B6:62F262153A7923B6 bits:0 flags:0
May  1 08:17:42 server1 kernel: [3865266.490178] block drbd0: peer 8FE26C139FB94071:089C9C45FDE4ABBC:62F362153A7923B6:62F262153A7923B6 bits:297 flags:0
May  1 08:17:42 server1 kernel: [3865266.490183] block drbd0: uuid_compare()=-1 by rule 50
May  1 08:17:42 server1 kernel: [3865266.490193] block drbd0: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) disk( UpToDate -> Outdated ) pdsk( DUnknown -> UpToDate )
May  1 08:17:42 server1 kernel: [3865266.492311] block drbd0: conn( WFBitMapT -> WFSyncUUID )
May  1 08:17:42 server1 kernel: [3865266.496178] block drbd0: updated sync uuid 089D9C45FDE4ABBC:0000000000000000:62F362153A7923B6:62F262153A7923B6
May  1 08:17:42 server1 kernel: [3865266.496315] block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0
May  1 08:17:42 server1 kernel: [3865266.499086] block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0 exit code 0 (0x0)
May  1 08:17:42 server1 kernel: [3865266.499097] block drbd0: conn( WFSyncUUID -> SyncTarget ) disk( Outdated -> Inconsistent )
May  1 08:17:42 server1 kernel: [3865266.499110] block drbd0: Began resync as SyncTarget (will sync 1188 KB [297 bits set]).
May  1 08:17:42 server1 kernel: [3865266.558764] block drbd0: Resync done (total 1 sec; paused 0 sec; 1188 K/sec)
May  1 08:17:42 server1 kernel: [3865266.558774] block drbd0: updated UUIDs 8FE26C139FB94070:0000000000000000:089D9C45FDE4ABBC:089C9C45FDE4ABBC
May  1 08:17:42 server1 kernel: [3865266.558782] block drbd0: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate )
May  1 08:17:42 server1 kernel: [3865266.558838] block drbd0: helper command: /sbin/drbdadm after-resync-target minor-0
May  1 08:17:42 server1 kernel: [3865266.561386] block drbd0: helper command: /sbin/drbdadm after-resync-target minor-0 exit code 0 (0x0)
May  1 08:17:42 server1 kernel: [3865266.561574] block drbd0: bitmap WRITE of 10 pages took 0 jiffies
May  1 08:17:42 server1 kernel: [3865266.561583] block drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
May  1 08:17:49 server1 kernel: [3865273.492362] block drbd0: peer( Primary -> Secondary )
May  1 08:17:50 server1 kernel: [3865273.759294] block drbd0: role( Secondary -> Primary )
May  1 08:17:50 server1 kernel: [3865273.953401] EXT4-fs (drbd0): mounted filesystem with ordered data mode. Opts: (null)

Server2:
May  1 06:47:02 server2 lpd[23019]: restarted
May  1 08:08:36 server2 kernel: [3865172.121119] block drbd0: peer( Primary -> Secondary )
May  1 08:08:36 server2 kernel: [3865172.357331] block drbd0: peer( Secondary -> Unknown ) conn( Connected -> TearDown ) pdsk( UpToDate -> DUnknown )
May  1 08:08:36 server2 kernel: [3865172.357554] block drbd0: asender terminated
May  1 08:08:36 server2 kernel: [3865172.357562] block drbd0: Terminating drbd0_asender
May  1 08:08:36 server2 kernel: [3865172.357739] block drbd0: Connection closed
May  1 08:08:36 server2 kernel: [3865172.357749] block drbd0: conn( TearDown -> Unconnected )
May  1 08:08:36 server2 kernel: [3865172.357759] block drbd0: receiver terminated
May  1 08:08:36 server2 kernel: [3865172.357762] block drbd0: Restarting drbd0_receiver
May  1 08:08:36 server2 kernel: [3865172.357766] block drbd0: receiver (re)started
May  1 08:08:36 server2 kernel: [3865172.357771] block drbd0: conn( Unconnected -> WFConnection )
May  1 08:08:36 server2 kernel: [3865172.601231] block drbd0: role( Secondary -> Primary )
May  1 08:08:36 server2 kernel: [3865172.601432] block drbd0: new current UUID 8FE26C139FB94071:089C9C45FDE4ABBC:62F362153A7923B6:62F262153A7923B6
May  1 08:08:36 server2 kernel: [3865172.785933] EXT4-fs (drbd0): mounted filesystem with ordered data mode. Opts: (null)
May  1 08:11:05 server2 kernel: [3865321.006344] block drbd0: Handshake successful: Agreed network protocol version 96
May  1 08:11:05 server2 kernel: [3865321.006741] block drbd0: Peer authenticated using 20 bytes of 'sha1' HMAC
May  1 08:11:05 server2 kernel: [3865321.006754] block drbd0: conn( WFConnection -> WFReportParams )
May  1 08:11:05 server2 kernel: [3865321.006890] block drbd0: Starting asender thread (from drbd0_receiver [6537])
May  1 08:11:05 server2 kernel: [3865321.007160] block drbd0: data-integrity-alg: <not-used>
May  1 08:11:05 server2 kernel: [3865321.007199] block drbd0: drbd_sync_handshake:
May  1 08:11:05 server2 kernel: [3865321.007205] block drbd0: self 8FE26C139FB94071:089C9C45FDE4ABBC:62F362153A7923B6:62F262153A7923B6 bits:297 flags:0
May  1 08:11:05 server2 kernel: [3865321.007224] block drbd0: peer 089C9C45FDE4ABBC:0000000000000000:62F362153A7923B6:62F262153A7923B6 bits:0 flags:0
May  1 08:11:05 server2 kernel: [3865321.007234] block drbd0: uuid_compare()=1 by rule 70
May  1 08:11:05 server2 kernel: [3865321.007244] block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> Consistent )
May  1 08:11:05 server2 kernel: [3865321.010000] block drbd0: helper command: /sbin/drbdadm before-resync-source minor-0
May  1 08:11:05 server2 kernel: [3865321.012921] block drbd0: helper command: /sbin/drbdadm before-resync-source minor-0 exit code 0 (0x0)
May  1 08:11:05 server2 kernel: [3865321.012932] block drbd0: conn( WFBitMapS -> SyncSource ) pdsk( Consistent -> Inconsistent )
May  1 08:11:05 server2 kernel: [3865321.012942] block drbd0: Began resync as SyncSource (will sync 1188 KB [297 bits set]).
May  1 08:11:05 server2 kernel: [3865321.012958] block drbd0: updated sync UUID 8FE26C139FB94071:089D9C45FDE4ABBC:089C9C45FDE4ABBC:62F362153A7923B6
May  1 08:11:05 server2 kernel: [3865321.076076] block drbd0: Resync done (total 1 sec; paused 0 sec; 1188 K/sec)
May  1 08:11:05 server2 kernel: [3865321.076086] block drbd0: updated UUIDs 8FE26C139FB94071:0000000000000000:089D9C45FDE4ABBC:089C9C45FDE4ABBC
May  1 08:11:05 server2 kernel: [3865321.076096] block drbd0: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
May  1 08:11:05 server2 kernel: [3865321.076342] block drbd0: bitmap WRITE of 10 pages took 0 jiffies
May  1 08:11:05 server2 kernel: [3865321.076351] block drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
May  1 08:11:12 server2 kernel: [3865328.009119] block drbd0: role( Primary -> Secondary )
May  1 08:11:12 server2 kernel: [3865328.009190] block drbd0: bitmap WRITE of 0 pages took 0 jiffies
May  1 08:11:12 server2 kernel: [3865328.009202] block drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
May  1 08:11:12 server2 kernel: [3865328.276278] block drbd0: peer( Secondary -> Primary )
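
To make the two logs easier to compare, the state changes can be extracted into a short timeline. The following Python sketch assumes the syslog layout shown above; the function name transitions and the sample list are hypothetical. Timestamps are kept per host, because the wall clocks of the two nodes appear to differ by several minutes in these logs.

```python
import re

# One syslog line: "<Mon> <day> <time> <host> kernel: [<uptime>] block <dev>: <msg>"
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) (?P<host>\S+) kernel: "
    r"\[\s*(?P<uptime>\d+\.\d+)\] block (?P<dev>\w+): (?P<msg>.*)$"
)
# One state change inside a message, e.g. "role( Primary -> Secondary )".
TRANS_RE = re.compile(r"(\w+)\( (\w+) -> (\w+) \)")

def transitions(lines):
    """Return (host, timestamp, field, old, new) for every state change found."""
    out = []
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # not a DRBD kernel line in the expected layout
        for field, old, new in TRANS_RE.findall(m.group("msg")):
            out.append((m.group("host"), m.group("stamp"), field, old, new))
    return out

# Three lines copied from the server1 log above.
sample = [
    "May  1 08:15:13 server1 kernel: [3865117.570629] block drbd0: role( Primary -> Secondary )",
    "May  1 08:15:14 server1 kernel: [3865117.842211] block drbd0: peer( Secondary -> Unknown ) conn( Connected -> Disconnecting ) pdsk( UpToDate -> DUnknown )",
    "May  1 08:15:14 server1 kernel: [3865117.842384] block drbd0: asender terminated",
]

for host, stamp, field, old, new in transitions(sample):
    print(f"{stamp} {host} {field}: {old} -> {new}")
```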


Thanks for your answers,

Best regards,

Benjamin Linier