Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hi all!

Today I had a strange experience. The setup:

node A: 192.168.100.100, cc1-vie
    /dev/drbd1: primary
    /dev/drbd5: primary

node B: 192.168.100.101, cc1-sbg
    /dev/drbd1: secondary
    /dev/drbd5: secondary

The /dev/drbdX devices are used by a Xen domU.

resource manager-ha {
    startup {
        become-primary-on cc1-vie;
    }
    on cc1-vie {
        device    /dev/drbd1;
        disk      /dev/mapper/cc1--vienna-manager--disk--drbd;
        address   192.168.100.100:7789;
        meta-disk internal;
    }
    on cc1-sbg {
        device    /dev/drbd1;
        disk      /dev/mapper/cc1--sbg-manager--disk--drbd;
        address   192.168.100.101:7789;
        meta-disk internal;
    }
}

resource cc-manager-templates-ha {
    startup {
        become-primary-on cc1-vie;
    }
    on cc1-vie {
        device    /dev/drbd5;
        disk      /dev/mapper/cc1--vienna-cc--manager--templates--drbd;
        address   192.168.100.100:7793;
        meta-disk internal;
    }
    on cc1-sbg {
        device    /dev/drbd5;
        disk      /dev/mapper/cc1--sbg-cc--manager--templates--drbd;
        address   192.168.100.101:7793;
        meta-disk internal;
    }
}

Everything was running fine. Then I rebooted both servers and spotted this in the kernel log:

block drbd5: Starting worker thread (from cqueue [1573])
block drbd5: disk( Diskless -> Attaching )
block drbd5: Found 4 transactions (192 active extents) in activity log.
block drbd5: Method to ensure write ordering: barrier
block drbd5: Backing device's merge_bvec_fn() = ffffffff81431b10
block drbd5: max_segment_size ( = BIO size ) = 4096
block drbd5: drbd_bm_resize called with capacity == 41941688
block drbd5: resync bitmap: bits=5242711 words=81918
block drbd5: size = 20 GB (20970844 KB)
block drbd5: recounting of set bits took additional 0 jiffies
block drbd5: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
block drbd5: Marked additional 508 MB as out-of-sync based on AL.
block drbd5: disk( Attaching -> UpToDate )

This is the first thing that makes me nervous: there were about 500 MB to synchronize, although the server was idle and everything was in sync before the reboot.

Then I did some more reboots on node A and spotted it again:

block drbd5: Starting worker thread (from cqueue [1630])
block drbd5: disk( Diskless -> Attaching )
block drbd5: Found 4 transactions (126 active extents) in activity log.
block drbd5: Method to ensure write ordering: barrier
block drbd5: Backing device's merge_bvec_fn() = ffffffff81431b10
block drbd5: max_segment_size ( = BIO size ) = 4096
block drbd5: drbd_bm_resize called with capacity == 41941688
block drbd5: resync bitmap: bits=5242711 words=81918
block drbd5: size = 20 GB (20970844 KB)
block drbd5: recounting of set bits took additional 0 jiffies
block drbd5: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
block drbd5: Marked additional 488 MB as out-of-sync based on AL.
block drbd5: disk( Attaching -> UpToDate )

So why is it resynchronizing almost 500 MB again, although the partition is not used at all (it is just mounted in a domU)?
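For what it's worth, this is roughly how I check the resource after each reboot (standard drbdadm sub-commands, resource name from the config above; exact invocations from memory), and before the reboots everything reported Connected / UpToDate:

    # overall view, including the oos: (out-of-sync) counter
    cat /proc/drbd

    # per-resource connection state, roles and disk states
    drbdadm cstate cc-manager-templates-ha
    drbdadm state  cc-manager-templates-ha
    drbdadm dstate cc-manager-templates-ha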
Then, after some more reboots on node A, suddenly:

block drbd5: State change failed: Refusing to be Primary without at least one UpToDate disk
block drbd5:  state = { cs:WFConnection ro:Secondary/Unknown ds:Diskless/DUnknown r--- }
block drbd5: wanted = { cs:WFConnection ro:Primary/Unknown ds:Diskless/DUnknown r--- }
block drbd5: State change failed: Refusing to be Primary without at least one UpToDate disk
block drbd5:  state = { cs:WFConnection ro:Secondary/Unknown ds:Diskless/DUnknown r--- }
block drbd5: wanted = { cs:WFConnection ro:Primary/Unknown ds:Diskless/DUnknown r--- }

The status on node A was then:

cc-manager-templates-ha  Connected  Primary/Secondary  Diskless/UpToDate  A  r----

When I tried to manually attach the device, I got the error message "Split-Brain detected, dropping connection". After some googling, without finding any hint, the status suddenly changed to:

cc-manager-templates-ha  StandAlone  Primary/Unknown  UpToDate/DUnknown  r----  xen-vbd: _cc-manager

So suddenly this one device is no longer connected. All the other DRBD devices are still connected and working fine; only this single device is making problems, although it has an identical configuration.

What could cause such an issue? Everything was working fine, I just rebooted the servers. Any hints on what to do now to solve this? (See the PS below for the split-brain recovery I am considering.)

thanks
Klaus
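PS: If the answer is simply "resolve the split brain by hand and resync", this is the procedure I would try, pieced together from the DRBD user's guide. I am not sure which node's data should be discarded here, nor whether this is even valid while one side is Diskless, so please correct me if it is wrong:

    # on the node whose changes would be thrown away
    # (here: cc1-sbg, assuming node A has the good data)
    drbdadm secondary cc-manager-templates-ha
    drbdadm -- --discard-my-data connect cc-manager-templates-ha

    # on the surviving node (cc1-vie, currently StandAlone)
    drbdadm connect cc-manager-templates-ha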