Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hi all!
Today I had a strange experience.
node A: 192.168.100.100, cc1-vie
/dev/drbd1: primary
/dev/drbd5: primary
node B: 192.168.100.101, cc1-sbg
/dev/drbd1: secondary
/dev/drbd5: secondary
The /dev/drbdX devices are used by a xen domU.
resource manager-ha {
  startup {
    become-primary-on cc1-vie;
  }
  on cc1-vie {
    device /dev/drbd1;
    disk /dev/mapper/cc1--vienna-manager--disk--drbd;
    address 192.168.100.100:7789;
    meta-disk internal;
  }
  on cc1-sbg {
    device /dev/drbd1;
    disk /dev/mapper/cc1--sbg-manager--disk--drbd;
    address 192.168.100.101:7789;
    meta-disk internal;
  }
}
resource cc-manager-templates-ha {
  startup {
    become-primary-on cc1-vie;
  }
  on cc1-vie {
    device /dev/drbd5;
    disk /dev/mapper/cc1--vienna-cc--manager--templates--drbd;
    address 192.168.100.100:7793;
    meta-disk internal;
  }
  on cc1-sbg {
    device /dev/drbd5;
    disk /dev/mapper/cc1--sbg-cc--manager--templates--drbd;
    address 192.168.100.101:7793;
    meta-disk internal;
  }
}
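
For reference, the two resources only differ in minor number, backing disk and TCP port. To double-check how DRBD actually parses them and what state they are in, something like the following should do (plain drbdadm calls, nothing exotic):

  drbdadm dump manager-ha
  drbdadm dump cc-manager-templates-ha
  cat /proc/drbd                            # connection/role/disk state of all minors
  drbdadm cstate cc-manager-templates-ha    # connection state
  drbdadm dstate cc-manager-templates-ha    # local/peer disk state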
Everything was running fine. Then I rebooted both servers, and afterwards I spotted this in the kernel log:
block drbd5: Starting worker thread (from cqueue [1573])
block drbd5: disk( Diskless -> Attaching )
block drbd5: Found 4 transactions (192 active extents) in activity log.
block drbd5: Method to ensure write ordering: barrier
block drbd5: Backing device's merge_bvec_fn() = ffffffff81431b10
block drbd5: max_segment_size ( = BIO size ) = 4096
block drbd5: drbd_bm_resize called with capacity == 41941688
block drbd5: resync bitmap: bits=5242711 words=81918
block drbd5: size = 20 GB (20970844 KB)
block drbd5: recounting of set bits took additional 0 jiffies
block drbd5: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
block drbd5: Marked additional 508 MB as out-of-sync based on AL.
block drbd5: disk( Attaching -> UpToDate )
This is the first thing that makes me nervous: there were about 500 MB to synchronize, although the server was idle and everything was fully synchronized before the reboot.
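
If I understand the activity-log handling correctly, that is where the ~500 MB comes from: on attach, DRBD marks every extent that was still hot in the AL as out-of-sync, each AL extent covers 4 MiB, and the default of 127 extents (if I remember the default al-extents value correctly) is exactly 508 MiB. So it looks as if the whole default-sized AL window gets resynced after every reboot. Roughly, to check the counter and whether al-extents is set at all:

  cat /proc/drbd                                          # the oos: field shows out-of-sync KiB per minor
  drbdadm dump cc-manager-templates-ha | grep al-extents  # empty if al-extents is not set explicitly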
Then I did some more reboots on node A and spotted again:
block drbd5: Starting worker thread (from cqueue [1630])
block drbd5: disk( Diskless -> Attaching )
block drbd5: Found 4 transactions (126 active extents) in activity log.
block drbd5: Method to ensure write ordering: barrier
block drbd5: Backing device's merge_bvec_fn() = ffffffff81431b10
block drbd5: max_segment_size ( = BIO size ) = 4096
block drbd5: drbd_bm_resize called with capacity == 41941688
block drbd5: resync bitmap: bits=5242711 words=81918
block drbd5: size = 20 GB (20970844 KB)
block drbd5: recounting of set bits took additional 0 jiffies
block drbd5: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
block drbd5: Marked additional 488 MB as out-of-sync based on AL.
block drbd5: disk( Attaching -> UpToDate )
So why is it resynchronizing almost 500 MB again, although the partition is not used at all (it is only mounted in a domU)?
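
To rule out that the "idle" domU is quietly writing after all (journal commits, atime updates and the like would keep AL extents hot), one could watch the per-device counters for a while, e.g.:

  watch -n 5 cat /proc/drbd   # dw: (KiB written to the backing disk) and al: (activity-log updates)
                              # should stay constant on a really idle device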
After some more reboots on node A, suddenly this appeared:
block drbd5: State change failed: Refusing to be Primary without at least one UpToDate disk
block drbd5: state = { cs:WFConnection ro:Secondary/Unknown ds:Diskless/DUnknown r--- }
block drbd5: wanted = { cs:WFConnection ro:Primary/Unknown ds:Diskless/DUnknown r--- }
block drbd5: State change failed: Refusing to be Primary without at least one UpToDate disk
block drbd5: state = { cs:WFConnection ro:Secondary/Unknown ds:Diskless/DUnknown r--- }
block drbd5: wanted = { cs:WFConnection ro:Primary/Unknown ds:Diskless/DUnknown r--- }
Then the status on node A was:
cc-manager-templates-ha Connected Primary/Secondary Diskless/UpToDate A r----
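
If I read the Diskless/UpToDate part correctly, the local backing device never got attached at boot; the attach can be retried by hand with essentially:

  drbdadm attach cc-manager-templates-ha
  dmesg | tail -n 30   # check the kernel log for details if the attach fails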
When I tried to attach the device manually, I only got error messages: "Split-Brain detected, dropping connection".
After some googling without finding any hint, the status suddenly changed:
cc-manager-templates-ha StandAlone Primary/Unknown UpToDate/DUnknown r---- xen-vbd: _cc-manager
So suddenly this one device is not connected anymore. All the other DRBD devices are still connected and working fine; only this single one is causing problems, although its configuration is identical.
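
I assume the immediate split brain could be resolved with the usual procedure from the user's guide, roughly like this (assuming cc1-sbg is the side whose changes get thrown away; I have not run this yet):

  # on the node to be overwritten (assumption: cc1-sbg / node B)
  drbdadm secondary cc-manager-templates-ha
  drbdadm -- --discard-my-data connect cc-manager-templates-ha

  # on the surviving node (cc1-vie), if it is still StandAlone
  drbdadm connect cc-manager-templates-ha

But that would only cure the symptom, not explain it.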
What could cause such an issue? Everything was working fine; I just rebooted the servers.
Any hints on what to do now to solve this?
thanks
Klaus