Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hi Lars!

Am 07.12.2010 19:45, schrieb Lars Ellenberg:
> On Mon, Dec 06, 2010 at 06:08:19PM +0100, Klaus Darilion wrote:
>> Hi all!
>>
>> Today I had a strange experience.
>>
>> node A: 192.168.100.100, cc1-vie
>>   /dev/drbd1: primary
>>   /dev/drbd5: primary
>>
>> node B: 192.168.100.101, cc1-sbg
>>   /dev/drbd1: secondary
>>   /dev/drbd5: secondary
>>
>> The /dev/drbdX devices are used by a Xen domU.
>>
>> resource manager-ha {
>>     startup {
>>         become-primary-on cc1-vie;
>>     }
>>     on cc1-vie {
>>         device    /dev/drbd1;
>>         disk      /dev/mapper/cc1--vienna-manager--disk--drbd;
>>         address   192.168.100.100:7789;
>>         meta-disk internal;
>>     }
>>     on cc1-sbg {
>>         device    /dev/drbd1;
>>         disk      /dev/mapper/cc1--sbg-manager--disk--drbd;
>>         address   192.168.100.101:7789;
>>         meta-disk internal;
>>     }
>> }
>>
>> resource cc-manager-templates-ha {
>>     startup {
>>         become-primary-on cc1-vie;
>>     }
>>     on cc1-vie {
>>         device    /dev/drbd5;
>>         disk      /dev/mapper/cc1--vienna-cc--manager--templates--drbd;
>>         address   192.168.100.100:7793;
>>         meta-disk internal;
>>     }
>>     on cc1-sbg {
>>         device    /dev/drbd5;
>>         disk      /dev/mapper/cc1--sbg-cc--manager--templates--drbd;
>>         address   192.168.100.101:7793;
>>         meta-disk internal;
>>     }
>> }
>>
>> Everything was running fine. Then I rebooted both servers. Then I
>> spotted:
>>
>> block drbd5: Starting worker thread (from cqueue [1573])
>> block drbd5: disk( Diskless -> Attaching )
>> block drbd5: Found 4 transactions (192 active extents) in activity log.
>> block drbd5: Method to ensure write ordering: barrier
>> block drbd5: Backing device's merge_bvec_fn() = ffffffff81431b10
>> block drbd5: max_segment_size ( = BIO size ) = 4096
>> block drbd5: drbd_bm_resize called with capacity == 41941688
>> block drbd5: resync bitmap: bits=5242711 words=81918
>> block drbd5: size = 20 GB (20970844 KB)
>> block drbd5: recounting of set bits took additional 0 jiffies
>> block drbd5: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
>> block drbd5: Marked additional 508 MB as out-of-sync based on AL.
>> block drbd5: disk( Attaching -> UpToDate )
>>
>> This is the first thing which makes me nervous: there were 500 MB to
>> synchronize although the server was idle and everything was
>> synchronized before rebooting.
>
> As was pointed out already,
> read up on what we call the activity log.
>
>> Then some more reboots on node A and suddenly:
>>
>> block drbd5: State change failed: Refusing to be Primary without at
>> least one UpToDate disk
>> block drbd5: state = { cs:WFConnection ro:Secondary/Unknown
>> ds:Diskless/DUnknown r--- }
>      ^^^^^^^^
>
> You failed to attach, you have not yet connected,
> so DRBD refuses to become Primary: which data should it be Primary with?

But how can it be Secondary without any disk?

>> Then the status on node A was:
>>
>> cc-manager-templates-ha  Connected  Primary/Secondary
>>   Diskless/UpToDate  A  r----
>
> It was able to establish the connection,
> and was going Primary with the data of the peer.

Is this a feature? How can it know that the peer's data is up to date
when it cannot attach to the local disk?

>> When I tried to manually attach the device I got error messages:
>> "Split-Brain detected, dropping connection".
>
> Hm. Ugly.
> It should refuse the attach instead.
> Did it just get the error message wrong,
> or did it actually disconnect there?
> What DRBD version would that be?

Ubuntu 10.04:

# /etc/init.d/drbd status
drbd driver loaded OK; device status:
version: 8.3.7 (api:88/proto:86-91)
GIT-hash: ea9e28dbff98e331a62bcbcc63a6135808fe2917 build by root at
cc1-sbg, 2010-10-14 15:13:20

>> After some googling without finding any hint, suddenly the status
>> changed:
>>
>> cc-manager-templates-ha  StandAlone  Primary/Unknown
>>   UpToDate/DUnknown  r----  xen-vbd: _cc-manager
>>
>> So, suddenly this one device is not connected anymore. All the other
>> drbd devices are still connected and working fine - only this single
>> device is making problems, although it has identical configuration.
>>
>> What could cause such an issue? Everything was working fine, I just
>> rebooted the servers.
>>
>> Any hints what to do now to solve this issue?
>
> Your setup is broken.
> Apparently something in your boot process, at least "sometimes",
> claims the lower level devices so DRBD fails to attach.
> Fix that.

Almost done (see below).

> Your shutdown process is apparently broken enough to
> not really shutdown everything and demote/down DRBD
> so it stays Primary. That makes an "orderly" shutdown/reboot
> look like a Primary crash to DRBD.
> Fix that.

Done. DRBD was stopped before xendomains, so it refused to shut down
because Xen still had the volumes mounted. I changed the order of the
symlinks in /etc/rcX.d/ for drbd.
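Roughly what I did, as a sketch - the K08/K20 numbers are the ones I
found on my boxes and may differ elsewhere (check with
ls /etc/rc0.d /etc/rc6.d). Kill scripts run in ascending order, so
drbd needs a higher number than xendomains to be stopped after it:

# before: K08drbd ran before K20xendomains, so the domUs still
# held the DRBD volumes when drbd tried to go down
mv /etc/rc0.d/K08drbd /etc/rc0.d/K21drbd    # halt
mv /etc/rc6.d/K08drbd /etc/rc6.d/K21drbd    # reboot

Now the Xen domains are gone before drbd is demoted and stopped, and a
reboot no longer looks like a Primary crash to DRBD.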
> Are you sure that you have been the only one tampering with DRBD at the
> time, or would heartbeat/pacemaker/whatever try to do something at the
> same time?

No cluster managers - just me.

> And, BTW, no.
> Your /etc/hosts file has zero to do with how DRBD behaves.

At least I can reproduce the bad behavior when I add the bug to
/etc/hosts. I think it has something to do with how I address the
disk. The volume which works fine is configured with:

    disk /dev/mapper/cc1--vienna-manager--disk--drbd;

The volume which causes the problems is configured with:

    disk /dev/cc1-vienna/cc-manager-templates-drbd;

which is a symlink to /dev/mapper/cc1--vienna-cc--manager--templates--drbd.

So, I have no idea why, but it seems that if /etc/hosts is broken, the
symlinks are not available when DRBD starts. When I stop/start the
DRBD service after booting, DRBD attaches to the disks fine. Strange.

Thanks
Klaus
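PS: In case someone else trips over this, a minimal sketch of the
workaround I intend to try, assuming the /dev/mapper node (created by
device-mapper itself) is always present when DRBD starts, unlike the
LVM symlink: point the problematic resource directly at the mapper
node, as the unproblematic resource already does:

resource cc-manager-templates-ha {
    startup {
        become-primary-on cc1-vie;
    }
    on cc1-vie {
        device    /dev/drbd5;
        # mapper node instead of the /dev/cc1-vienna/... symlink,
        # so attaching no longer depends on the symlink existing
        disk      /dev/mapper/cc1--vienna-cc--manager--templates--drbd;
        address   192.168.100.100:7793;
        meta-disk internal;
    }
    on cc1-sbg {
        device    /dev/drbd5;
        disk      /dev/mapper/cc1--sbg-cc--manager--templates--drbd;
        address   192.168.100.101:7793;
        meta-disk internal;
    }
}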