Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hi Lars!
On 07.12.2010 19:45, Lars Ellenberg wrote:
> On Mon, Dec 06, 2010 at 06:08:19PM +0100, Klaus Darilion wrote:
>> Hi all!
>>
>> Today I had a strange experience.
>>
>> node A: 192.168.100.100, cc1-vie
>> /dev/drbd1: primary
>> /dev/drbd5: primary
>>
>> node B: 192.168.100.101, cc1-sbg
>> /dev/drbd1: secondary
>> /dev/drbd5: secondary
>>
>> The /dev/drbdX devices are used by a xen domU.
>>
>> resource manager-ha {
>> startup {
>> become-primary-on cc1-vie;
>> }
>> on cc1-vie {
>> device /dev/drbd1;
>> disk /dev/mapper/cc1--vienna-manager--disk--drbd;
>> address 192.168.100.100:7789;
>> meta-disk internal;
>> }
>> on cc1-sbg {
>> device /dev/drbd1;
>> disk /dev/mapper/cc1--sbg-manager--disk--drbd;
>> address 192.168.100.101:7789;
>> meta-disk internal;
>> }
>> }
>>
>> resource cc-manager-templates-ha {
>> startup {
>> become-primary-on cc1-vie;
>> }
>> on cc1-vie {
>> device /dev/drbd5;
>> disk /dev/mapper/cc1--vienna-cc--manager--templates--drbd;
>> address 192.168.100.100:7793;
>> meta-disk internal;
>> }
>> on cc1-sbg {
>> device /dev/drbd5;
>> disk /dev/mapper/cc1--sbg-cc--manager--templates--drbd;
>> address 192.168.100.101:7793;
>> meta-disk internal;
>> }
>> }
>>
>> Everything was running fine. Then I rebooted both servers and spotted:
>>
>> block drbd5: Starting worker thread (from cqueue [1573])
>> block drbd5: disk( Diskless -> Attaching )
>> block drbd5: Found 4 transactions (192 active extents) in activity log.
>> block drbd5: Method to ensure write ordering: barrier
>> block drbd5: Backing device's merge_bvec_fn() = ffffffff81431b10
>> block drbd5: max_segment_size ( = BIO size ) = 4096
>> block drbd5: drbd_bm_resize called with capacity == 41941688
>> block drbd5: resync bitmap: bits=5242711 words=81918
>> block drbd5: size = 20 GB (20970844 KB)
>> block drbd5: recounting of set bits took additional 0 jiffies
>> block drbd5: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
>> block drbd5: Marked additional 508 MB as out-of-sync based on AL.
>> block drbd5: disk( Attaching -> UpToDate )
>>
>>
>> This is the first thing which makes me nervous: There were 500MB to
>> synchronize although the server was idle and everything was
>> synchronized before rebooting.
>
> As was pointed out already,
> read up on what we call the activity log.
>
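OK. For the archives, as I understand it: the activity log tracks
recently written 4 MiB extents, and after a Primary that was not cleanly
demoted, everything still in the AL is resynced on the next attach,
which is where the ~508 MB above comes from. Its size is tunable in
drbd.conf (8.3 syntax; the number below is just illustrative):

  syncer {
    al-extents 257;  # each extent covers 4 MiB of the backing device
  }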
>> Then some more reboots on node A and suddenly:
>>
>> block drbd5: State change failed: Refusing to be Primary without at
>> least one UpToDate disk
>> block drbd5: state = { cs:WFConnection ro:Secondary/Unknown
>> ds:Diskless/DUnknown r--- }
> ^^^^^^^^
>
> You failed to attach, you have not yet connected,
> so DRBD refuses to become Primary: which data should it be Primary with?
But how can it be Secondary without any disk?
>> Then the status on node A was:
>>
>> cc-manager-templates-ha Connected Primary/Secondary
>> Diskless/UpToDate A r----
>
> It was able to establish the connection,
> and was going Primary with the data of the peer.
Is this a feature? How can it know that the peer's data is up to date
when it cannot attach to the local disk?
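For the record, the per-resource view can be queried like this (drbdadm
subcommands as of 8.3, resource name from the config above):

  drbdadm role cc-manager-templates-ha    # e.g. Primary/Secondary
  drbdadm dstate cc-manager-templates-ha  # e.g. Diskless/UpToDate
  drbdadm cstate cc-manager-templates-ha  # e.g. Connected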
>> When I tried to manually attach the device I got error messages:
>> "Split-Brain detected, dropping connection".
>
> Hm. Ugly.
> It should refuse the attach instead.
> Did it just get the error message wrong,
> or did it actually disconnect there?
> What DRBD version would that be?
Ubuntu 10.04:

# /etc/init.d/drbd status
drbd driver loaded OK; device status:
version: 8.3.7 (api:88/proto:86-91)
GIT-hash: ea9e28dbff98e331a62bcbcc63a6135808fe2917 build by
root@cc1-sbg, 2010-10-14 15:13:20
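In case it really was a split brain, the usual manual recovery with
drbdadm (8.3), assuming it is clear which node's changes may be
discarded, is roughly:

  # on the node whose changes are to be thrown away
  drbdadm secondary cc-manager-templates-ha
  drbdadm -- --discard-my-data connect cc-manager-templates-ha

  # on the surviving node, if it went StandAlone
  drbdadm connect cc-manager-templates-ha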
>
>> After some googling without finding any hint, the status suddenly changed:
>>
>> cc-manager-templates-ha StandAlone Primary/Unknown
>> UpToDate/DUnknown r---- xen-vbd: _cc-manager
>>
>>
>> So, suddenly this one device is not connected anymore. All the other
>> drbd devices are still connected and working fine - only this single
>> device is causing problems, although its configuration is identical.
>>
>>
>> What could cause such an issue? Everything was working fine, I just
>> rebooted the servers.
>>
>> Any hints what to do now to solve this issue?
>
> Your setup is broken.
> Apparently something in your boot process, at least "sometimes",
> claims the lower level devices so DRBD fails to attach.
> Fix that.
Almost done (see below).
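To see what claims the backing device at boot, something along these
lines should show the open count and any users (device name as in the
config above):

  # does device-mapper think the volume is still open?
  dmsetup info cc1--vienna-cc--manager--templates--drbd

  # which processes hold the node open, if any?
  lsof /dev/mapper/cc1--vienna-cc--manager--templates--drbd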
> Your shutdown process is apparently broken enough to
> not really shut down everything and demote/down DRBD
> so it stays Primary. That makes an "orderly" shutdown/reboot
> look like a Primary crash to DRBD.
> Fix that.
Done. DRBD was being shut down before xendomains, so DRBD refused to
shut down because Xen still had the volumes mounted. I changed the order
of the drbd symlinks in /etc/rcX.d/ accordingly.
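Concretely, something like this (the K-sequence numbers are just
illustrative; they only need to sort drbd after xendomains at stop
time):

  # in the halt/reboot runlevels, stop drbd later than xendomains
  mv /etc/rc0.d/K08drbd /etc/rc0.d/K21drbd
  mv /etc/rc6.d/K08drbd /etc/rc6.d/K21drbd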
>
> Are you sure that you have been the only one tampering with DRBD at the
> time, or would heartbeat/pacemaker/whatever try to do something at the
> same time?
No cluster managers, just me.
> And, BTW, no.
> Your /etc/hosts file has zero to do with how DRBD behaves.
At least I can reproduce the bad behavior by re-adding the broken entry
to /etc/hosts. I think it has something to do with how I address the
disk. The one volume which works fine is configured with:

disk /dev/mapper/cc1--vienna-manager--disk--drbd

The other volume, which causes the problems, is configured with

disk /dev/cc1-vienna/cc-manager-templates-drbd

which is a symlink to

/dev/mapper/cc1--vienna-cc--manager--templates--drbd

So, I have no idea why, but it seems that if /etc/hosts is broken, the
symlinks are not available when DRBD starts. When I stop/start the DRBD
service after booting, DRBD attaches to the disks fine. Strange.
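Given that, the obvious workaround seems to be to reference the
device-mapper node directly for this resource as well, i.e.:

  on cc1-vie {
    device /dev/drbd5;
    disk /dev/mapper/cc1--vienna-cc--manager--templates--drbd;
    address 192.168.100.100:7793;
    meta-disk internal;
  }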
Thanks
Klaus