[DRBD-user] Possible IPoIB deadlock with DRBD

Eric Blevins ericlb100 at gmail.com
Fri Jan 16 16:26:20 CET 2015


I do not have any mlx4 cards so I cannot test the fix.


Eric

On Fri, Jan 16, 2015 at 10:16 AM, Matteo Tescione <matteo at rmnet.it> wrote:
> Yes, I've seen the posts you suggested.
> Have you tested it?
> We are using the qib driver for QLogic 7342 adapters; I don't know whether anyone has ported the fix to it too.
>
> Regards,
>
> --
> matteo
>
> ----- Original message -----
>> From: "Eric Blevins" <ericlb100 at gmail.com>
>> To: "Matteo Tescione" <matteo at rmnet.it>
>> Cc: drbd-user at lists.linbit.com
>> Sent: Friday, 16 January 2015 15:50:17
>> Subject: Re: [DRBD-user] Possible IPoIB deadlock with DRBD
>>
>> The split brain would only happen on dual primary.
>>
>> We have Mellanox MHEA28-XTC cards using the mthca driver.
>>
>> The potential IPoIB deadlock is only fixed in the mlx4 driver so far.
>>
>>
>>
>> common {
>>   net {
>>     connect-int 20;  # default 10, unit 1 s
>>     timeout 180;     # default 60, unit 0.1 s
>>     ping-int 30;     # default 10, unit 1 s
>>     ping-timeout 10; # default 5, unit 0.1 s
>>     ko-count 20;
>>     max-buffers 16000;
>>     max-epoch-size 16000;
>>     sndbuf-size 0;
>>     rcvbuf-size 0;
>>     unplug-watermark 16001;
>>     verify-alg md5;
>>   }
>>   disk {
>>     c-plan-ahead 10;
>>     c-min-rate 30M;
>>     c-max-rate 200M;
>>     c-fill-target 20M;
>>     al-extents 3389;
>>     md-flushes no;
>>     disk-barrier no;
>>     disk-flushes no;
>>   }
>> }
>> resource drbd0 {
>>   device /dev/drbd0;
>>   disk /dev/sdc;
>>   meta-disk internal;
>>   startup {
>>     wfc-timeout  120;
>>     degr-wfc-timeout 60;
>>     outdated-wfc-timeout 60;
>>     become-primary-on both;
>>   }
>>   disk {
>>     c-max-rate 200M;
>>     c-min-rate 30M;
>>     c-fill-target 20M;
>>     c-plan-ahead 10;
>>
>>   }
>>   net {
>>     protocol C;
>>     cram-hmac-alg sha1;
>>     shared-secret "XXXXXXXXXXXXX";
>>     allow-two-primaries;
>>     after-sb-0pri discard-zero-changes;
>>     after-sb-1pri discard-secondary;
>>     after-sb-2pri disconnect;
>>   }
>>   on vm1 {
>>     address x.x.x.1:7788;
>>   }
>>   on vm2 {
>>     address x.x.x.2:7788;
>>   }
>> }
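
A quick worked check of the net timeouts in the config above, as a sketch only. It assumes the units given in the config comments (seconds for connect-int and ping-int, tenths of a second for timeout and ping-timeout). The ko-count reading (peer dropped after ko-count times timeout without completing a write) and the rule that timeout must be lower than connect-int and ping-int are my understanding of the drbd.conf man page, not something stated in this thread. All Python names are illustrative.

```python
# Values copied from the "net" section of the config quoted above.
net = {
    "connect-int": 20,
    "timeout": 180,
    "ping-int": 30,
    "ping-timeout": 10,
    "ko-count": 20,
}

# Number of config units per second (assumption taken from the
# comments in the config: "units 1" = seconds, "units .1" = tenths).
UNITS_PER_SECOND = {
    "connect-int": 1,
    "timeout": 10,
    "ping-int": 1,
    "ping-timeout": 10,
}


def to_seconds(settings):
    """Convert the time-valued net options to seconds."""
    return {name: settings[name] / UNITS_PER_SECOND[name]
            for name in UNITS_PER_SECOND}


seconds = to_seconds(net)

# Assumed meaning of ko-count: the peer is dropped if it fails to
# complete a write within ko-count * timeout.
ko_window = net["ko-count"] * seconds["timeout"]

# Assumed constraint: timeout must be lower than connect-int and ping-int.
timeout_ok = (seconds["timeout"] < seconds["connect-int"]
              and seconds["timeout"] < seconds["ping-int"])

for name, value in seconds.items():
    print(f"{name}: {value:g} s")
print(f"ko-count window: {ko_window:g} s")
print(f"timeout below connect-int and ping-int: {timeout_ok}")
```

With these values, timeout comes to 18 s, ping-timeout to 1 s, and the ko-count window to 360 s, so the timeout is still below both connect-int (20 s) and ping-int (30 s).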
>>
>> On Fri, Jan 16, 2015 at 5:32 AM, Matteo Tescione <matteo at rmnet.it>
>> wrote:
>> > Hi Eric,
>> >
>> > It seems that I'm hitting the same deadlock, but I don't use dual
>> > primary, and the split brain never occurs.
>> >
>> > Can you post your DRBD config along with the InfiniBand HBA model
>> > and version you're using?
>> >
>> > regards,
>> >
>> > --
>> > matteo
>> >
>> > ----- Original message -----
>> >> From: "Eric Blevins" <ericlb100 at gmail.com>
>> >> To: drbd-user at lists.linbit.com
>> >> Sent: Thursday, 15 January 2015 17:53:48
>> >> Subject: [DRBD-user] Possible IPoIB deadlock with DRBD
>> >>
>> >> We are using Proxmox with DRBD in dual primary, with IPoIB as the
>> >> transport. We recently tested Proxmox's upcoming 3.10 kernel, which
>> >> is based on the RHEL 7 kernel, and started having problems with DRBD.
>> >>
>> >> The kernel came with DRBD 8.4.3. I have also compiled and installed
>> >> 8.4.5, and both show the same problem.
>> >>
>> >> During times of heavy I/O load (backups), DRBD times out and
>> >> split-brains; I have included some logs below.
>> >> I stumbled on a couple of LKML threads that discuss a deadlock
>> >> between IPoIB and I/O that runs over IPoIB, such as iSCSI or NFS:
>> >> https://lkml.org/lkml/2014/2/21/655
>> >> http://lkml.org/lkml/2014/4/24/543
>> >>
>> >> Is it likely that DRBD could also trigger the deadlock discussed on
>> >> LKML?
>> >> If not, do you have any other suggestions on how I can prevent this
>> >> timeout?
>> >>
>> >>
>> >> Node A:
>> >> Jan  5 03:23:51 vm6 kernel: [2221944.335766] drbd drbd0: peer( Primary -> Unknown ) conn( Connected -> Timeout ) pdsk( UpToDate -> DUnknown )
>> >> Jan  5 03:23:51 vm6 kernel: [2221944.335782] drbd drbd0: asender terminated
>> >> Jan  5 03:23:51 vm6 kernel: [2221944.335784] drbd drbd0: Terminating drbd_a_drbd0
>> >> Jan  5 03:23:51 vm6 kernel: [2221944.335846] block drbd0: new current UUID BD9DB97EC672F5C9:8F2DD469C771058B:925C07CF6316212D:925B07CF6316212D
>> >> Jan  5 03:23:51 vm6 kernel: [2221944.347788] drbd drbd0: Connection closed
>> >> Jan  5 03:23:51 vm6 kernel: [2221944.347834] drbd drbd0: conn( Timeout -> Unconnected )
>> >> Jan  5 03:23:51 vm6 kernel: [2221944.347836] drbd drbd0: receiver terminated
>> >>
>> >>
>> >> Node B:
>> >> Jan  5 03:23:51 vm5 kernel: [2223090.170391] drbd drbd0: sock was shut down by peer
>> >> Jan  5 03:23:51 vm5 kernel: [2223090.170409] drbd drbd0: peer( Primary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown )
>> >> Jan  5 03:23:51 vm5 kernel: [2223090.170412] drbd drbd0: short read (expected size 16)
>> >> Jan  5 03:23:51 vm5 kernel: [2223090.170421] drbd drbd0: asender terminated
>> >> Jan  5 03:23:51 vm5 kernel: [2223090.170423] drbd drbd0: Terminating drbd_a_drbd0
>> >> Jan  5 03:23:51 vm5 kernel: [2223090.170480] block drbd0: new current UUID 2628F73F9DAE5EDF:8F2DD469C771058B:925C07CF6316212D:925B07CF6316212D
>> >> Jan  5 03:23:51 vm5 kernel: [2223090.185536] drbd drbd0: Connection closed
>> >> Jan  5 03:23:51 vm5 kernel: [2223090.185585] drbd drbd0: conn( BrokenPipe -> Unconnected )
>> >> Jan  5 03:23:51 vm5 kernel: [2223090.185587] drbd drbd0: receiver terminated
>> >>
>> >> Eric
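
To compare the two nodes' logs above side by side, the state transitions can be pulled out of the "field( old -> new )" fragments. This is a minimal sketch, assuming that fragment shape holds for every transition line; the regular expression and function name are illustrative, not part of DRBD, and the two sample lines are the first transition line from each node with the mail wrapping undone.

```python
import re

# Matches fragments such as "conn( Connected -> Timeout )".
TRANSITION = re.compile(r"(\w+)\(\s*(\w+)\s*->\s*(\w+)\s*\)")


def transitions(line):
    """Return (field, old_state, new_state) tuples found in one log line."""
    return TRANSITION.findall(line)


node_a = ("Jan  5 03:23:51 vm6 kernel: [2221944.335766] drbd drbd0: "
          "peer( Primary -> Unknown ) conn( Connected -> Timeout ) "
          "pdsk( UpToDate -> DUnknown )")
node_b = ("Jan  5 03:23:51 vm5 kernel: [2223090.170409] drbd drbd0: "
          "peer( Primary -> Unknown ) conn( Connected -> BrokenPipe ) "
          "pdsk( UpToDate -> DUnknown )")

for label, line in (("Node A", node_a), ("Node B", node_b)):
    for field, old, new in transitions(line):
        print(f"{label}: {field} {old} -> {new}")
```

Run on the two lines, this shows that the only difference is the conn field: Node A goes to Timeout, while Node B goes to BrokenPipe after the peer shut the socket down.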
>> >> _______________________________________________
>> >> drbd-user mailing list
>> >> drbd-user at lists.linbit.com
>> >> http://lists.linbit.com/mailman/listinfo/drbd-user
>>
>>


