[DRBD-user] Possible IPoIB deadlock with DRBD

Eric Blevins ericlb100 at gmail.com
Fri Jan 16 15:50:17 CET 2015



The split brain would only happen in a dual-primary setup.

We use Mellanox MHEA28-XTC cards with the mthca driver.

The potential IPoIB deadlock is only fixed in the mlx4 driver so far.



common {
  net {
    connect-int 20;   # default 10, unit 1 s
    timeout 180;      # default 60, unit 0.1 s
    ping-int 30;      # default 10, unit 1 s
    ping-timeout 10;  # default 5, unit 0.1 s
    ko-count 20;
    max-buffers 16000;
    max-epoch-size 16000;
    sndbuf-size 0;
    rcvbuf-size 0;
    unplug-watermark 16001;
    verify-alg md5;
  }
  disk {
    c-plan-ahead 10;
    c-min-rate 30M;
    c-max-rate 200M;
    c-fill-target 20M;
    al-extents 3389;
    md-flushes no;
    disk-barrier no;
    disk-flushes no;
  }
}
resource drbd0 {
  device /dev/drbd0;
  disk /dev/sdc;
  meta-disk internal;
  startup {
    wfc-timeout  120;
    degr-wfc-timeout 60;
    outdated-wfc-timeout 60;
    become-primary-on both;
  }
  disk {
    c-max-rate 200M;
    c-min-rate 30M;
    c-fill-target 20M;
    c-plan-ahead 10;

  }
  net {
    protocol C;
    cram-hmac-alg sha1;
    shared-secret "XXXXXXXXXXXXX";
    allow-two-primaries;
    after-sb-0pri discard-zero-changes;
    after-sb-1pri discard-secondary;
    after-sb-2pri disconnect;
  }
  on vm1 {
    address x.x.x.1:7788;
  }
  on vm2 {
    address x.x.x.2:7788;
  }
}
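
For readers comparing settings: the net timers above mix two units (whole seconds and tenths of a second), so a small conversion helps. The following is a minimal Python sketch, not part of the original post. The option names and values are copied from the config above. The "ko-window" figure assumes, per my reading of the drbd.conf documentation, that a peer is expelled after failing to complete a write for ko-count times the timeout, and the consistency check assumes the documented rule that timeout must be lower than connect-int and ping-int. The helper names are hypothetical.

```python
# Tenths of a second per configured unit, per the comments in the config above.
UNIT_TENTHS = {
    "connect-int": 10,   # configured in seconds
    "timeout": 1,        # configured in tenths of a second
    "ping-int": 10,      # configured in seconds
    "ping-timeout": 1,   # configured in tenths of a second
}

# Values copied from the "net" section of the posted config.
NET = {
    "connect-int": 20,
    "timeout": 180,
    "ping-int": 30,
    "ping-timeout": 10,
    "ko-count": 20,
}


def summarize(net):
    """Return each timer in seconds, plus the assumed ko-count window."""
    secs = {opt: net[opt] * tenths / 10 for opt, tenths in UNIT_TENTHS.items()}
    # Assumption: the peer is dropped after ko-count * timeout without progress.
    secs["ko-window"] = net["ko-count"] * secs["timeout"]
    return secs


def timeout_is_consistent(secs):
    """Check the assumed rule: timeout < connect-int and timeout < ping-int."""
    return secs["timeout"] < secs["connect-int"] and secs["timeout"] < secs["ping-int"]


if __name__ == "__main__":
    result = summarize(NET)
    for name, value in result.items():
        print(f"{name}: {value:g} s")
    print("timeout consistent:", timeout_is_consistent(result))
```

With the posted values this gives an 18 s timeout and a 1 s ping-timeout, which may be useful when comparing against another node's configuration.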

On Fri, Jan 16, 2015 at 5:32 AM, Matteo Tescione <matteo at rmnet.it> wrote:
> Hi Eric,
>
> it seems that I'm hitting the same deadlock, but I don't use dual primary, and the split brain never occurs.
>
> Can you post your DRBD config along with the InfiniBand HBA model and version you're using?
>
> regards,
>
> --
> matteo
>
> ----- Original Message -----
>> From: "Eric Blevins" <ericlb100 at gmail.com>
>> To: drbd-user at lists.linbit.com
>> Sent: Thursday, 15 January 2015 17:53:48
>> Subject: [DRBD-user] Possible IPoIB deadlock with DRBD
>>
>> We are using Proxmox with DRBD in dual-primary mode, using IPoIB for
>> transport.
>> We recently tested Proxmox's upcoming 3.10 kernel, which is based on the
>> kernel from RHEL 7, and started having problems with DRBD.
>>
>> The kernel came with DRBD 8.4.3. I have also compiled and installed
>> 8.4.5, and both experience the same problem.
>>
>> During times of heavy I/O load (backups), DRBD will time out and split
>> brain. I have included some logs below.
>> I stumbled on a couple of LKML threads that discuss a deadlock with
>> IPoIB and I/O that happens over IPoIB, such as iSCSI or NFS.
>> https://lkml.org/lkml/2014/2/21/655
>> http://lkml.org/lkml/2014/4/24/543
>>
>> Is it likely that DRBD could also trigger the deadlock discussed on
>> LKML?
>> If not, do you have any other suggestions on how I can prevent this
>> timeout?
>>
>>
>> Node A:
>> Jan  5 03:23:51 vm6 kernel: [2221944.335766] drbd drbd0: peer( Primary -> Unknown ) conn( Connected -> Timeout ) pdsk( UpToDate -> DUnknown )
>> Jan  5 03:23:51 vm6 kernel: [2221944.335782] drbd drbd0: asender terminated
>> Jan  5 03:23:51 vm6 kernel: [2221944.335784] drbd drbd0: Terminating drbd_a_drbd0
>> Jan  5 03:23:51 vm6 kernel: [2221944.335846] block drbd0: new current UUID BD9DB97EC672F5C9:8F2DD469C771058B:925C07CF6316212D:925B07CF6316212D
>> Jan  5 03:23:51 vm6 kernel: [2221944.347788] drbd drbd0: Connection closed
>> Jan  5 03:23:51 vm6 kernel: [2221944.347834] drbd drbd0: conn( Timeout -> Unconnected )
>> Jan  5 03:23:51 vm6 kernel: [2221944.347836] drbd drbd0: receiver terminated
>>
>>
>> Node B:
>> Jan  5 03:23:51 vm5 kernel: [2223090.170391] drbd drbd0: sock was shut down by peer
>> Jan  5 03:23:51 vm5 kernel: [2223090.170409] drbd drbd0: peer( Primary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown )
>> Jan  5 03:23:51 vm5 kernel: [2223090.170412] drbd drbd0: short read (expected size 16)
>> Jan  5 03:23:51 vm5 kernel: [2223090.170421] drbd drbd0: asender terminated
>> Jan  5 03:23:51 vm5 kernel: [2223090.170423] drbd drbd0: Terminating drbd_a_drbd0
>> Jan  5 03:23:51 vm5 kernel: [2223090.170480] block drbd0: new current UUID 2628F73F9DAE5EDF:8F2DD469C771058B:925C07CF6316212D:925B07CF6316212D
>> Jan  5 03:23:51 vm5 kernel: [2223090.185536] drbd drbd0: Connection closed
>> Jan  5 03:23:51 vm5 kernel: [2223090.185585] drbd drbd0: conn( BrokenPipe -> Unconnected )
>> Jan  5 03:23:51 vm5 kernel: [2223090.185587] drbd drbd0: receiver terminated
>>
>> Eric
>>
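
When comparing the two nodes' logs quoted above, it can help to pull out just the state transitions. The following is a minimal Python sketch, not from the thread. It assumes each kernel message sits on a single line and that transitions use the "name( Old -> New )" form seen in the quoted logs. The function name and the list of field names in the pattern are illustrative assumptions.

```python
import re

# Matches state-change fragments such as "conn( Connected -> Timeout )".
# The field list is an assumption based on the fields seen in the quoted logs.
TRANSITION = re.compile(r"\b(peer|conn|pdsk|role|disk)\(\s*(\w+)\s*->\s*(\w+)\s*\)")


def transitions(line):
    """Return (field, old_state, new_state) tuples found in one log line."""
    return [(m.group(1), m.group(2), m.group(3)) for m in TRANSITION.finditer(line)]


if __name__ == "__main__":
    # Sample lines taken from the Node A and Node B logs in the quoted message.
    node_a = ("Jan  5 03:23:51 vm6 kernel: [2221944.335766] drbd drbd0: "
              "peer( Primary -> Unknown ) conn( Connected -> Timeout ) "
              "pdsk( UpToDate -> DUnknown )")
    node_b = ("Jan  5 03:23:51 vm5 kernel: [2223090.185585] drbd drbd0: "
              "conn( BrokenPipe -> Unconnected )")
    for line in (node_a, node_b):
        for field, old, new in transitions(line):
            print(f"{field}: {old} -> {new}")
```

Lines without a transition, such as "asender terminated", simply yield an empty list.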


