I do not have any mlx4 cards so I cannot test the fix.

Eric

On Fri, Jan 16, 2015 at 10:16 AM, Matteo Tescione <matteo at rmnet.it> wrote:
> Yes, I've seen the posts you suggested.
> Have you tested it?
> We are using the qib driver for QLogic 7342 adapters; I don't know if
> someone has ported the fix to it too.
>
> Regards,
>
> --
> matteo
>
> ----- Original Message -----
>> From: "Eric Blevins" <ericlb100 at gmail.com>
>> To: "Matteo Tescione" <matteo at rmnet.it>
>> Cc: drbd-user at lists.linbit.com
>> Sent: Friday, 16 January 2015 15:50:17
>> Subject: Re: [DRBD-user] Possible IPoIB deadlock with DRBD
>>
>> The split brain would only happen on dual primary.
>>
>> We have Mellanox MHEA28-XTC using mthca driver.
>>
>> The potential IPoIB deadlock is only fixed in the mlx4 driver so far.
>>
>> common {
>>     net {
>>         connect-int 20;         # default 10, unit 1 s
>>         timeout 180;            # default 60, unit 0.1 s
>>         ping-int 30;            # default 10, unit 1 s
>>         ping-timeout 10;        # default 5, unit 0.1 s
>>         ko-count 20;
>>         max-buffers 16000;
>>         max-epoch-size 16000;
>>         sndbuf-size 0;
>>         rcvbuf-size 0;
>>         unplug-watermark 16001;
>>         verify-alg md5;
>>     }
>>     disk {
>>         c-plan-ahead 10;
>>         c-min-rate 30M;
>>         c-max-rate 200M;
>>         c-fill-target 20M;
>>         al-extents 3389;
>>         md-flushes no;
>>         disk-barrier no;
>>         disk-flushes no;
>>     }
>> }
>> resource drbd0 {
>>     device /dev/drbd0;
>>     disk /dev/sdc;
>>     meta-disk internal;
>>     startup {
>>         wfc-timeout 120;
>>         degr-wfc-timeout 60;
>>         outdated-wfc-timeout 60;
>>         become-primary-on both;
>>     }
>>     disk {
>>         c-max-rate 200M;
>>         c-min-rate 30M;
>>         c-fill-target 20M;
>>         c-plan-ahead 10;
>>     }
>>     net {
>>         protocol C;
>>         cram-hmac-alg sha1;
>>         shared-secret "XXXXXXXXXXXXX";
>>         allow-two-primaries;
>>         after-sb-0pri discard-zero-changes;
>>         after-sb-1pri discard-secondary;
>>         after-sb-2pri disconnect;
>>     }
>>     on vm1 {
>>         address x.x.x.1:7788;
>>     }
>>     on vm2 {
>>         address x.x.x.2:7788;
>>     }
>> }
>>
>> On Fri, Jan 16, 2015 at 5:32 AM, Matteo Tescione <matteo at rmnet.it> wrote:
>> > Hi Eric,
>> >
>> > it seems that I'm hitting the same deadlock, but I don't use dual
>> > primary, and the split brain never occurs.
>> >
>> > Can you post your drbd config along with the infiniband hba model
>> > and version you're using?
>> >
>> > regards,
>> >
>> > --
>> > matteo
>> >
>> > ----- Original Message -----
>> >> From: "Eric Blevins" <ericlb100 at gmail.com>
>> >> To: drbd-user at lists.linbit.com
>> >> Sent: Thursday, 15 January 2015 17:53:48
>> >> Subject: [DRBD-user] Possible IPoIB deadlock with DRBD
>> >>
>> >> We are using Proxmox with DRBD in dual primary using IPoIB for
>> >> transport.
>> >> We recently tested Proxmox's upcoming 3.10 kernel, based on the
>> >> kernel from RHEL 7, and started having problems with DRBD.
>> >>
>> >> The kernel came with DRBD 8.4.3. I have also compiled and installed
>> >> 8.4.5, and both experience the same problem.
>> >>
>> >> During times of heavy IO load (backups) DRBD will time out and
>> >> split brain. I have included some logs below.
>> >> I stumbled on a couple of LKML threads that discuss a deadlock with
>> >> IPoIB and IO that happens over IPoIB, such as iSCSI or NFS.
>> >> https://lkml.org/lkml/2014/2/21/655
>> >> http://lkml.org/lkml/2014/4/24/543
>> >>
>> >> Is it likely that DRBD could also trigger the deadlock discussed on
>> >> LKML?
>> >> If not, do you have any other suggestions on how I can prevent this
>> >> timeout?
>> >>
>> >> Node A:
>> >> Jan 5 03:23:51 vm6 kernel: [2221944.335766] drbd drbd0: peer( Primary -> Unknown ) conn( Connected -> Timeout ) pdsk( UpToDate -> DUnknown )
>> >> Jan 5 03:23:51 vm6 kernel: [2221944.335782] drbd drbd0: asender terminated
>> >> Jan 5 03:23:51 vm6 kernel: [2221944.335784] drbd drbd0: Terminating drbd_a_drbd0
>> >> Jan 5 03:23:51 vm6 kernel: [2221944.335846] block drbd0: new current UUID BD9DB97EC672F5C9:8F2DD469C771058B:925C07CF6316212D:925B07CF6316212D
>> >> Jan 5 03:23:51 vm6 kernel: [2221944.347788] drbd drbd0: Connection closed
>> >> Jan 5 03:23:51 vm6 kernel: [2221944.347834] drbd drbd0: conn( Timeout -> Unconnected )
>> >> Jan 5 03:23:51 vm6 kernel: [2221944.347836] drbd drbd0: receiver terminated
>> >>
>> >> Node B:
>> >> Jan 5 03:23:51 vm5 kernel: [2223090.170391] drbd drbd0: sock was shut down by peer
>> >> Jan 5 03:23:51 vm5 kernel: [2223090.170409] drbd drbd0: peer( Primary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown )
>> >> Jan 5 03:23:51 vm5 kernel: [2223090.170412] drbd drbd0: short read (expected size 16)
>> >> Jan 5 03:23:51 vm5 kernel: [2223090.170421] drbd drbd0: asender terminated
>> >> Jan 5 03:23:51 vm5 kernel: [2223090.170423] drbd drbd0: Terminating drbd_a_drbd0
>> >> Jan 5 03:23:51 vm5 kernel: [2223090.170480] block drbd0: new current UUID 2628F73F9DAE5EDF:8F2DD469C771058B:925C07CF6316212D:925B07CF6316212D
>> >> Jan 5 03:23:51 vm5 kernel: [2223090.185536] drbd drbd0: Connection closed
>> >> Jan 5 03:23:51 vm5 kernel: [2223090.185585] drbd drbd0: conn( BrokenPipe -> Unconnected )
>> >> Jan 5 03:23:51 vm5 kernel: [2223090.185587] drbd drbd0: receiver terminated
>> >>
>> >> Eric
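
For reference, the net timeouts in the posted config mix two units: per the inline comments, connect-int and ping-int are in seconds while timeout and ping-timeout are in tenths of a second. The sketch below converts the posted values to seconds. The helper names (`to_seconds`, `summarize`) are hypothetical, and the "ko-count window" line assumes the usual reading of ko-count as a number of timeout periods before the peer is dropped; treat it as a rough aid for reading the config, not as a description of DRBD internals.

```python
# Tenths of a second per configured unit, following the unit comments
# in the posted config (connect-int and ping-int in 1 s, timeout and
# ping-timeout in 0.1 s).
TENTHS_PER_UNIT = {
    "connect-int": 10,
    "ping-int": 10,
    "timeout": 1,
    "ping-timeout": 1,
}

# Values copied from the net section of the posted config.
POSTED_NET = {
    "connect-int": 20,
    "timeout": 180,
    "ping-int": 30,
    "ping-timeout": 10,
    "ko-count": 20,
}


def to_seconds(option, value):
    """Convert a configured value to seconds (hypothetical helper)."""
    return value * TENTHS_PER_UNIT[option] / 10


def summarize(net):
    """Return the timing options of a net section in seconds."""
    seconds = {
        opt: to_seconds(opt, net[opt])
        for opt in TENTHS_PER_UNIT
        if opt in net
    }
    # Assumption: ko-count is counted in timeout periods.
    if "ko-count" in net and "timeout" in seconds:
        seconds["ko-count window"] = net["ko-count"] * seconds["timeout"]
    return seconds


if __name__ == "__main__":
    for name, secs in summarize(POSTED_NET).items():
        print(f"{name}: {secs:g} s")
```

With the posted values this gives a timeout of 18 s and a ping-timeout of 1 s, which may be worth keeping in mind when reading the "Connected -> Timeout" transition in the Node A log.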