Dear Igor,

Thank you for your reply.

It seems that I managed to sort it out. In the end, I created a script that monitored the logs and ran several checks whenever the messages were generated.

It turns out that the issue was memory-related. Basically, it seems that the IPoIB module needed n*64kb memory pages available to work and there were none, because the customer had computational jobs running on the head node. I could see that with iperf, the bandwidth was around 100kb/s.

I have enabled the vm.zone_reclaim_mode variable, so the kernel will be able to reclaim the cached memory. It will probably impact performance, but at least DRBD is stable. It's been 2 days without errors :)

Regards,

On 8 June 2017 at 02:00, Igor Cicimov <icicimov at gmail.com> wrote:

>
> On 8 Jun 2017 9:40 am, "Igor Cicimov" <icicimov at gmail.com> wrote:
>
> On 6 Jun 2017 7:23 pm, "Andrea del Monaco" <andrea.delmonaco at clustervision.com> wrote:
>
> Hello everybody,
>
> I am currently facing some issues with the DRBD synchronization.
> Here is the config file:
> global {
>     usage-count no;
> }
>
> common {
>     startup {
>         wfc-timeout 15;
>         degr-wfc-timeout 15;
>         outdated-wfc-timeout 15;
>     }
>     disk {
>         resync-rate 80M;
>         disk-flushes no;
>         disk-barrier no;
>         al-extents 3389;
>         c-fill-target 0;
>         c-plan-ahead 18;
>         c-max-rate 200M;
>     }
>     net {
>         protocol C;
>         max-buffers 8000;
>         max-epoch-size 8000;
>         sndbuf-size 1024k;
>     }
> }
>
> resource cmshareddrbdres {
>     net {
>         cram-hmac-alg sha1;
>         shared-secret xxxxxxx;
>         after-sb-0pri discard-younger-primary;
>         after-sb-1pri discard-secondary;
>         csums-alg md5;
>     }
>     on master1 {
>         device /dev/drbd1;
>         disk /dev/sdb;
>         address 10.149.255.254:7789;
>         meta-disk internal;
>     }
>     on master2 {
>         device /dev/drbd1;
>         disk /dev/sdb;
>         address 10.149.255.253:7789;
>         meta-disk internal;
>     }
> }
>
> The network 10.149.0.0/16 is using IPoIB.
>
> The messages that I see are (first master): https://pastebin.com/0xCLceeD
>
> Suspect messages:
> [Sun Jun 4 03:50:17 2017] block drbd1: logical block size of local backend does not match (drbd:512, backend:4096); was this a late attach?
> [Sun Jun 4 03:51:01 2017] drbd cmshareddrbdres: [drbd_w_cmshared/3640] sock_sendmsg time expired, ko = 6
> [Sun Jun 4 03:34:12 2017] block drbd1: We did not send a P_BARRIER for 84203ms > ko-count (7) * timeout (60 * 0.1s); drbd kernel thread blocked?
> (I see so many of these)
>
> To me, it looks like there is some issue with the network, but I am not
> sure, because in that case I would expect DRBD to be able to send the
> messages and then time out on the other side.
>
> I have tried to stress it and I couldn't reproduce it, so it doesn't seem
> to be load-related.
>
> [root@master1 ~]# uname -r
> 3.10.0-327.el7.x86_64
> [root@master1 ~]# rpm -qa | grep drbd
> kmod-drbd84-8.4.7-1_1.el7.elrepo.x86_64
> drbd84-utils-8.9.5-1.el7.elrepo.x86_64
>
> Any ideas?
>
> Regards,
> --
> Andrea Del Monaco
> Internal Engineer
>
> Mob: +31 64 166 4003
> Skype: delmonaco.andrea
> andrea.delmonaco at clustervision.com
>
> ClusterVision BV
> Gyroscoopweg 56
> 1042 AC Amsterdam
> The Netherlands
> Tel: +31 20 407 7550
> Fax: +31 84 759 8389
> www.clustervision.com
>
> The ko-count thing from the log means the secondary fails to commit the
> writes in the expected time frame, which looks to me like a backing device
> storage/driver/OS issue rather than DRBD. I would check if that works
> properly first if I was you.
>
> Then test the network speed (if you haven't done so already); a timeout of
> 7x6=42sec is way too high for InfiniBand for this kind of issue. By the
> way, there is a Linbit technical guide for IPoIB which I hope you did
> consult.
>

--
Andrea Del Monaco
Internal Engineer
ClusterVision BV
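
For anyone hitting the same symptoms, here is a minimal Python sketch of the kind of check described in the reply: watch for the DRBD stall messages and, when one appears, look at how many free contiguous memory blocks of at least 64 kB each zone has. This is not the script that was actually used. The function names (is_drbd_stall, free_blocks, report) are hypothetical, and it assumes 4 KiB base pages (so 64 kB corresponds to buddy order 4) and the usual /proc/buddyinfo layout of one line per zone with the free-block counts for orders 0, 1, 2, ... after the zone name.

```python
import re

# Kernel messages quoted in this thread that indicate DRBD could not send in time.
DRBD_STALL = re.compile(r"sock_sendmsg time expired|We did not send a P_BARRIER")


def is_drbd_stall(line):
    """Return True if a kernel log line looks like one of the DRBD stall messages."""
    return DRBD_STALL.search(line) is not None


def free_blocks(buddyinfo_text, min_order=4):
    """Map (node, zone) to the number of free blocks of order >= min_order.

    Assumption: 4 KiB base pages, so order 4 is a 64 kB contiguous block.
    """
    result = {}
    for line in buddyinfo_text.splitlines():
        parts = line.split()
        # Expected shape: Node 0, zone Normal <count order 0> <count order 1> ...
        if len(parts) < 5 or parts[0] != "Node" or parts[2] != "zone":
            continue
        node = parts[1].rstrip(",")
        zone = parts[3]
        counts = [int(c) for c in parts[4:]]
        result[(node, zone)] = sum(counts[min_order:])
    return result


def report(log_lines, buddyinfo_text, min_order=4):
    """Pair every stall message with the zones that have no free block of the wanted size."""
    starved = sorted(
        "node %s zone %s" % (node, zone)
        for (node, zone), free in free_blocks(buddyinfo_text, min_order).items()
        if free == 0
    )
    return [(line.strip(), starved) for line in log_lines if is_drbd_stall(line)]
```

To try it, pass the lines from the kernel log (for example the output of dmesg) as log_lines and the contents of /proc/buddyinfo as buddyinfo_text. A zone listed as starved only hints at fragmentation; it does not prove that IPoIB allocations are failing.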