[DRBD-user] DRBD errors

Fri Jun 9 10:23:49 CEST 2017

Dear Igor,

Thank you for your reply.
It seems that I managed to sort it out.

In the end, I created a script that monitored the logs and ran several
checks whenever the error messages were logged.
It turns out that the issue was memory-related.
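
The script itself is not included in this message. As a rough illustration
only, a log watcher along those lines could look like the sketch below. The
patterns are taken from the kernel messages quoted further down; the function
names and the use of `dmesg -T` as the log source are assumptions, not the
actual script.

```python
import re
import subprocess

# Patterns copied from the DRBD kernel messages quoted in this thread.
STALL_PATTERNS = [
    re.compile(r"sock_sendmsg time expired"),
    re.compile(r"We did not send a P_BARRIER"),
]


def find_drbd_stalls(lines):
    """Return the log lines that match one of the DRBD stall patterns."""
    return [line for line in lines
            if any(p.search(line) for p in STALL_PATTERNS)]


def read_kernel_log():
    """Hypothetical log source: the kernel ring buffer via `dmesg -T`."""
    result = subprocess.run(["dmesg", "-T"], capture_output=True, text=True)
    return result.stdout.splitlines()
```

Calling find_drbd_stalls(read_kernel_log()) periodically, and recording
memory statistics whenever it returns something, is one way to correlate
the errors with the state of the node.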

Basically, it seems that the IPoIB module needed n*64 KB memory pages
available to work, and there were none, because the customer had
computational jobs running on the head node.
I could see the effect with iperf: the bandwidth was around 100kb/s.
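
One way to check for this kind of shortage is /proc/buddyinfo, which lists,
per node and zone, the number of free blocks of 2**k contiguous pages. The
sketch below assumes 4 KiB base pages (as on x86_64), so a 64 KB block is
order 4; it also assumes that free contiguous blocks are the relevant
quantity here. The function names are hypothetical.

```python
def free_blocks_by_order(buddyinfo_text):
    """Sum free block counts per order across all nodes and zones.

    Each /proc/buddyinfo line looks like:
        Node 0, zone   Normal   c0 c1 c2 ...
    where c_k is the number of free blocks of 2**k pages.
    """
    totals = []
    for line in buddyinfo_text.splitlines():
        fields = line.split()
        if len(fields) < 5 or fields[0] != "Node":
            continue
        counts = [int(x) for x in fields[4:]]
        if len(counts) > len(totals):
            totals.extend([0] * (len(counts) - len(totals)))
        for order, count in enumerate(counts):
            totals[order] += count
    return totals


def free_blocks_at_least(buddyinfo_text, min_order=4):
    """Count free blocks of order >= min_order (order 4 = 64 KiB
    with 4 KiB pages). Larger blocks can be split to serve a request."""
    return sum(free_blocks_by_order(buddyinfo_text)[min_order:])


def read_buddyinfo(path="/proc/buddyinfo"):
    with open(path) as f:
        return f.read()
```

A result of 0 from free_blocks_at_least(read_buddyinfo()) would mean no
free block of 64 KiB or larger is currently available.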

I have enabled the vm.zone_reclaim_mode sysctl, so the kernel will be able
to reclaim the cached memory. It will probably hurt performance, but at
least DRBD is stable.

It's been 2 days without errors :)

Regards,

On 8 June 2017 at 02:00, Igor Cicimov <icicimov at gmail.com> wrote:

>
>
> On 8 Jun 2017 9:40 am, "Igor Cicimov" <icicimov at gmail.com> wrote:
>
>
>
> On 6 Jun 2017 7:23 pm, "Andrea del Monaco" <andrea.delmonaco at clustervision.com> wrote:
>
> Hello everybody,
>
> I am currently facing some issues with the DRBD synchronization.
> Here is the config file:
> global {
>         usage-count no;
> }
>
> common {
>         startup {
>                 wfc-timeout 15;
>                 degr-wfc-timeout 15;
>                 outdated-wfc-timeout 15;
>         }
>         disk {
>                 resync-rate 80M;
>                 disk-flushes no;
>                 disk-barrier no;
>                 al-extents 3389;
>                 c-fill-target 0;
>                 c-plan-ahead 18;
>                 c-max-rate 200M;
>         }
>         net {
>                 protocol C;
>                 max-buffers 8000;
>                 max-epoch-size 8000;
>                 sndbuf-size 1024k;
>         }
> }
>
> resource cmshareddrbdres {
>         net {
>                 cram-hmac-alg sha1;
>                 shared-secret xxxxxxx;
>                 after-sb-0pri discard-younger-primary;
>                 after-sb-1pri discard-secondary;
>                 csums-alg md5;
>         }
>         on master1 {
>                 device     /dev/drbd1;
>                 disk       /dev/sdb;
>                 address    10.149.255.254:7789;
>                 meta-disk  internal;
>         }
>         on master2 {
>                 device     /dev/drbd1;
>                 disk       /dev/sdb;
>                 address    10.149.255.253:7789;
>                 meta-disk  internal;
>         }
> }
>
> The network 10.149.0.0/16 is using IPoIB.
>
> The messages that I see are (first master): https://pastebin.com/0xCLceeD
>
> Suspect messages:
> [Sun Jun  4 03:50:17 2017] block drbd1: logical block size of local
> backend does not match (drbd:512, backend:4096); was this a late attach?
> [Sun Jun  4 03:51:01 2017] drbd cmshareddrbdres: [drbd_w_cmshared/3640]
> sock_sendmsg time expired, ko = 6
> [Sun Jun  4 03:34:12 2017] block drbd1: We did not send a P_BARRIER for
> 84203ms > ko-count (7) * timeout (60 * 0.1s); drbd kernel thread blocked?
> (I see so many of these)
>
> To me, it looks like a network issue, but I am not sure, because in that
> case I would expect DRBD to be able to send the messages and then time
> out waiting for the other side.
>
> I have tried to stress it and I couldn't reproduce it, so it doesn't seem
> to be load-related.
>
> [root@master1 ~]# uname -r
> 3.10.0-327.el7.x86_64
> [root@master1 ~]# rpm -qa | grep drbd
> kmod-drbd84-8.4.7-1_1.el7.elrepo.x86_64
> drbd84-utils-8.9.5-1.el7.elrepo.x86_64
>
> Any ideas?
>
>
> Regards,
> --
>
> Andrea Del Monaco
> Internal Engineer
>
>
> Mob: +31 64 166 4003
> Skype: delmonaco.andrea
> andrea.delmonaco at clustervision.com
>
> ClusterVision BV
> Gyroscoopweg 56
> 1042 AC Amsterdam
> The Netherlands
> Tel: +31 20 407 7550
> Fax: +31 84 759 8389
> www.clustervision.com
>
>
>
> The ko-count message in the log means the secondary fails to commit the
> writes in the expected time frame, which looks to me like a backing device
> storage/driver/OS issue rather than DRBD. I would check that this works
> properly first if I were you.
>
> Then test the network speed (if you haven't done so already); a timeout of
> 7x6=42sec is way too high for InfiniBand for this kind of issue. By the
> way, there is a LINBIT technical guide for IPoIB, which I hope you did
> consult.
>
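
For reference, the 42-second figure follows from the values printed in the
P_BARRIER log line: ko-count (7) times timeout (60, in units of 0.1 s). A
minimal worked check, with a hypothetical helper name:

```python
def drbd_ko_window_seconds(ko_count, timeout_tenths):
    """Time DRBD waits before giving up on the peer:
    ko-count * timeout, with timeout given in units of 0.1 s."""
    return ko_count * timeout_tenths / 10.0


# Values taken from the quoted log line.
window = drbd_ko_window_seconds(ko_count=7, timeout_tenths=60)
observed = 84203 / 1000.0  # "84203ms" from the same log line
print(window, observed)
```

With these numbers the observed stall (about 84 s) is roughly twice the
42 s window.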

-- 

Andrea Del Monaco
Internal Engineer

Mob: +31 64 166 4003
Skype: delmonaco.andrea
andrea.delmonaco at clustervision.com

ClusterVision BV
Gyroscoopweg 56
1042 AC Amsterdam
The Netherlands
Tel: +31 20 407 7550
Fax: +31 84 759 8389
www.clustervision.com