<div dir="ltr"><font face="monospace">Good Morning,</font><div><font face="monospace"><br></font></div><div><font face="monospace">When DRBD syncs only a few resources, everything works fine. But when DRBD needs to sync all resources (e.g. after a host comes back up), the business application running on top of it hangs.<br></font></div><div><font face="monospace"><br></font></div><div><font face="monospace">All our DRBD configuration settings are the defaults; here is a sample resource:</font></div><div><font face="monospace"><br></font></div><div><font face="monospace">resource &quot;vm-100-disk-3&quot; {<br>    options {<br>        cpu-mask                &quot;&quot;; # default<br>        on-no-data-accessible   io-error; # default<br>        auto-promote            yes; # default<br>        peer-ack-window         4096s; # bytes, default<br>        peer-ack-delay          100; # milliseconds, default<br>        twopc-timeout           300; # 1/10 seconds, default<br>        twopc-retry-timeout     1; # 1/10 seconds, default<br>        auto-promote-timeout    20; # 1/10 seconds, default<br>        max-io-depth            8000; # default<br>        quorum                  majority;<br>        on-no-quorum            io-error;<br>        quorum-minimum-redundancy       off; # default<br>        on-suspended-primary-outdated   disconnect; # default<br>    }<br>    _this_host {<br>        node-id                 0;<br>        volume 0 {<br>            device                      minor 1017;<br>            disk                        &quot;/dev/vgthc1/vm-100-disk-3_00000&quot;;<br>            meta-disk                   internal;<br>            disk {<br>                size                    0s; # bytes, default<br>                on-io-error             detach; # default<br>                disk-barrier            no; # default<br>                disk-flushes            yes; # default<br>                disk-drain              yes; # default<br>                md-flushes              yes; # 
default<br>                resync-after            -1; # default<br>                al-extents              1237; # default<br>                al-updates              yes; # default<br>                discard-zeroes-if-aligned       yes; # default<br>                disable-write-same      no; # default<br>                disk-timeout            0; # 1/10 seconds, default<br>                read-balancing          prefer-local; # default<br>                rs-discard-granularity  1048576; # bytes<br>            }<br>        }<br>    }<br>    connection {<br>        _peer_node_id 2;<br>        path {<br>            _this_host ipv4 <a href="http://10.0.7.106:7017">10.0.7.106:7017</a>;<br>            _remote_host ipv4 <a href="http://10.100.1.3:7017">10.100.1.3:7017</a>;<br>        }<br>        net {<br>            transport           &quot;&quot;; # default<br>            protocol            C; # default<br>            timeout             60; # 1/10 seconds, default<br>            max-epoch-size      2048; # default<br>            connect-int         10; # seconds, default<br>            ping-int            10; # seconds, default<br>            sndbuf-size         0; # bytes, default<br>            rcvbuf-size         0; # bytes, default<br>            ko-count            7; # default<br>            allow-two-primaries no; # default<br>            cram-hmac-alg       &quot;sha1&quot;;<br>            shared-secret       &quot;*&quot;;<br>            after-sb-0pri       disconnect; # default<br>            after-sb-1pri       disconnect; # default<br>            after-sb-2pri       disconnect; # default<br>            always-asbp         no; # default<br>            rr-conflict         disconnect; # default<br>            ping-timeout        5; # 1/10 seconds, default<br>            data-integrity-alg  &quot;&quot;; # default<br>            tcp-cork            yes; # default<br>            on-congestion       block; # default<br>            congestion-fill     0s; # 
bytes, default<br>            congestion-extents  1237; # default<br>            csums-alg           &quot;&quot;; # default<br>            csums-after-crash-only      no; # default<br>            verify-alg          &quot;crct10dif-pclmul&quot;;<br>            use-rle             yes; # default<br>            socket-check-timeout        0; # default<br>            fencing             dont-care; # default<br>            max-buffers         2048; # default<br>            allow-remote-read   yes; # default<br>            _name               &quot;C&quot;;<br>        }<br>        volume 0 {<br>            disk {<br>                resync-rate             250k; # bytes/second, default<br>                c-plan-ahead            20; # 1/10 seconds, default<br>                c-delay-target          10; # 1/10 seconds, default<br>                c-fill-target           100s; # bytes, default<br>                c-max-rate              102400k; # bytes/second, default<br>                c-min-rate              250k; # bytes/second, default<br>                bitmap                  no;<br>            }<br>        }<br>    }<br>    connection {<br>        _peer_node_id 1;<br>        path {<br>            _this_host ipv4 <a href="http://10.0.7.106:7017">10.0.7.106:7017</a>;<br>            _remote_host ipv4 <a href="http://10.0.7.105:7017">10.0.7.105:7017</a>;<br>        }<br>        net {<br>            transport           &quot;&quot;; # default<br>            protocol            C; # default<br>            timeout             60; # 1/10 seconds, default<br>            max-epoch-size      2048; # default<br>            connect-int         10; # seconds, default<br>            ping-int            10; # seconds, default<br>            sndbuf-size         0; # bytes, default<br>            rcvbuf-size         0; # bytes, default<br>            ko-count            7; # default<br>            allow-two-primaries no; # default<br>            cram-hmac-alg       
&quot;sha1&quot;;<br>            shared-secret       &quot;*&quot;;<br>            after-sb-0pri       disconnect; # default<br>            after-sb-1pri       disconnect; # default<br>            after-sb-2pri       disconnect; # default<br>            always-asbp         no; # default<br>            rr-conflict         disconnect; # default<br>            ping-timeout        5; # 1/10 seconds, default<br>            data-integrity-alg  &quot;&quot;; # default<br>            tcp-cork            yes; # default<br>            on-congestion       block; # default<br>            congestion-fill     0s; # bytes, default<br>            congestion-extents  1237; # default<br>            csums-alg           &quot;&quot;; # default<br>            csums-after-crash-only      no; # default<br>            verify-alg          &quot;crct10dif-pclmul&quot;;<br>            use-rle             yes; # default<br>            socket-check-timeout        0; # default<br>            fencing             dont-care; # default<br>            max-buffers         2048; # default<br>            allow-remote-read   yes; # default<br>            _name               &quot;T&quot;;<br>        }<br>        volume 0 {<br>            disk {<br>                resync-rate             250k; # bytes/second, default<br>                c-plan-ahead            20; # 1/10 seconds, default<br>                c-delay-target          10; # 1/10 seconds, default<br>                c-fill-target           100s; # bytes, default<br>                c-max-rate              102400k; # bytes/second, default<br>                c-min-rate              250k; # bytes/second, default<br>                bitmap                  yes; # default<br>            }<br>        }<br>    }<br>}</font><br></div><div><font face="monospace"><br></font></div><div><span style="font-family:monospace">We have 39 defined </span>resources<span style="font-family:monospace"> using the same settings. 
All of these resources run on the same RAID array, backed by two physical NVMe SSD drives. </span></div><div><span style="font-family:monospace">We have two combined hosts and a diskless satellite host. </span><span style="font-family:monospace">The network link between the two hosts uses a 1 Gb card.</span></div><div><span style="font-family:monospace"><br></span></div><div><span style="font-family:monospace">I have read the guide at </span><font face="monospace"><a href="https://kb.linbit.com/tuning-drbds-resync-controller">https://kb.linbit.com/tuning-drbds-resync-controller</a> and I think our current installation may need tuning to avoid these application hangs.</font></div><div><font face="monospace"><br></font></div><div><font face="monospace">I think I have to tune c-max-rate for all the devices, but I&#39;m not sure. Is there a way to limit the total c-max-rate globally? Or do I have to limit it per resource so that the sum across all resources does not exceed our physical limits?</font></div><div><font face="monospace"><br></font></div><div><font face="monospace">I&#39;ve seen a global_common configuration, but I don&#39;t know whether it is a global configuration for the whole DRBD system or a configuration applied to each defined resource individually.</font></div><div><font face="monospace"><br></font></div><div><font face="monospace">If anyone can guide me through this, I&#39;ll be grateful. Thanks and regards,</font></div><div><font face="monospace">Ferran </font></div></div>