<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">
<meta name="Generator" content="Microsoft Word 15 (filtered medium)">
<style><!--
/* Font Definitions */
@font-face
        {font-family:"Cambria Math";
        panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
        {font-family:Calibri;
        panose-1:2 15 5 2 2 2 4 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
        {margin:0cm;
        margin-bottom:.0001pt;
        font-size:11.0pt;
        font-family:"Calibri",sans-serif;
        mso-fareast-language:EN-US;}
a:link, span.MsoHyperlink
        {mso-style-priority:99;
        color:#0563C1;
        text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
        {mso-style-priority:99;
        color:#954F72;
        text-decoration:underline;}
p.MsoPlainText, li.MsoPlainText, div.MsoPlainText
        {mso-style-priority:99;
        mso-style-link:"Nur Text Zchn";
        margin:0cm;
        margin-bottom:.0001pt;
        font-size:11.0pt;
        font-family:"Calibri",sans-serif;
        mso-fareast-language:EN-US;}
span.E-MailFormatvorlage17
        {mso-style-type:personal-compose;
        font-family:"Calibri",sans-serif;
        color:windowtext;}
span.NurTextZchn
        {mso-style-name:"Nur Text Zchn";
        mso-style-priority:99;
        mso-style-link:"Nur Text";
        font-family:"Calibri",sans-serif;}
.MsoChpDefault
        {mso-style-type:export-only;
        font-family:"Calibri",sans-serif;
        mso-fareast-language:EN-US;}
@page WordSection1
        {size:612.0pt 792.0pt;
        margin:70.85pt 70.85pt 2.0cm 70.85pt;}
div.WordSection1
        {page:WordSection1;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]-->
</head>
<body lang="DE-AT" link="#0563C1" vlink="#954F72">
<div class="WordSection1">
<p class="MsoPlainText">Hi list,<o:p></o:p></p>
<p class="MsoPlainText"><o:p> </o:p></p>
<p class="MsoPlainText">We have sporadic problems with DRBD 8.4.11 on four production two-node Proxmox 7 clusters.<o:p></o:p></p>
<p class="MsoPlainText">We use DRBD in dual-primary mode, and the virtual servers on both nodes have their virtio hard disks on the same DRBD storage.<o:p></o:p></p>
<p class="MsoPlainText">The DRBD connection drops at irregular intervals, anywhere from twice a week to once in three months, sometimes during the day and sometimes in the middle of the night.<o:p></o:p></p>
<p class="MsoPlainText"><o:p> </o:p></p>
<p class="MsoPlainText">In all four clusters the nodes are directly attached over a 10Gb NIC connection (some with bonding in active-backup mode, some without bonding).<o:p></o:p></p>
<p class="MsoPlainText">The hardware, including NIC, RAID controller and hard disks, differs between the four clusters but is identical on the two nodes of each cluster.<o:p></o:p></p>
<p class="MsoPlainText">According to our monitoring system, processor, RAM and disk utilization are at their usual levels when the problem occurs.<o:p></o:p></p>
<p class="MsoPlainText"><o:p> </o:p></p>
<p class="MsoPlainText">Below is the configuration used on one of the affected clusters. We have been using the same configuration on older systems for a long time without any problems (it is only slightly modified from cluster to cluster, depending on hard disks and network, or for tests to solve this problem).<o:p></o:p></p>
<p class="MsoPlainText">The kernel version on the nodes of this cluster is 5.4.157-1-pve.<o:p></o:p></p>
<p class="MsoPlainText">The problem occurs sometimes on the sas resource and sometimes on the ssd resource, but never on both at the same time; it also occurs on clusters that have only a single ssd resource.<o:p></o:p></p>
<p class="MsoPlainText"><o:p> </o:p></p>
<p class="MsoPlainText">/etc/drbd.d/global_common.conf:<o:p></o:p></p>
<p class="MsoPlainText">common { syncer { rate 1024M; verify-alg md5; } }<o:p></o:p></p>
<p class="MsoPlainText"><o:p> </o:p></p>
<p class="MsoPlainText">/etc/drbd.d/sas.res:<o:p></o:p></p>
<p class="MsoPlainText">resource sas {<o:p></o:p></p>
<p class="MsoPlainText"> protocol C;<o:p></o:p></p>
<p class="MsoPlainText"> startup {<o:p></o:p></p>
<p class="MsoPlainText"> wfc-timeout 0;<o:p></o:p></p>
<p class="MsoPlainText"> degr-wfc-timeout 60;<o:p></o:p></p>
<p class="MsoPlainText"> become-primary-on both;<o:p></o:p></p>
<p class="MsoPlainText"> }<o:p></o:p></p>
<p class="MsoPlainText"> disk {<o:p></o:p></p>
<p class="MsoPlainText"> c-fill-target 24M; # 10M<o:p></o:p></p>
<p class="MsoPlainText"> c-max-rate 700M;<o:p></o:p></p>
<p class="MsoPlainText"> c-plan-ahead 0; # 7<o:p></o:p></p>
<p class="MsoPlainText"> c-min-rate 4M;<o:p></o:p></p>
<p class="MsoPlainText"> }<o:p></o:p></p>
<p class="MsoPlainText"><o:p> </o:p></p>
<p class="MsoPlainText"> net {<o:p></o:p></p>
<p class="MsoPlainText"> max-buffers 36864;<o:p></o:p></p>
<p class="MsoPlainText"> sndbuf-size 1048576; # bytes<o:p></o:p></p>
<p class="MsoPlainText"> rcvbuf-size 2097152; # bytes<o:p></o:p></p>
<p class="MsoPlainText"> allow-two-primaries yes;<o:p></o:p></p>
<p class="MsoPlainText"> cram-hmac-alg "sha1";<o:p></o:p></p>
<p class="MsoPlainText"> shared-secret "<SECRET-REMOVED>";<o:p></o:p></p>
<p class="MsoPlainText"> after-sb-0pri discard-zero-changes;<o:p></o:p></p>
<p class="MsoPlainText"> after-sb-1pri discard-secondary;<o:p></o:p></p>
<p class="MsoPlainText"> verify-alg "md5";<o:p></o:p></p>
<p class="MsoPlainText"> ping-timeout 10;<o:p></o:p></p>
<p class="MsoPlainText"> }<o:p></o:p></p>
<p class="MsoPlainText"><o:p> </o:p></p>
<p class="MsoPlainText"> on clu01-prox01 {<o:p></o:p></p>
<p class="MsoPlainText"> device /dev/drbd1;<o:p></o:p></p>
<p class="MsoPlainText"> disk /dev/sdb1;<o:p></o:p></p>
<p class="MsoPlainText"> address <IP-REMOVED>:7788;<o:p></o:p></p>
<p class="MsoPlainText"> meta-disk internal;<o:p></o:p></p>
<p class="MsoPlainText"> }<o:p></o:p></p>
<p class="MsoPlainText"><o:p> </o:p></p>
<p class="MsoPlainText"> on clu01-prox02 {<o:p></o:p></p>
<p class="MsoPlainText"> device /dev/drbd1;<o:p></o:p></p>
<p class="MsoPlainText"> disk /dev/sdb1;<o:p></o:p></p>
<p class="MsoPlainText"> address <IP-REMOVED>:7788;<o:p></o:p></p>
<p class="MsoPlainText"> meta-disk internal;<o:p></o:p></p>
<p class="MsoPlainText"> }<o:p></o:p></p>
<p class="MsoPlainText">}<o:p></o:p></p>
<p class="MsoPlainText"><o:p> </o:p></p>
<p class="MsoPlainText">/etc/drbd.d/ssd.res:<o:p></o:p></p>
<p class="MsoPlainText">resource ssd {<o:p></o:p></p>
<p class="MsoPlainText"> protocol C;<o:p></o:p></p>
<p class="MsoPlainText"> startup {<o:p></o:p></p>
<p class="MsoPlainText"> wfc-timeout 0;<o:p></o:p></p>
<p class="MsoPlainText"> degr-wfc-timeout 60;<o:p></o:p></p>
<p class="MsoPlainText"> become-primary-on both;<o:p></o:p></p>
<p class="MsoPlainText"> }<o:p></o:p></p>
<p class="MsoPlainText"> disk {<o:p></o:p></p>
<p class="MsoPlainText"> c-fill-target 10M;<o:p></o:p></p>
<p class="MsoPlainText"> c-max-rate 700M;<o:p></o:p></p>
<p class="MsoPlainText"> c-plan-ahead 7;<o:p></o:p></p>
<p class="MsoPlainText"> c-min-rate 4M;<o:p></o:p></p>
<p class="MsoPlainText"> }<o:p></o:p></p>
<p class="MsoPlainText"><o:p> </o:p></p>
<p class="MsoPlainText"> net {<o:p></o:p></p>
<p class="MsoPlainText"> max-buffers 36864;<o:p></o:p></p>
<p class="MsoPlainText"> sndbuf-size 1048576; # bytes<o:p></o:p></p>
<p class="MsoPlainText"> rcvbuf-size 2097152; # bytes<o:p></o:p></p>
<p class="MsoPlainText"> allow-two-primaries yes;<o:p></o:p></p>
<p class="MsoPlainText"> cram-hmac-alg "sha1";<o:p></o:p></p>
<p class="MsoPlainText"> shared-secret "<SECRET-REMOVED>";<o:p></o:p></p>
<p class="MsoPlainText"> after-sb-0pri discard-zero-changes;<o:p></o:p></p>
<p class="MsoPlainText"> after-sb-1pri discard-secondary;<o:p></o:p></p>
<p class="MsoPlainText"> verify-alg "md5";<o:p></o:p></p>
<p class="MsoPlainText"> ping-timeout 10;<o:p></o:p></p>
<p class="MsoPlainText"> }<o:p></o:p></p>
<p class="MsoPlainText"><o:p> </o:p></p>
<p class="MsoPlainText"> on clu01-prox01 {<o:p></o:p></p>
<p class="MsoPlainText"> device /dev/drbd0;<o:p></o:p></p>
<p class="MsoPlainText"> disk /dev/sda4;<o:p></o:p></p>
<p class="MsoPlainText"> address <IP-REMOVED>:7787;<o:p></o:p></p>
<p class="MsoPlainText"> meta-disk internal;<o:p></o:p></p>
<p class="MsoPlainText"> }<o:p></o:p></p>
<p class="MsoPlainText"><o:p> </o:p></p>
<p class="MsoPlainText"> on clu01-prox02 {<o:p></o:p></p>
<p class="MsoPlainText"> device /dev/drbd0;<o:p></o:p></p>
<p class="MsoPlainText"> disk /dev/sda4;<o:p></o:p></p>
<p class="MsoPlainText"> address <IP-REMOVED>:7787;<o:p></o:p></p>
<p class="MsoPlainText"> meta-disk internal;<o:p></o:p></p>
<p class="MsoPlainText"> }<o:p></o:p></p>
<p class="MsoPlainText">}<o:p></o:p></p>
<p class="MsoPlainText"><o:p> </o:p></p>
<p class="MsoPlainText">dmesg on clu01-prox01:<o:p></o:p></p>
<p class="MsoPlainText"><o:p> </o:p></p>
<p class="MsoPlainText">[Tue Jan 18 12:16:39 2022] drbd sas: sock was shut down by peer<o:p></o:p></p>
<p class="MsoPlainText">[Tue Jan 18 12:16:39 2022] drbd sas: peer( Primary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown )<o:p></o:p></p>
<p class="MsoPlainText">[Tue Jan 18 12:16:39 2022] drbd sas: short read (expected size 16)<o:p></o:p></p>
<p class="MsoPlainText">[Tue Jan 18 12:16:39 2022] drbd sas: ack_receiver terminated<o:p></o:p></p>
<p class="MsoPlainText">[Tue Jan 18 12:16:39 2022] drbd sas: Terminating drbd_a_sas<o:p></o:p></p>
<p class="MsoPlainText">[Tue Jan 18 12:16:39 2022] block drbd1: new current UUID 3EBD520B3C71E1F5:95CE24B4D7282A27:716814458AF45B2B:716714458AF45B2B<o:p></o:p></p>
<p class="MsoPlainText">[Tue Jan 18 12:16:39 2022] drbd sas: Connection closed<o:p></o:p></p>
<p class="MsoPlainText">[Tue Jan 18 12:16:39 2022] drbd sas: conn( BrokenPipe -> Unconnected )<o:p></o:p></p>
<p class="MsoPlainText">[Tue Jan 18 12:16:39 2022] drbd sas: receiver terminated<o:p></o:p></p>
<p class="MsoPlainText">[Tue Jan 18 12:16:39 2022] drbd sas: Restarting receiver thread<o:p></o:p></p>
<p class="MsoPlainText">[Tue Jan 18 12:16:39 2022] drbd sas: receiver (re)started<o:p></o:p></p>
<p class="MsoPlainText">[Tue Jan 18 12:16:39 2022] drbd sas: conn( Unconnected -> WFConnection )<o:p></o:p></p>
<p class="MsoPlainText">[Tue Jan 18 12:16:40 2022] drbd sas: Handshake successful: Agreed network protocol version 101<o:p></o:p></p>
<p class="MsoPlainText">[Tue Jan 18 12:16:40 2022] drbd sas: Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES.<o:p></o:p></p>
<p class="MsoPlainText">[Tue Jan 18 12:16:40 2022] drbd sas: Peer authenticated using 20 bytes HMAC<o:p></o:p></p>
<p class="MsoPlainText">[Tue Jan 18 12:16:40 2022] drbd sas: conn( WFConnection -> WFReportParams )<o:p></o:p></p>
<p class="MsoPlainText">[Tue Jan 18 12:16:40 2022] drbd sas: Starting ack_recv thread (from drbd_r_sas [2860])<o:p></o:p></p>
<p class="MsoPlainText">[Tue Jan 18 12:16:40 2022] block drbd1: drbd_sync_handshake:<o:p></o:p></p>
<p class="MsoPlainText">[Tue Jan 18 12:16:40 2022] block drbd1: self 3EBD520B3C71E1F5:95CE24B4D7282A27:716814458AF45B2B:716714458AF45B2B bits:0 flags:0<o:p></o:p></p>
<p class="MsoPlainText">[Tue Jan 18 12:16:40 2022] block drbd1: peer 94F2E64E6B58A76D:95CE24B4D7282A27:716814458AF45B2B:716714458AF45B2B bits:0 flags:0<o:p></o:p></p>
<p class="MsoPlainText">[Tue Jan 18 12:16:40 2022] block drbd1: uuid_compare()=100 by rule 90<o:p></o:p></p>
<p class="MsoPlainText">[Tue Jan 18 12:16:40 2022] block drbd1: helper command: /sbin/drbdadm initial-split-brain minor-1<o:p></o:p></p>
<p class="MsoPlainText">[Tue Jan 18 12:16:40 2022] block drbd1: helper command: /sbin/drbdadm initial-split-brain minor-1 exit code 0 (0x0)<o:p></o:p></p>
<p class="MsoPlainText">[Tue Jan 18 12:16:40 2022] block drbd1: Split-Brain detected but unresolved, dropping connection!<o:p></o:p></p>
<p class="MsoPlainText">[Tue Jan 18 12:16:40 2022] block drbd1: helper command: /sbin/drbdadm split-brain minor-1<o:p></o:p></p>
<p class="MsoPlainText">[Tue Jan 18 12:16:40 2022] block drbd1: helper command: /sbin/drbdadm split-brain minor-1 exit code 0 (0x0)<o:p></o:p></p>
<p class="MsoPlainText">[Tue Jan 18 12:16:40 2022] drbd sas: conn( WFReportParams -> Disconnecting )<o:p></o:p></p>
<p class="MsoPlainText">[Tue Jan 18 12:16:40 2022] drbd sas: error receiving ReportState, e: -5 l: 0!<o:p></o:p></p>
<p class="MsoPlainText">[Tue Jan 18 12:16:40 2022] drbd sas: ack_receiver terminated<o:p></o:p></p>
<p class="MsoPlainText">[Tue Jan 18 12:16:40 2022] drbd sas: Terminating drbd_a_sas<o:p></o:p></p>
<p class="MsoPlainText">[Tue Jan 18 12:16:40 2022] drbd sas: Connection closed<o:p></o:p></p>
<p class="MsoPlainText">[Tue Jan 18 12:16:40 2022] drbd sas: conn( Disconnecting -> StandAlone )<o:p></o:p></p>
<p class="MsoPlainText">[Tue Jan 18 12:16:40 2022] drbd sas: receiver terminated<o:p></o:p></p>
<p class="MsoPlainText">[Tue Jan 18 12:16:40 2022] drbd sas: Terminating drbd_r_sas<o:p></o:p></p>
<p class="MsoPlainText"><o:p> </o:p></p>
<p class="MsoPlainText">dmesg on clu01-prox02:<o:p></o:p></p>
<p class="MsoPlainText"><o:p> </o:p></p>
<p class="MsoPlainText">[Tue Jan 18 12:16:39 2022] drbd sas: PingAck did not arrive in time.<o:p></o:p></p>
<p class="MsoPlainText">[Tue Jan 18 12:16:39 2022] drbd sas: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )<o:p></o:p></p>
<p class="MsoPlainText">[Tue Jan 18 12:16:39 2022] block drbd1: new current UUID 94F2E64E6B58A76D:95CE24B4D7282A27:716814458AF45B2B:716714458AF45B2B<o:p></o:p></p>
<p class="MsoPlainText">[Tue Jan 18 12:16:39 2022] drbd sas: ack_receiver terminated<o:p></o:p></p>
<p class="MsoPlainText">[Tue Jan 18 12:16:39 2022] drbd sas: Terminating drbd_a_sas<o:p></o:p></p>
<p class="MsoPlainText">[Tue Jan 18 12:16:39 2022] drbd sas: Connection closed<o:p></o:p></p>
<p class="MsoPlainText">[Tue Jan 18 12:16:39 2022] drbd sas: conn( NetworkFailure -> Unconnected )<o:p></o:p></p>
<p class="MsoPlainText">[Tue Jan 18 12:16:39 2022] drbd sas: receiver terminated<o:p></o:p></p>
<p class="MsoPlainText">[Tue Jan 18 12:16:39 2022] drbd sas: Restarting receiver thread<o:p></o:p></p>
<p class="MsoPlainText">[Tue Jan 18 12:16:39 2022] drbd sas: receiver (re)started<o:p></o:p></p>
<p class="MsoPlainText">[Tue Jan 18 12:16:39 2022] drbd sas: conn( Unconnected -> WFConnection )<o:p></o:p></p>
<p class="MsoPlainText">[Tue Jan 18 12:16:40 2022] drbd sas: Handshake successful: Agreed network protocol version 101<o:p></o:p></p>
<p class="MsoPlainText">[Tue Jan 18 12:16:40 2022] drbd sas: Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES.<o:p></o:p></p>
<p class="MsoPlainText">[Tue Jan 18 12:16:40 2022] drbd sas: Peer authenticated using 20 bytes HMAC<o:p></o:p></p>
<p class="MsoPlainText">[Tue Jan 18 12:16:40 2022] drbd sas: conn( WFConnection -> WFReportParams )<o:p></o:p></p>
<p class="MsoPlainText">[Tue Jan 18 12:16:40 2022] drbd sas: Starting ack_recv thread (from drbd_r_sas [1841])<o:p></o:p></p>
<p class="MsoPlainText">[Tue Jan 18 12:16:40 2022] block drbd1: drbd_sync_handshake:<o:p></o:p></p>
<p class="MsoPlainText">[Tue Jan 18 12:16:40 2022] block drbd1: self 94F2E64E6B58A76D:95CE24B4D7282A27:716814458AF45B2B:716714458AF45B2B bits:0 flags:0<o:p></o:p></p>
<p class="MsoPlainText">[Tue Jan 18 12:16:40 2022] block drbd1: peer 3EBD520B3C71E1F5:95CE24B4D7282A27:716814458AF45B2B:716714458AF45B2B bits:0 flags:0<o:p></o:p></p>
<p class="MsoPlainText">[Tue Jan 18 12:16:40 2022] block drbd1: uuid_compare()=100 by rule 90<o:p></o:p></p>
<p class="MsoPlainText">[Tue Jan 18 12:16:40 2022] block drbd1: helper command: /sbin/drbdadm initial-split-brain minor-1<o:p></o:p></p>
<p class="MsoPlainText">[Tue Jan 18 12:16:40 2022] block drbd1: helper command: /sbin/drbdadm initial-split-brain minor-1 exit code 0 (0x0)<o:p></o:p></p>
<p class="MsoPlainText">[Tue Jan 18 12:16:40 2022] block drbd1: Split-Brain detected but unresolved, dropping connection!<o:p></o:p></p>
<p class="MsoPlainText">[Tue Jan 18 12:16:40 2022] block drbd1: helper command: /sbin/drbdadm split-brain minor-1<o:p></o:p></p>
<p class="MsoPlainText">[Tue Jan 18 12:16:40 2022] block drbd1: helper command: /sbin/drbdadm split-brain minor-1 exit code 0 (0x0)<o:p></o:p></p>
<p class="MsoPlainText">[Tue Jan 18 12:16:40 2022] drbd sas: conn( WFReportParams -> Disconnecting )<o:p></o:p></p>
<p class="MsoPlainText">[Tue Jan 18 12:16:40 2022] drbd sas: error receiving ReportState, e: -5 l: 0!<o:p></o:p></p>
<p class="MsoPlainText">[Tue Jan 18 12:16:40 2022] drbd sas: ack_receiver terminated<o:p></o:p></p>
<p class="MsoPlainText">[Tue Jan 18 12:16:40 2022] drbd sas: Terminating drbd_a_sas<o:p></o:p></p>
<p class="MsoPlainText">[Tue Jan 18 12:16:40 2022] drbd sas: Connection closed<o:p></o:p></p>
<p class="MsoPlainText">[Tue Jan 18 12:16:40 2022] drbd sas: conn( Disconnecting -> StandAlone )<o:p></o:p></p>
<p class="MsoPlainText">[Tue Jan 18 12:16:40 2022] drbd sas: receiver terminated<o:p></o:p></p>
<p class="MsoPlainText">[Tue Jan 18 12:16:40 2022] drbd sas: Terminating drbd_r_sas<o:p></o:p></p>
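<p class="MsoPlainText"><o:p> </o:p></p>
<p class="MsoPlainText">To compare the two nodes side by side, it can help to reduce each log to its conn( ... ) state transitions. The following is only a minimal sketch, assuming the line format shown in the excerpts above; the function name conn_transitions and the two regular expressions are illustrative and not part of DRBD:<o:p></o:p></p>
<pre>
```python
import re

# Assumed line format, taken from the excerpts above:
#   [Tue Jan 18 12:16:39 2022] drbd sas: conn( Unconnected -> WFConnection )
LINE_RE = re.compile(r"^\[([^\]]+)\] drbd (\S+): (.*)$")
CONN_RE = re.compile(r"conn\( (\w+) -> (\w+) \)")


def conn_transitions(lines):
    """Return (timestamp, resource, old_state, new_state) for each conn() change."""
    result = []
    for line in lines:
        line_match = LINE_RE.match(line.strip())
        if line_match is None:
            continue  # not a per-resource "drbd NAME:" line, e.g. "block drbd1: ..."
        timestamp, resource, rest = line_match.groups()
        conn_match = CONN_RE.search(rest)
        if conn_match is not None:
            result.append(
                (timestamp, resource, conn_match.group(1), conn_match.group(2))
            )
    return result


# A few lines copied from the clu01-prox02 excerpt above.
sample = [
    "[Tue Jan 18 12:16:39 2022] drbd sas: PingAck did not arrive in time.",
    "[Tue Jan 18 12:16:39 2022] drbd sas: peer( Primary -> Unknown ) "
    "conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )",
    "[Tue Jan 18 12:16:40 2022] block drbd1: Split-Brain detected but unresolved, dropping connection!",
    "[Tue Jan 18 12:16:40 2022] drbd sas: conn( Disconnecting -> StandAlone )",
]

for entry in conn_transitions(sample):
    print(entry)
```
</pre>
<p class="MsoPlainText">Run against the full logs of both nodes, this gives two short timelines (Connected, then BrokenPipe or NetworkFailure, and finally StandAlone) that are easier to line up than the raw output.<o:p></o:p></p>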
<p class="MsoPlainText"><o:p> </o:p></p>
<p class="MsoPlainText">Best Regards,<o:p></o:p></p>
<p class="MsoNormal">Christian<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
</body>
</html>