[DRBD-user] two-node proxmox 7 cluster drbd 8.4.11 terminating

Christian Haingartner c.haingartner at sysup.at
Tue Feb 1 16:16:58 CET 2022


Hi list,



we are sporadically experiencing problems with DRBD 8.4.11 on four productive two-node Proxmox 7 clusters.

We use DRBD in dual-primary mode, and the virtual servers on both nodes have their VirtIO hard disks on the same DRBD storage.

Anywhere from twice a week to once in three months, sometimes during the day and sometimes in the middle of the night, the DRBD connection terminates.



All four clusters are directly attached via a 10 Gb NIC connection (some with bonding in active-backup mode, some without bonding).

The hardware, including NICs, RAID controllers and hard disks, differs between the four clusters but is identical on the two nodes of each cluster.

According to our monitoring system, processor, RAM and disk utilization are at their usual levels when the problem occurs.



Below is the configuration used on one of the affected clusters. It is the same configuration we have been using on older systems for a long time without any problems (it is only slightly modified from cluster to cluster, depending on hard disks and network, or for tests while trying to solve this problem).

The kernel version on these cluster nodes is 5.4.157-1-pve.

The problem occurs sometimes on the sas resource and sometimes on the ssd resource, but never on both at the same time; it also occurs on clusters with only a single ssd resource.



/etc/drbd.d/global_common.conf:

common { syncer { rate 1024M; verify-alg md5; } }



/etc/drbd.d/sas.res:

resource sas {

        protocol C;

        startup {

                wfc-timeout  0;

                degr-wfc-timeout 60;

                become-primary-on both;

        }

        disk {

               c-fill-target 24M; # 10M

               c-max-rate   700M;

               c-plan-ahead    0; # 7

               c-min-rate     4M;

        }



        net {

                max-buffers             36864;

                sndbuf-size             1048576; # bytes

                rcvbuf-size             2097152; # bytes

                allow-two-primaries     yes;

                cram-hmac-alg           "sha1";

                shared-secret           "<SECRET-REMOVED>";

                after-sb-0pri           discard-zero-changes;

                after-sb-1pri           discard-secondary;

                verify-alg              "md5";

                ping-timeout            10;

        }



        on clu01-prox01 {

                device /dev/drbd1;

                disk /dev/sdb1;

                address <IP-REMOVED>:7788;

                meta-disk internal;

        }



        on clu01-prox02 {

                device /dev/drbd1;

                disk /dev/sdb1;

                address <IP-REMOVED>:7788;

                meta-disk internal;

        }

}
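One note on the net section, since both nodes run as Primary here: the after-sb-0pri and after-sb-1pri policies above do not apply to a split-brain detected while both nodes are Primary; that case is governed by after-sb-2pri, which is unset above and (to my knowledge) defaults to disconnect in DRBD 8.4. That would match the "Split-Brain detected but unresolved, dropping connection!" lines in the logs below. A sketch of the fragment, with the default written out explicitly:

```
net {
        # Split-brain policy when both nodes were Primary; "disconnect" is
        # the 8.4 default and leads to the StandAlone state seen in the logs.
        after-sb-2pri disconnect;
}
```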



/etc/drbd.d/ssd.res:

resource ssd {

        protocol C;

        startup {

                wfc-timeout  0;

                degr-wfc-timeout 60;

                become-primary-on both;

        }

        disk {

               c-fill-target 10M;

               c-max-rate   700M;

               c-plan-ahead    7;

               c-min-rate     4M;

        }



        net {

                max-buffers             36864;

                sndbuf-size             1048576; # bytes

                rcvbuf-size             2097152; # bytes

                allow-two-primaries     yes;

                cram-hmac-alg           "sha1";

                shared-secret           "<SECRET-REMOVED>";

                after-sb-0pri           discard-zero-changes;

                after-sb-1pri           discard-secondary;

                verify-alg              "md5";

                ping-timeout            10;

        }



        on clu01-prox01 {

                device /dev/drbd0;

                disk /dev/sda4;

                address <IP-REMOVED>:7787;

                meta-disk internal;

        }



        on clu01-prox02 {

                device /dev/drbd0;

                disk /dev/sda4;

                address <IP-REMOVED>:7787;

                meta-disk internal;

        }

}



dmesg on clu01-prox01:



[Tue Jan 18 12:16:39 2022] drbd sas: sock was shut down by peer

[Tue Jan 18 12:16:39 2022] drbd sas: peer( Primary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown )

[Tue Jan 18 12:16:39 2022] drbd sas: short read (expected size 16)

[Tue Jan 18 12:16:39 2022] drbd sas: ack_receiver terminated

[Tue Jan 18 12:16:39 2022] drbd sas: Terminating drbd_a_sas

[Tue Jan 18 12:16:39 2022] block drbd1: new current UUID 3EBD520B3C71E1F5:95CE24B4D7282A27:716814458AF45B2B:716714458AF45B2B

[Tue Jan 18 12:16:39 2022] drbd sas: Connection closed

[Tue Jan 18 12:16:39 2022] drbd sas: conn( BrokenPipe -> Unconnected )

[Tue Jan 18 12:16:39 2022] drbd sas: receiver terminated

[Tue Jan 18 12:16:39 2022] drbd sas: Restarting receiver thread

[Tue Jan 18 12:16:39 2022] drbd sas: receiver (re)started

[Tue Jan 18 12:16:39 2022] drbd sas: conn( Unconnected -> WFConnection )

[Tue Jan 18 12:16:40 2022] drbd sas: Handshake successful: Agreed network protocol version 101

[Tue Jan 18 12:16:40 2022] drbd sas: Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES.

[Tue Jan 18 12:16:40 2022] drbd sas: Peer authenticated using 20 bytes HMAC

[Tue Jan 18 12:16:40 2022] drbd sas: conn( WFConnection -> WFReportParams )

[Tue Jan 18 12:16:40 2022] drbd sas: Starting ack_recv thread (from drbd_r_sas [2860])

[Tue Jan 18 12:16:40 2022] block drbd1: drbd_sync_handshake:

[Tue Jan 18 12:16:40 2022] block drbd1: self 3EBD520B3C71E1F5:95CE24B4D7282A27:716814458AF45B2B:716714458AF45B2B bits:0 flags:0

[Tue Jan 18 12:16:40 2022] block drbd1: peer 94F2E64E6B58A76D:95CE24B4D7282A27:716814458AF45B2B:716714458AF45B2B bits:0 flags:0

[Tue Jan 18 12:16:40 2022] block drbd1: uuid_compare()=100 by rule 90

[Tue Jan 18 12:16:40 2022] block drbd1: helper command: /sbin/drbdadm initial-split-brain minor-1

[Tue Jan 18 12:16:40 2022] block drbd1: helper command: /sbin/drbdadm initial-split-brain minor-1 exit code 0 (0x0)

[Tue Jan 18 12:16:40 2022] block drbd1: Split-Brain detected but unresolved, dropping connection!

[Tue Jan 18 12:16:40 2022] block drbd1: helper command: /sbin/drbdadm split-brain minor-1

[Tue Jan 18 12:16:40 2022] block drbd1: helper command: /sbin/drbdadm split-brain minor-1 exit code 0 (0x0)

[Tue Jan 18 12:16:40 2022] drbd sas: conn( WFReportParams -> Disconnecting )

[Tue Jan 18 12:16:40 2022] drbd sas: error receiving ReportState, e: -5 l: 0!

[Tue Jan 18 12:16:40 2022] drbd sas: ack_receiver terminated

[Tue Jan 18 12:16:40 2022] drbd sas: Terminating drbd_a_sas

[Tue Jan 18 12:16:40 2022] drbd sas: Connection closed

[Tue Jan 18 12:16:40 2022] drbd sas: conn( Disconnecting -> StandAlone )

[Tue Jan 18 12:16:40 2022] drbd sas: receiver terminated

[Tue Jan 18 12:16:40 2022] drbd sas: Terminating drbd_r_sas
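For reference, recovery after such an event follows the usual manual split-brain procedure from the DRBD 8.4 documentation. A sketch for the sas resource (DRY_RUN defaults to 1 so the script only prints the commands instead of executing drbdadm; which node is the victim has to be decided per incident):

```shell
# Standard DRBD 8.4 manual split-brain recovery, sketched for the "sas"
# resource. DRY_RUN defaults to 1 so the script only prints each command;
# set DRY_RUN= to actually execute them on a real cluster.
RES=${RES:-sas}
DRY_RUN=${DRY_RUN:-1}
run() { if [ -n "$DRY_RUN" ]; then echo "$@"; else "$@"; fi; }

# On the split-brain victim (the node whose changes are discarded):
run drbdadm disconnect "$RES"
run drbdadm secondary "$RES"
run drbdadm connect --discard-my-data "$RES"

# On the surviving node, if it went StandAlone, simply reconnect:
# run drbdadm connect "$RES"
```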



dmesg on clu01-prox02:



[Tue Jan 18 12:16:39 2022] drbd sas: PingAck did not arrive in time.

[Tue Jan 18 12:16:39 2022] drbd sas: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )

[Tue Jan 18 12:16:39 2022] block drbd1: new current UUID 94F2E64E6B58A76D:95CE24B4D7282A27:716814458AF45B2B:716714458AF45B2B

[Tue Jan 18 12:16:39 2022] drbd sas: ack_receiver terminated

[Tue Jan 18 12:16:39 2022] drbd sas: Terminating drbd_a_sas

[Tue Jan 18 12:16:39 2022] drbd sas: Connection closed

[Tue Jan 18 12:16:39 2022] drbd sas: conn( NetworkFailure -> Unconnected )

[Tue Jan 18 12:16:39 2022] drbd sas: receiver terminated

[Tue Jan 18 12:16:39 2022] drbd sas: Restarting receiver thread

[Tue Jan 18 12:16:39 2022] drbd sas: receiver (re)started

[Tue Jan 18 12:16:39 2022] drbd sas: conn( Unconnected -> WFConnection )

[Tue Jan 18 12:16:40 2022] drbd sas: Handshake successful: Agreed network protocol version 101

[Tue Jan 18 12:16:40 2022] drbd sas: Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES.

[Tue Jan 18 12:16:40 2022] drbd sas: Peer authenticated using 20 bytes HMAC

[Tue Jan 18 12:16:40 2022] drbd sas: conn( WFConnection -> WFReportParams )

[Tue Jan 18 12:16:40 2022] drbd sas: Starting ack_recv thread (from drbd_r_sas [1841])

[Tue Jan 18 12:16:40 2022] block drbd1: drbd_sync_handshake:

[Tue Jan 18 12:16:40 2022] block drbd1: self 94F2E64E6B58A76D:95CE24B4D7282A27:716814458AF45B2B:716714458AF45B2B bits:0 flags:0

[Tue Jan 18 12:16:40 2022] block drbd1: peer 3EBD520B3C71E1F5:95CE24B4D7282A27:716814458AF45B2B:716714458AF45B2B bits:0 flags:0

[Tue Jan 18 12:16:40 2022] block drbd1: uuid_compare()=100 by rule 90

[Tue Jan 18 12:16:40 2022] block drbd1: helper command: /sbin/drbdadm initial-split-brain minor-1

[Tue Jan 18 12:16:40 2022] block drbd1: helper command: /sbin/drbdadm initial-split-brain minor-1 exit code 0 (0x0)

[Tue Jan 18 12:16:40 2022] block drbd1: Split-Brain detected but unresolved, dropping connection!

[Tue Jan 18 12:16:40 2022] block drbd1: helper command: /sbin/drbdadm split-brain minor-1

[Tue Jan 18 12:16:40 2022] block drbd1: helper command: /sbin/drbdadm split-brain minor-1 exit code 0 (0x0)

[Tue Jan 18 12:16:40 2022] drbd sas: conn( WFReportParams -> Disconnecting )

[Tue Jan 18 12:16:40 2022] drbd sas: error receiving ReportState, e: -5 l: 0!

[Tue Jan 18 12:16:40 2022] drbd sas: ack_receiver terminated

[Tue Jan 18 12:16:40 2022] drbd sas: Terminating drbd_a_sas

[Tue Jan 18 12:16:40 2022] drbd sas: Connection closed

[Tue Jan 18 12:16:40 2022] drbd sas: conn( Disconnecting -> StandAlone )

[Tue Jan 18 12:16:40 2022] drbd sas: receiver terminated

[Tue Jan 18 12:16:40 2022] drbd sas: Terminating drbd_r_sas
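On clu01-prox02 the trigger is "PingAck did not arrive in time." If I read the drbd.conf man page correctly, ping-timeout is given in tenths of a second, so the "ping-timeout 10" in our config gives the peer only one second to answer a keep-alive ping. As an experiment, the relevant timeouts could be relaxed, e.g. (values are just an illustration):

```
net {
        ping-int     10;   # seconds between keep-alive pings (default 10)
        ping-timeout 30;   # tenths of a second to wait for PingAck (default 5, i.e. 0.5 s)
        timeout      60;   # tenths of a second before the peer is considered dead (default 60)
}
```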



Best Regards,
Christian
