[DRBD-user] Freeze on primary

Thu Nov 23 11:05:36 CET 2006

I'm running a drbd cluster with heartbeat. Everything (dns, nfsd, imapd, 
postgres) worked fine until I started to handle large Files. This cause me to 
do my first post in this mailing list.

My cluster contains 2 absolute identical mashines:

Dual Opteron 2GHz on Tyan S2882
2GB Memory
3 SCSI-Disks (1 System 40GB, 2 DRBD 146GB, one of them changed to new 300GB)
2 GBit Networkports (1 external connect, 1 DRBD exclusiv use via crossover 
cable)

If I try to copy a large tar (i.e. 44GB) to the primary or in some cases untar 
it, I got to 90% a freeze on this mashine and the secondary heartbeat server 
starts to work as primary without any problems. After pushing the power 
button for resetting the primary, resyncing works well.
The primary server do well without DRBD.
Sometimes (NOT IN EVERY CASES) I found a lot of messages like this over a 
large time interval in /var/log/messages:

Nov 22 16:24:52 ledha1 kernel: scsi0: Address or Write Phase Parity Error 
Detected in TARG.
Nov 22 16:24:52 ledha1 kernel: drbd0: Resync done (total 72 sec; paused 0 sec; 
14940 K/sec)
Nov 22 16:24:52 ledha1 kernel: drbd0: drbd0_worker [3203]: cstate SyncTarget 
--> Connected
Nov 22 16:24:52 ledha1 kernel: sd 0:0:0:0: Attempting to queue an ABORT 
message:CDB: 0x2a 0x0 0x2 0x65 0xa1 0xaf 0x0 0x0 0x8 0x0
Nov 22 16:24:52 ledha1 kernel: scsi0: At time of recovery, card was not paused
Nov 22 16:24:52 ledha1 kernel: >>>>>>>>>>>>>>>>>> Dump Card State Begins 
<<<<<<<<<<<<<<<<<
Nov 22 16:24:52 ledha1 kernel: scsi0: Dumping Card State at program address 
0x24 Mode 0x22
Nov 22 16:24:52 ledha1 kernel: Card was paused
Nov 22 16:24:52 ledha1 kernel: INTSTAT[0x0] SELOID[0x1] SELID[0x10] 
HS_MAILBOX[0x0]
Nov 22 16:24:52 ledha1 kernel: INTCTL[0xc0]:(SWTMINTEN|SWTMINTMASK) 
SEQINTSTAT[0x10]:(SEQ_SWTMRTO)
Nov 22 16:24:52 ledha1 kernel: SAVED_MODE[0x11] DFFSTAT[0x33]:(CURRFIFO_NONE|
FIFO0FREE|FIFO1FREE)
Nov 22 16:24:52 ledha1 kernel: SCSISIGI[0x0]:(P_DATAOUT) SCSIPHASE[0x0] 
SCSIBUS[0x0]
Nov 22 16:24:52 ledha1 kernel: LASTPHASE[0x1]:(P_DATAOUT|P_BUSFREE) SCSISEQ0
[0x0]
Nov 22 16:24:52 ledha1 kernel: SCSISEQ1[0x12]:(ENAUTOATNP|ENRSELI) SEQCTL0
[0x0]
Nov 22 16:24:52 ledha1 kernel: SEQINTCTL[0x0] SEQ_FLAGS[0x0] SEQ_FLAGS2[0x4]:
(SELECTOUT_QFROZEN)
Nov 22 16:24:52 ledha1 kernel: QFREEZE_COUNT[0x6] KERNEL_QFREEZE_COUNT[0x6] 
MK_MESSAGE_SCB[0xff00]
Nov 22 16:24:52 ledha1 kernel: MK_MESSAGE_SCSIID[0xff] SSTAT0[0x0] SSTAT1[0x0]
Nov 22 16:24:52 ledha1 kernel: SSTAT2[0x0] SSTAT3[0x0] PERRDIAG[0x0] SIMODE1
[0xa4]:(ENSCSIPERR|ENSCSIRST|ENSELTIMO)
Nov 22 16:24:52 ledha1 kernel: LQISTAT0[0x0] LQISTAT1[0x0] LQISTAT2[0x0] 
LQOSTAT0[0x0]
Nov 22 16:24:52 ledha1 kernel: LQOSTAT1[0x0] LQOSTAT2[0xe1]:(LQOSTOP0|LQOPKT)
Nov 22 16:24:52 ledha1 kernel:
Nov 22 16:24:52 ledha1 kernel: SCB Count = 80 CMDS_PENDING = 1 LASTSCB 0xa 
CURRSCB 0x2f NEXTSCB 0xff80
Nov 22 16:24:52 ledha1 kernel: qinstart = 45596 qinfifonext = 45596
Nov 22 16:24:52 ledha1 kernel: QINFIFO:
Nov 22 16:24:52 ledha1 kernel: WAITING_TID_QUEUES:
Nov 22 16:24:52 ledha1 kernel: Pending list:
Nov 22 16:24:52 ledha1 kernel:  37 FIFO_USE[0x0] SCB_CONTROL[0x68]:
(STATUS_RCVD|TAG_ENB|DISCENB)
Nov 22 16:24:52 ledha1 kernel: SCB_SCSIID[0x7]
Nov 22 16:24:52 ledha1 kernel: Total 1
Nov 22 16:24:52 ledha1 kernel: Kernel Free SCB list: 47 10 70 57 67 72 2 0 42 
14 68 5 69 30 52 43 32 16 15 60 24 53 38 71 36 31 74 75 54 29 9 59 78 79 41 
40 61 45 7 1 25 20 26 44 22 46 8 19 12 48 51 56 55 58 23 39 50 64 49 18 6 35 
3 11 73 4 21 33 27 63 65 62 34 13 66 17 28 77 76
Nov 22 16:24:52 ledha1 kernel: Sequencer Complete DMA-inprog list:
Nov 22 16:24:52 ledha1 kernel: Sequencer Complete list:
Nov 22 16:24:52 ledha1 kernel: Sequencer DMA-Up and Complete list:
Nov 22 16:24:52 ledha1 kernel: Sequencer On QFreeze and Complete list:
Nov 22 16:24:52 ledha1 kernel:
Nov 22 16:24:52 ledha1 kernel:
Nov 22 16:24:52 ledha1 kernel: scsi0: FIFO0 Free, LONGJMP == 0x8271, SCB 0x2f
Nov 22 16:24:52 ledha1 kernel: SEQIMODE[0x3f]:(ENCFG4TCMD|ENCFG4ICMD|
ENCFG4TSTAT|ENCFG4ISTAT|ENCFG4DATA|ENSAVEPTRS)
Nov 22 16:24:52 ledha1 kernel: SEQINTSRC[0x0] DFCNTRL[0x4]:(DIRECTION) 
DFSTATUS[0x89]:(FIFOEMP|HDONE|PRELOAD_AVAIL)
Nov 22 16:24:52 ledha1 kernel: SG_CACHE_SHADOW[0x2]:(LAST_SEG) SG_STATE[0x0] 
DFFSXFRCTL[0x0]
Nov 22 16:24:52 ledha1 kernel: SOFFCNT[0x0] MDFFSTAT[0x5]:(FIFOFREE|DLZERO) 
SHADDR = 0x00, SHCNT = 0x0
Nov 22 16:24:52 ledha1 kernel: HADDR = 0x00, HCNT = 0x0 CCSGCTL[0x10]:
(SG_CACHE_AVAIL)
Nov 22 16:24:52 ledha1 kernel:
Nov 22 16:24:52 ledha1 kernel: scsi0: FIFO1 Free, LONGJMP == 0x8271, SCB 0x43
Nov 22 16:24:52 ledha1 kernel: SEQIMODE[0x3f]:(ENCFG4TCMD|ENCFG4ICMD|
ENCFG4TSTAT|ENCFG4ISTAT|ENCFG4DATA|ENSAVEPTRS)
Nov 22 16:24:52 ledha1 kernel: SEQINTSRC[0x0] DFCNTRL[0x4]:(DIRECTION) 
DFSTATUS[0x89]:(FIFOEMP|HDONE|PRELOAD_AVAIL)
Nov 22 16:24:52 ledha1 kernel: SG_CACHE_SHADOW[0x2]:(LAST_SEG) SG_STATE[0x0] 
DFFSXFRCTL[0x0]
Nov 22 16:24:52 ledha1 kernel: SOFFCNT[0x0] MDFFSTAT[0x5]:(FIFOFREE|DLZERO) 
SHADDR = 0x00, SHCNT = 0x0
Nov 22 16:24:52 ledha1 kernel: HADDR = 0x00, HCNT = 0x0 CCSGCTL[0x10]:
(SG_CACHE_AVAIL)
Nov 22 16:24:52 ledha1 kernel: LQIN: 0x8 0x0 0x0 0x2f 0x0 0x0 0x0 0x0 0x0 0x0 
0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0
Nov 22 16:24:52 ledha1 kernel: scsi0: LQISTATE = 0x0, LQOSTATE = 0x0, 
OPTIONMODE = 0x52
Nov 22 16:24:52 ledha1 kernel: scsi0: OS_SPACE_CNT = 0x20 MAXCMDCNT = 0x1
Nov 22 16:24:52 ledha1 kernel: scsi0: SAVED_SCSIID = 0x0 SAVED_LUN = 0x0
Nov 22 16:24:52 ledha1 kernel:
Nov 22 16:24:52 ledha1 kernel: SIMODE0[0xc]:(ENOVERRUN|ENIOERR)
Nov 22 16:24:52 ledha1 kernel: CCSCBCTL[0x0]
Nov 22 16:24:52 ledha1 kernel: scsi0: REG0 == 0x2f, SINDEX = 0x104, DINDEX = 
0x104
Nov 22 16:24:52 ledha1 kernel: scsi0: SCBPTR == 0x2f, SCB_NEXT == 0xff80, 
SCB_NEXT2 == 0xff2a
Nov 22 16:24:52 ledha1 kernel: CDB 2a 0 22 80 8 ec
Nov 22 16:24:52 ledha1 kernel: STACK: 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0
Nov 22 16:24:52 ledha1 kernel: <<<<<<<<<<<<<<<<< Dump Card State Ends 
>>>>>>>>>>>>>>>>>>

I didn't see this without DRBD until now.

I had this freeze under several configurations

My first configuration was based on Suse 9.3 (X86-64) and ext3 Filesystem.
Because of the unsupported state of the DRBD version on Suse 9.3 I changed it 
to
2nd: Suse 10.1 (X86-64), reiserfs, DRBD 0.7.18 from Suse Dist. (incl. exchange 
of one disk and memory)
3rd and last try was an upgrade on the Suse 10.1 to DRBD 0.7.22 by hand.

There in no problem in "normal use" with small data handling!

I did hardware tests without DRBD (running mprime on both processors of the 
mashine while copying a large file and untar another simultaneously, to get 
the server on his knees, but it did all well.
I also tried to fix the problem by slowing down the SCSI-Adapter from 320 to 
160, without success.

Here is my drbd.conf:

resource r0 {

  protocol C;

  # what should be done in case the cluster starts up in
  # degraded mode, but knows it has inconsistent data.
  incon-degr-cmd "echo '!DRBD! pri on incon-degr' | wall ; sleep 60 ; halt 
-f";

  startup {
    # Wait for connection timeout.
    # The init script blocks the boot process until the resources
    # are connected. This is so when the cluster manager starts later,
    # it does not see a resource with internal split-brain.
    # In case you want to limit the wait time, do it here.
    # Default is 0, which means unlimited. Unit is seconds.
    #
    # wfc-timeout  0;

    # Wait for connection timeout if this node was a degraded cluster.
    # In case a degraded cluster (= cluster with only one node left)
    # is rebooted, this timeout value is used.
    #
    degr-wfc-timeout 120;    # 2 minutes.
  }

  disk {
    # if the lower level device reports io-error you have the choice of
    #  "pass_on"  ->  Report the io-error to the upper layers.
    #                 Primary   -> report it to the mounted file system.
    #                 Secondary -> ignore it.
    #  "panic"    ->  The node leaves the cluster by doing a kernel panic.
    #  "detach"   ->  The node drops its backing storage device, and
    #                 continues in disk less mode.
    #
    on-io-error   detach;

    # Enables the use of the outdate-peer handler, as well as freezing
    # of IO while we are primary and the peer's disk state is unknown.
    #  The outdate-peer handler is used then to resove the situation
    #  as quick as possible.
    # BTW, becoming primary on a disconnected node may also trigger the
    # execution of the outdate-peer handler.
    # split-brain-fix;
  }

  net {
    # this is the size of the tcp socket send buffer
    # increase it _carefully_ if you want to use protocol A over a
    # high latency network with reasonable write throughput.
    # defaults to 2*65535; you might try even 1M, but if your kernel or
    # network driver chokes on that, you have been warned.
    # sndbuf-size 512k;

    # timeout       60;    #  6 seconds  (unit = 0.1 seconds)
    # connect-int   10;    # 10 seconds  (unit = 1 second)
    # ping-int      10;    # 10 seconds  (unit = 1 second)

    # Maximal number of requests (4K) to be allocated by DRBD.
    # The minimum is hardcoded to 32 (=128 kb).
    # For hight performance installations it might help if you
    # increase that number. These buffers are used to hold
    # datablocks while they are written to disk.
    #
    # max-buffers     2048;

    # The highest number of data blocks between two write barriers.
    # If you set this < 10 you might decrease your performance.
    # max-epoch-size  2048;

    # if some block send times out this many times, the peer is
    # considered dead, even if it still answers ping requests.
    # ko-count 4;

    # if the connection to the peer is lost you have the choice of
    #  "reconnect"   -> Try to reconnect (AKA WFConnection state)
    #  "stand_alone" -> Do not reconnect (AKA StandAlone state)
    #  "freeze_io"   -> Try to reconnect but freeze all IO until
    #                   the connection is established again.
    # on-disconnect reconnect;

    # If you want to use OCFS2/openGFS on top of DRBD enable
    # this optione, and only enable it if you are going to use
    # one of these filesystems. Do not enable it for ext2,
    # ext3,reiserFS,XFS,JFS etc...
    # allow-two-primaries;

    # This enables peer authentication. Without this everybody
    # on the network could connect to one of your DRBD nodes with
    # a program that emulates DRBD's protocoll and could suck off
    # all your data.
    # Specify one of the kernel's digest algorithms, e.g.:
    # md5, sha1, sha256, sha512, wp256, wp384, wp512, michael_mic ...
    # an a shared secret.
    # Authentication is only done once after the TCP connection
    # is establised, there are no disadvantages from using authentication,
    # therefore I suggest to enable it in any case.
    # cram-hmac-alg "sha1";
    # shared-secret "FooFunFactory";
  }

  syncer {
    # Limit the bandwith used by the resynchronisation process.
    # default unit is KB/sec; optional suffixes K,M,G are allowed
    #
    rate 30M;

    # All devices in one group are resynchronized parallel.
    # Resychronisation of groups is serialized in ascending order.
    # Put DRBD resources which are on different physical disks in one group.
    # Put DRBD resources on one physical disk in different groups.
    #
    group 1;

    # Configures the size of the active set. Each extent is 4M,
    # 257 Extents ~> 1GB active set size. In case your syncer
    # runs @ 10MB/sec, all resync after a primary's crash will last
    # 1GB / ( 10MB/sec ) ~ 102 seconds ~ One Minute and 42 Seconds.
    # BTW, the hash algorithm works best if the number of al-extents
    # is prime. (To test the worst case performace use a power of 2)
    al-extents 257;
  }

  on ledha1 {
    device     /dev/drbd0;
    disk       /dev/sdb1;
    address    10.30.56.51:7788;
    meta-disk  internal;

    # meta-disk is either 'internal' or '/dev/ice/name [idx]'
    #
    # You can use a single block device to store meta-data
    # of multiple DRBD's.
    # E.g. use meta-disk /dev/hde6[0]; and meta-disk /dev/hde6[1];
    # for two different resources. In this case the meta-disk
    # would need to be at least 256 MB in size.
    #
    # 'internal' means, that the last 128 MB of the lower device
    # are used to store the meta-data.
    # You must not give an index with 'internal'.
  }

  on ledha2 {
    device    /dev/drbd0;
    disk      /dev/sdb1;
    address   10.30.56.52:7788;
    meta-disk internal;
  }
}

#
# yes, you may also quote the resource name.
# but don't include whitespace, unless you mean it :)
#
resource "r1" {
  protocol C;
  incon-degr-cmd "echo '!DRBD! pri on incon-degr' | wall ; sleep 60 ; halt 
-f";
  startup {
    wfc-timeout         0;  ## Infinite!
    degr-wfc-timeout  120;  ## 2 minutes.
  }
  disk {
    on-io-error detach;
  }
  net {
    # timeout           60;
    # connect-int       10;
    # ping-int          10;
    # max-buffers     2048;
    # max-epoch-size  2048;
  }
  syncer {
    rate   30M;
    group   1; # sync concurrently with r0
  }

  on ledha1 {
    device      /dev/drbd1;
    disk        /dev/sdc1;
    address     10.30.56.51:7789;
    meta-disk   internal;
  }

  on ledha2 {
    device     /dev/drbd1;
    disk       /dev/sdc1;
    address    10.30.56.52:7789;
    meta-disk  internal;
  }
}

Any idea, what went wrong on my system? I'm realy stumped.

Andreas