[DRBD-user] Problem with disk errors

Fri Jan 30 17:02:30 CET 2009

I have twice run into a strange problem with DRBD 8.3.  The server running as
secondary had a disk problem and went diskless.  The primary noticed this and
logged the secondary's transition from UpToDate to Diskless.  However, the
primary still had problems with timeouts afterwards.  My question: what do I
need to do so that the primary keeps running after the secondary has a disk
problem?

My drbd.conf is as follows:

global {
    usage-count yes;
}

common {
  syncer { rate 25M; }
}

resource drbd0 {

  protocol C;

  handlers {
    pri-on-incon-degr "echo o > /proc/sysrq-trigger ; halt -f";
    pri-lost-after-sb "echo o > /proc/sysrq-trigger ; halt -f";
    local-io-error "echo o > /proc/sysrq-trigger ; halt -f";
    outdate-peer "/usr/lib64/heartbeat/drbd-peer-outdater";
  }

  startup {
    degr-wfc-timeout 120;    # 2 minutes.
  }

  disk {
    on-io-error   detach;
  }

  net {
    max-buffers     2048;
    max-epoch-size  2048;
    after-sb-0pri disconnect;
    after-sb-1pri disconnect;
    after-sb-2pri disconnect;
    rr-conflict disconnect;
  }

  syncer {
    rate 25M;
    al-extents 257;
  }

  on bg-host-m1 {
    device     /dev/drbd0;
    disk       /dev/sdb2;
    address    172.20.0.1:7788;
    meta-disk  /dev/sdb1[0];
  }

  on bg-host-m2 {
    device    /dev/drbd0;
    disk      /dev/sdb2;
    address   172.20.0.2:7788;
    meta-disk /dev/sdb1[0];
  }
}

Here is a small part of /var/log/messages on the primary.  Also, please note
the Concurrent local write message:

Jan 29 11:10:32 bg-host-m2 iscsi_trgt: Logical Unit Reset (05) issued on tid:3 lun:2 by sid:1127000341282880 (Function Complete)
Jan 29 11:10:32 bg-host-m2 drbd0: istiod3[22036] Concurrent local write detected! [DISCARD L] new: 1471221576s +2048; pending: 1471221576s +2048
Jan 29 11:10:32 bg-host-m2 drbd0: istiod3[22036] Concurrent local write detected! [DISCARD L] new: 3606015618s +19968; pending: 3606015618s +19968
Jan 29 11:10:57 bg-host-m2 iscsi_trgt: Logical Unit Reset (05) issued on tid:3 lun:2 by sid:1127000341282880 (Function Complete)
Jan 29 11:10:59 bg-host-m2 ntpd[6722]: kernel time sync status change 4001
Jan 29 11:11:06 bg-host-m2 drbd0: Got NegAck packet. Peer is in troubles?
Jan 29 11:11:06 bg-host-m2 drbd0: Got NegAck packet. Peer is in troubles?
Jan 29 11:11:06 bg-host-m2 drbd0: pdsk( UpToDate -> Diskless )
Jan 29 11:11:06 bg-host-m2 drbd0: Creating new current UUID
Jan 29 11:11:06 bg-host-m2 drbd0: Got NegAck packet. Peer is in troubles?

Later on the primary:
Jan 29 11:11:06 bg-host-m2 drbd0: istiod3[22035] Concurrent local write detected! [DISCARD L] new: 3617118144s +3584; pending: 3617118144s +3584
Jan 29 11:12:05 bg-host-m2 nrpe[18563]: Could not read request from client, bailing out...
Jan 29 11:12:18 bg-host-m2 INFO: task istiod5:22060 blocked for more than 120 seconds.
Jan 29 11:12:18 bg-host-m2 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 29 11:12:18 bg-host-m2 istiod5       D 0000000000000000     0 22060      2
Jan 29 11:12:18 bg-host-m2 ffff88006a523d30 0000000000000046 0000000000000806 ffffffffa001764e
Jan 29 11:12:18 bg-host-m2 ffff88006a51abb0 ffff88006a51b140 ffff88006a51ade0 00000001a0018206
Jan 29 11:12:18 bg-host-m2 0000000000000246 ffff88012e5f9c80 ffff88012e379800 ffff88012c0237f0
Jan 29 11:12:18 bg-host-m2 Call Trace:
Jan 29 11:12:18 bg-host-m2 [<ffffffffa001764e>] megasas_make_sgl64+0x46/0x59 [megaraid_sas]

Here is the secondary:
Jan 29 11:11:06 bg-host-m1 sd 4:0:0:0: [sdb] Device not ready: ASC=0x4 ASCQ=0x0
Jan 29 11:11:06 bg-host-m1 end_request: I/O error, dev sdb, sector 1216507571
Jan 29 11:11:06 bg-host-m1 drbd0: disk( UpToDate -> Failed )
Jan 29 11:11:06 bg-host-m1 drbd0: Local IO failed. Detaching...
Jan 29 11:11:06 bg-host-m1 drbd0: disk( Failed -> Diskless )
Jan 29 11:11:06 bg-host-m1 drbd0: Notified peer that my disk is broken.

Then later on the secondary:
Jan 29 11:13:29 bg-host-m1 INFO: task drbd0_worker:32651 blocked for more than 120 seconds.
Jan 29 11:13:29 bg-host-m1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 29 11:13:29 bg-host-m1 drbd0_worker  D 000000000000000a     0 32651      2
Jan 29 11:13:29 bg-host-m1 ffff88008f589e10 0000000000000046 ffff8801088f0000 0000000000000000

[lots more info deleted from this event]

This keeps repeating.
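
As an aside, the DRBD state changes buried in excerpts like the ones above
(the disk( ... ) and pdsk( ... ) lines) can be pulled out with a few lines of
Python.  This is only a minimal sketch: the function name drbd_transitions is
hypothetical, and the regular expression assumes the "field( Old -> New )"
format seen in these logs, which may differ in other DRBD versions.

```python
import re

# Assumed format, taken from the log lines above: "pdsk( UpToDate -> Diskless )".
TRANSITION = re.compile(r"(\w+)\( (\w+) -> (\w+) \)")


def drbd_transitions(lines):
    """Yield (field, old_state, new_state) for each state change found.

    Hypothetical helper for illustration; it is not part of DRBD.
    """
    for line in lines:
        for field, old, new in TRANSITION.findall(line):
            yield field, old, new


# Sample lines copied from the excerpts in this message.
sample = [
    "Jan 29 11:11:06 bg-host-m2 drbd0: pdsk( UpToDate -> Diskless )",
    "Jan 29 11:11:06 bg-host-m1 drbd0: disk( UpToDate -> Failed )",
    "Jan 29 11:11:06 bg-host-m1 drbd0: Local IO failed. Detaching...",
    "Jan 29 11:11:06 bg-host-m1 drbd0: disk( Failed -> Diskless )",
]

for field, old, new in drbd_transitions(sample):
    print(f"{field}: {old} -> {new}")
```

Run against the full /var/log/messages from both nodes, this makes it easier
to line up when each side saw the disk go from UpToDate to Diskless.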

-- 
Terry Hull
Network Resource Group, Inc. President
