[DRBD-user] primary lockup

Tom Brown brown at esteem.com
Thu Feb 21 17:29:03 CET 2008


Hi,

Our primary node has been locking up almost every night. It first
happened about two weeks ago, then again on Monday, Wednesday, and this
morning. We didn't notice the Monday hang until Tuesday morning because
everybody was off work on Monday. Failover to the secondary node goes as
expected, except that the secondary is then not accessible via the
network. That is a separate problem.

What I'm trying to determine is whether the problem with the primary
indicates that old, cheap hardware is starting to go bad.

The log of the failure and our drbd.conf are given below. At the time
the failure occurs, a backup of the drbd2 device to an external USB
drive is running. The backup has been working fine for us for a few
months. No new software has been added to the system and the kernel has
not been upgraded.

Any ideas?

Thanks,
Tom

Distro: Debian Etch
Kernel: 2.6.22.6
DRBD: 8.0.6
Heartbeat: 2.1.2

/var/log/syslog:
Feb 21 00:04:18 zan kernel: hdc: dma_intr: status=0x51 { DriveReady
SeekComplete Error }
Feb 21 00:04:18 zan kernel: hdc: dma_intr: error=0x40
{ UncorrectableError }, LBAsect=1262256273, high=75, low=3965073,
sector=1262256267
Feb 21 00:04:18 zan kernel: ide: failed opcode was: unknown
Feb 21 00:04:18 zan kernel: end_request: I/O error, dev hdc, sector
1262256267
Feb 21 00:04:18 zan kernel: drbd2: got an _req_mod() errno of -5
Feb 21 00:04:18 zan kernel: drbd2: Local READ failed sec=1261436952s
size=4096
Feb 21 00:04:18 zan kernel: drbd2: disk( UpToDate -> Failed ) 
Feb 21 00:04:18 zan kernel: drbd2: Local IO failed. Detaching...
Feb 21 00:04:18 zan kernel: drbd2: helper command: /sbin/drbdadm
pri-on-incon-degr
Feb 21 00:04:18 zan kernel: drbd2: Sorry, I have no access to good data
anymore.
Feb 21 00:04:18 zan kernel: Buffer I/O error on device drbd2, logical
block 8210
Feb 21 00:04:18 zan kernel: lost page write due to I/O error on drbd2
Feb 21 00:04:18 zan kernel: ReiserFS: drbd2: warning: journal-837: IO
error during journal replay
Feb 21 00:04:18 zan kernel: REISERFS: abort (device drbd2): Write error
while updating journal header in flush_journal_list
Feb 21 00:04:18 zan kernel: REISERFS: Aborting journal for filesystem on
drbd2
Feb 21 00:04:18 zan kernel: drbd2: Sorry, I have no access to good data
anymore.
Feb 21 00:04:18 zan kernel: Buffer I/O error on device drbd2, logical
block 4284
Feb 21 00:04:18 zan kernel: lost page write due to I/O error on drbd2
Feb 21 00:04:18 zan kernel: drbd2: Sorry, I have no access to good data
anymore.
Feb 21 00:04:18 zan kernel: Buffer I/O error on device drbd2, logical
block 4285
Feb 21 00:04:18 zan kernel: lost page write due to I/O error on drbd2
Feb 21 00:04:18 zan kernel: drbd2: Sorry, I have no access to good data
anymore.
Feb 21 00:04:18 zan kernel: Buffer I/O error on device drbd2, logical
block 4286
Feb 21 00:04:18 zan kernel: lost page write due to I/O error on drbd2
Feb 21 00:04:18 zan kernel: drbd2: Sorry, I have no access to good data
anymore.
Feb 21 00:04:18 zan kernel: Buffer I/O error on device drbd2, logical
block 4287
Feb 21 00:04:18 zan kernel: lost page write due to I/O error on drbd2
Feb 21 00:04:18 zan kernel: Buffer I/O error on device drbd2, logical
block 4288
Feb 21 00:04:18 zan kernel: lost page write due to I/O error on drbd2
Feb 21 00:04:18 zan kernel: Buffer I/O error on device drbd2, logical
block 4289
Feb 21 00:04:18 zan kernel: lost page write due to I/O error on drbd2
Feb 21 00:04:18 zan kernel: Buffer I/O error on device drbd2, logical
block 4290
Feb 21 00:04:18 zan kernel: lost page write due to I/O error on drbd2
Feb 21 00:04:18 zan kernel: Buffer I/O error on device drbd2, logical
block 4291
Feb 21 00:04:18 zan kernel: lost page write due to I/O error on drbd2
Feb 21 00:04:18 zan kernel: Buffer I/O error on device drbd2, logical
block 4292
Feb 21 00:04:18 zan kernel: lost page write due to I/O error on drbd2
Feb 21 00:04:18 zan kernel: SysRq : HELP : loglevel0-8 reBoot Crashdump
tErm Full kIll saK showMem Nice powerOff showPc show-all-timers(Q) unRaw
Sync showTasks Unmount shoW-blocked-tasks 
Feb 21 07:48:45 zan syslogd 1.4.1#18: restart.
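
One way to check whether the same disk region fails on every nightly
lockup is to collect the sectors reported in the kernel's "end_request:
I/O error" lines. The following is only a minimal sketch in Python, not
part of the setup described above. The function name failing_sectors is
made up for illustration, and it assumes each log entry is on a single
line, as in the real syslog file rather than the wrapped excerpt above.

```python
import re

# Matches kernel lines such as:
#   "end_request: I/O error, dev hdc, sector 1262256267"
END_REQUEST = re.compile(r"end_request: I/O error, dev (\w+), sector (\d+)")


def failing_sectors(lines):
    """Return {device: sorted list of distinct failing sectors}.

    `lines` is any iterable of log lines (hypothetical helper; assumes
    one log entry per line).
    """
    found = {}
    for line in lines:
        match = END_REQUEST.search(line)
        if match:
            device, sector = match.group(1), int(match.group(2))
            found.setdefault(device, set()).add(sector)
    return {device: sorted(sectors) for device, sectors in found.items()}
```

It could be called as failing_sectors(open("/var/log/syslog")). If the
same few sectors on hdc show up on each night, that would point toward a
localized bad area on that disk rather than a more general problem, but
the sketch itself does not prove either.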

/etc/drbd.conf:
global {
    usage-count yes;
}

common {
  syncer { rate 22M; }
}

resource r0 {
  protocol C;
  handlers {
    pri-on-incon-degr "echo O > /proc/sysrq-trigger ; halt -f";
    pri-lost-after-sb "echo O > /proc/sysrq-trigger ; halt -f";
    local-io-error "echo O > /proc/sysrq-trigger ; halt -f";
    outdate-peer "/usr/sbin/drbd-peer-outdater";   
  }
  startup {
    wfc-timeout  20;
    degr-wfc-timeout 120;    # 2 minutes.
  }
  disk {
    on-io-error   detach;
  }
  net {
    after-sb-0pri disconnect;
    after-sb-1pri disconnect;
    after-sb-2pri disconnect;
    rr-conflict disconnect;
  }
  syncer {
    rate 22M;
    al-extents 257;
  }
  on zan {
    device     /dev/drbd0;
    disk       /dev/hdd1;
    address    192.168.1.3:7788;
    meta-disk  /dev/hdc1 [0];
  }
  on jayna {
    device     /dev/drbd0;
    disk       /dev/hdd1;
    address    192.168.1.4:7788;
    meta-disk  /dev/hdc1 [0];
  }
}

resource r1 {
  protocol C;
  handlers {
    pri-on-incon-degr "echo O > /proc/sysrq-trigger ; halt -f";
    pri-lost-after-sb "echo O > /proc/sysrq-trigger ; halt -f";
    local-io-error "echo O > /proc/sysrq-trigger ; halt -f";
    outdate-peer "/usr/sbin/drbd-peer-outdater";   
  }
  startup {
    wfc-timeout  20;
    degr-wfc-timeout 120;    # 2 minutes.
  }
  disk {
    on-io-error   detach;
  }
  net {
    after-sb-0pri disconnect;
    after-sb-1pri disconnect;
    after-sb-2pri disconnect;
    rr-conflict disconnect;
  }
  syncer {
    rate 25M;
    after "r0";
    al-extents 257;
  }
  on zan {
    device     /dev/drbd1;
    disk       /dev/hdd2;
    address    192.168.2.3:7789;
    meta-disk  /dev/hdc2 [0];
  }
  on jayna {
    device     /dev/drbd1;
    disk       /dev/hdd2;
    address    192.168.2.4:7789;
    meta-disk  /dev/hdc2 [0];
  }
}

resource r2 {
  protocol C;
  handlers {
    pri-on-incon-degr "echo O > /proc/sysrq-trigger ; halt -f";
    pri-lost-after-sb "echo O > /proc/sysrq-trigger ; halt -f";
    local-io-error "echo O > /proc/sysrq-trigger ; halt -f";
    outdate-peer "/usr/sbin/drbd-peer-outdater";   
  }
  startup {
    wfc-timeout  20;
    degr-wfc-timeout 120;    # 2 minutes.
  }
  disk {
    on-io-error   detach;
  }
  net {
    after-sb-0pri disconnect;
    after-sb-1pri disconnect;
    after-sb-2pri disconnect;
    rr-conflict disconnect;
  }
  syncer {
    rate 26M;
    al-extents 257;
  }
  on zan {
    device     /dev/drbd2;
    disk       /dev/hdc4;
    address    192.168.3.3:7790;
    meta-disk  /dev/hdc3 [0];
  }
  on jayna {
    device     /dev/drbd2;
    disk       /dev/hdc4;
    address    192.168.3.4:7790;
    meta-disk  /dev/hdc3 [0];
  }
}




