[DRBD-user] Server reboot on DRBD heavy load problem

Proskurin Kirill proskurin-kv at fxclub.org
Thu Sep 23 08:51:48 CEST 2010



Anyone?...

On 20/09/10 15:43, Proskurin Kirill wrote:
> Hello all.
>
> I have been fighting a strange problem for more than three weeks.
>
> What we have:
> 2x Dell 2950 with Debian 5.0 and the 2.6.32-bpo.5-amd64 kernel from
> backports (DRBD included) + OCFS2
>
> I generate heavy load with iozone on the OCFS2 partition:
> iozone -RK -t 4 -s 10g -i 0 -i 1 -i 2 -b /tmp/`hostname`.xls
> on both nodes.
>
> After 1-3 hours both servers reboot. It must be DRBD or OCFS2 related,
> because it does not happen on a normal partition. The OCFS2 developers
> looked at the stack trace I caught and said that it is not an OCFS2 problem.
> I tried to send you a screenshot of the stack trace (it is 32 KB) but ran
> into the 40 KB message limit. Below is part of the trace I got on the console.
>
> I started to think that it is the hardware or the system. I tried the
> 2.6.26 kernel and upgrading to testing; neither helped.
>
> So it is either hardware or DRBD. Could you please help me find out where
> the problem is?
>
>
>
> Configs below:
>
> mail01:~# cat /etc/drbd.d/drbd0.res
> resource drbd0 {
>
> on mail01.fxclub.org {
> device /dev/drbd0;
> disk /dev/sda9;
> address 192.168.1.1:7789;
> meta-disk internal;
> }
>
> on mail02.fxclub.org {
> device /dev/drbd0;
> disk /dev/sda9;
> address 192.168.1.2:7789;
> meta-disk internal;
> }
>
> }
>
>
> mail01:~# cat /etc/drbd.d/global_common.conf
> global {
> usage-count yes;
> # minor-count dialog-refresh disable-ip-verification
> }
>
> common {
> protocol C;
>
> handlers {
> pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
> pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
> local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
> outdate-peer "/usr/lib/heartbeat/drbd-peer-outdater -t 5";
> # fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
> split-brain "/usr/lib/drbd/notify-split-brain.sh root";
> # out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";
> # before-resync-target "/usr/lib/drbd/snapshot-resync-target-lvm.sh -p 15 -- -c 16k";
> # after-resync-target /usr/lib/drbd/unsnapshot-resync-target-lvm.sh;
> }
>
> startup {
> wfc-timeout 60;
> degr-wfc-timeout 30;
> outdated-wfc-timeout 15;
> become-primary-on both;
> # wait-after-sb;
> }
>
> disk {
> fencing resource-and-stonith;
> no-disk-flushes;
> no-md-flushes;
> no-disk-barrier;
> # on-io-error fencing use-bmbv no-disk-barrier no-disk-flushes
> # no-disk-drain no-md-flushes max-bio-bvecs
> }
>
> net {
> cram-hmac-alg sha1;
> shared-secret "password";
> allow-two-primaries;
> ping-timeout 20;
> after-sb-0pri discard-zero-changes;
> after-sb-1pri discard-secondary;
> after-sb-2pri disconnect;
> data-integrity-alg sha1;
> # Tuning
> max-buffers 8000;
> max-epoch-size 8000;
> sndbuf-size 0;
> # sndbuf-size rcvbuf-size timeout connect-int ping-int ping-timeout max-buffers
> # max-epoch-size ko-count allow-two-primaries cram-hmac-alg shared-secret
> # after-sb-0pri after-sb-1pri after-sb-2pri data-integrity-alg no-tcp-cork
> }
>
> syncer {
> rate 60M;
> al-extents 3389;
> # rate after al-extents use-rle cpu-mask verify-alg csums-alg
> }
> }
>
> P.S. I started to think it could be the handlers and commented them out;
> that did not help.
>
> Message from syslogd at mail02 at Sep 16 09:03:19 ...
> kernel:[92182.173794] ------------[ cut here ]------------
>
> Message from syslogd at mail02 at Sep 16 09:03:19 ...
> kernel:[92182.173872] invalid opcode: 0000 [#1] SMP
>
> Message from syslogd at mail02 at Sep 16 09:03:19 ...
> kernel:[92182.173899] last sysfs file: /sys/module/ocfs2/refcnt
>
>
> Message from syslogd at mail01 at Sep 16 15:18:37 ...
> kernel:[ 1432.310479] ------------[ cut here ]------------
>
> Message from syslogd at mail01 at Sep 16 15:18:37 ...
> kernel:[ 1432.310648] invalid opcode: 0000 [#1] SMP
>
> Message from syslogd at mail01 at Sep 16 15:18:37 ...
> kernel:[ 1432.310801] last sysfs file: /sys/fs/o2cb/interface_revision
>
> Message from syslogd at mail01 at Sep 16 15:18:37 ...
> kernel:[ 1432.312251] Stack:
>
> Message from syslogd at mail01 at Sep 16 15:18:37 ...
> kernel:[ 1432.312251] Call Trace:
>
> Message from syslogd at mail01 at Sep 16 15:18:37 ...
> kernel:[ 1432.312251] Code: 83 c3 08 48 83 3b 00 eb ec 48 83 fd 10 0f 86
> 89 00 00 00 48 89 ef e8 b9 e8 ff ff 48 89 c7 48 8b 00 84 c0 78 13 66 a9
> 00 c0 75 04 <0f> 0b eb fe 5b 5d 41 5c e9 94 58 fd ff 48 8b 4c 24 18 4c
> 8b 4f
>
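
A note on reading the "Code:" line in the oops above: the kernel puts angle brackets around the byte at the faulting instruction pointer. Here that byte is 0f and the next one is 0b. Together they encode the x86 ud2 instruction, which the kernel's BUG() macro emits. That matches the "cut here" and "invalid opcode" messages, so this looks like a kernel assertion failure rather than a random crash. It does not say which subsystem hit the assertion. The sketch below only joins the wrapped bytes and checks for that pattern; the names CODE_LINE, faulting_bytes and is_ud2 are made up for illustration and are not part of any kernel tool.

```python
import re

# "Code:" bytes copied from the mail01 oops above, with the e-mail line
# wraps joined back into one string (illustrative constant name).
CODE_LINE = (
    "83 c3 08 48 83 3b 00 eb ec 48 83 fd 10 0f 86 "
    "89 00 00 00 48 89 ef e8 b9 e8 ff ff 48 89 c7 48 8b 00 84 c0 78 13 66 a9 "
    "00 c0 75 04 <0f> 0b eb fe 5b 5d 41 5c e9 94 58 fd ff 48 8b 4c 24 18 4c "
    "8b 4f"
)


def faulting_bytes(code_line, count=2):
    """Return `count` hex bytes starting at the byte marked with <..>.

    Returns an empty list if no byte is marked.
    """
    tokens = code_line.split()
    for i, tok in enumerate(tokens):
        m = re.fullmatch(r"<([0-9a-f]{2})>", tok)
        if m:
            return [m.group(1)] + tokens[i + 1:i + count]
    return []


def is_ud2(code_line):
    """True if the marked instruction starts with 0f 0b (x86 ud2)."""
    return faulting_bytes(code_line) == ["0f", "0b"]


print(faulting_bytes(CODE_LINE))  # ['0f', '0b']
print(is_ud2(CODE_LINE))          # True
```

To find out which check failed, the complete oops is still needed, including the RIP line and the call trace, which the syslogd broadcast messages above leave out.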


-- 
Best regards,
Proskurin Kirill


