> Hi Lars,
>
> thanks for your reply.
>
> I have attached the requested information (I could not retrieve the
> output from /proc/sysrq-trigger, because I only have an SSH
> connection).
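
A side note on the sysrq output: writing to /proc/sysrq-trigger also works over an ssh connection, and the resulting dump goes to the kernel log rather than to the terminal, so it can usually be read back with dmesg as long as the machine is still responsive enough to run it. A minimal Python sketch, assuming root privileges and a kernel built with sysrq support; the helper names are made up for illustration, and a large task dump may not fit into the kernel ring buffer:

```python
import subprocess
from pathlib import Path

# Standard location of the sysrq trigger file on Linux.
SYSRQ_TRIGGER = Path("/proc/sysrq-trigger")


def request_task_dump(trigger=SYSRQ_TRIGGER, key="t"):
    """Write a sysrq key to the trigger file.

    The key 't' asks the kernel to dump the state of all tasks
    into the kernel log. Writing to the real file needs root.
    """
    trigger.write_text(key)


def read_kernel_log():
    """Return the kernel ring buffer as text, as printed by dmesg."""
    result = subprocess.run(
        ["dmesg"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

Calling request_task_dump() and then read_kernel_log() as root would be the intended use; the result can be written to a file and attached to a mail.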
>
>
>>> The question is if that could be related to DRBD?
>>
>> hard to say yes or no without more information
>> about your setup and the nature of your stress tests.
>> you have to investigate that yourself, I guess.
>>
>> drbd status during such periods?
>>
>> drbd messages or other "interesting" messages?
>>
>> does it help to disconnect (physically, if necessary) drbd?
>>
>> does it also happen with disconnected drbd (StandAlone)?
>
>
> I was running dbench for a duration of 7 hours on the primary node.
> After about half of that time, the described soft lockups occurred. I
> managed to reboot the machine remotely using /proc/sysrq-trigger.
> While I was trying to check / mount the XFS filesystem on the
> secondary, I received similar soft lockup errors that rendered the
> machine unusable and required a hard reboot. There are no unusual
> DRBD messages in the logfiles.
>
> Unfortunately, I neither have physical access to the machines nor
> time for testing, as there is a rather short timeframe for going into
> production. So I updated the kernel to 2.6.24-6 ("Debian Etch and a
> half") and manually compiled a newer version of my raid controller
> driver (megaraid_sas). Additionally, I updated to DRBD 8.0.14, as
> this version had become available in Debian backports. I haven't been
> able to reproduce the error since then.
>
>
>>> I'm getting more and more convinced that this issue is due to the
>>> "certified" scsi driver not working properly,
>>
>> why is that?
>
> Well, I'm using a slightly altered version of Debian Etch, which has
> been released for the specific hardware I use (FSC Primergy RX300S4
> with LSI 1078 RAID controller). However, the megaraid_sas driver for
> that specific controller is the regular Etch version, which in turn
> is taken from vanilla 2.6.18-6 and apparently has not been changed
> since. This version is over two years old and is known to have
> sporadic problems under heavy i/o, causing all kinds of symptoms in
> layers on top of the block device driver (e.g. filesystem errors).
>
> To conclude: I hope to have solved this issue for now, and it
> appears NOT to be related to DRBD. If I get any contradictory
> information, I'll get back to you (hopefully not ;-) ).
>
> Thanks again!
>
> Thomas
>
>
> ------------------------------------------------------------
> ps -eo pid,state,wchan:40,cmd:
>
>> PID S WCHAN CMD
>> 1 S - init [2]
>> 2 S migration_thread [migration/0]
>> 3 S ksoftirqd [ksoftirqd/0]
>> 4 S watchdog [watchdog/0]
>> 5 S migration_thread [migration/1]
>> 6 S ksoftirqd [ksoftirqd/1]
>> 7 S watchdog [watchdog/1]
>> 8 S migration_thread [migration/2]
>> 9 S ksoftirqd [ksoftirqd/2]
>> 10 S watchdog [watchdog/2]
>> 11 S migration_thread [migration/3]
>> 12 S ksoftirqd [ksoftirqd/3]
>> 13 S watchdog [watchdog/3]
>> 14 S migration_thread [migration/4]
>> 15 S ksoftirqd [ksoftirqd/4]
>> 16 S watchdog [watchdog/4]
>> 17 S migration_thread [migration/5]
>> 18 R - [ksoftirqd/5]
>> 19 R - [watchdog/5]
>> 20 R - [migration/6]
>> 21 R - [ksoftirqd/6]
>> 22 R - [watchdog/6]
>> 23 S migration_thread [migration/7]
>> 24 S ksoftirqd [ksoftirqd/7]
>> 25 S watchdog [watchdog/7]
>> 26 S worker_thread [events/0]
>> 27 S worker_thread [events/1]
>> 28 S worker_thread [events/2]
>> 29 S worker_thread [events/3]
>> 30 S worker_thread [events/4]
>> 31 R - [events/5]
>> 32 R - [events/6]
>> 33 S worker_thread [events/7]
>> 34 S worker_thread [khelper]
>> 35 S worker_thread [kthread]
>> 46 S worker_thread [kblockd/0]
>> 47 S worker_thread [kblockd/1]
>> 48 S worker_thread [kblockd/2]
>> 49 S worker_thread [kblockd/3]
>> 50 S worker_thread [kblockd/4]
>> 51 S worker_thread [kblockd/5]
>> 52 S worker_thread [kblockd/6]
>> 53 S worker_thread [kblockd/7]
>> 54 S worker_thread [kacpid]
>> 210 S hub_thread [khubd]
>> 212 S serio_thread [kseriod]
>> 292 D - [pdflush]
>> 293 D - [pdflush]
>> 294 S kswapd [kswapd0]
>> 295 S worker_thread [aio/0]
>> 296 S worker_thread [aio/1]
>> 297 S worker_thread [aio/2]
>> 298 S worker_thread [aio/3]
>> 299 S worker_thread [aio/4]
>> 300 S worker_thread [aio/5]
>> 301 S worker_thread [aio/6]
>> 302 S worker_thread [aio/7]
>> 861 S scsi_error_handler [scsi_eh_0]
>> 1147 S worker_thread [ata/0]
>> 1148 S worker_thread [ata/1]
>> 1149 S worker_thread [ata/2]
>> 1150 S worker_thread [ata/3]
>> 1151 S worker_thread [ata/4]
>> 1152 S worker_thread [ata/5]
>> 1153 S worker_thread [ata/6]
>> 1154 S worker_thread [ata/7]
>> 1155 S worker_thread [ata_aux]
>> 1170 D - [scsi_eh_1]
>> 1171 S scsi_error_handler [scsi_eh_2]
>> 1465 S kjournald [kjournald]
>> 1703 S - udevd --daemon
>> 2143 S worker_thread [kpsmoused]
>> 2515 S worker_thread [cqueue/0]
>> 2516 D drbd_req_state [cqueue/1]
>> 2517 S worker_thread [cqueue/2]
>> 2518 S worker_thread [cqueue/3]
>> 2519 S worker_thread [cqueue/4]
>> 2520 S worker_thread [cqueue/5]
>> 2521 S worker_thread [cqueue/6]
>> 2522 S worker_thread [cqueue/7]
>> 2558 S worker_thread [kmirrord]
>> 2583 S worker_thread [kcryptd/0]
>> 2584 S worker_thread [kcryptd/1]
>> 2585 S worker_thread [kcryptd/2]
>> 2586 S worker_thread [kcryptd/3]
>> 2587 S worker_thread [kcryptd/4]
>> 2588 S worker_thread [kcryptd/5]
>> 2589 S worker_thread [kcryptd/6]
>> 2590 S worker_thread [kcryptd/7]
>> 2625 S kjournald [kjournald]
>> 2758 S - /sbin/portmap
>> 2961 S - /sbin/syslogd
>> 2967 S syslog /sbin/klogd -x
>> 3070 S - /usr/sbin/acpid -c /etc/acpi/events -s /var/run/acpid.socket
>> 3215 S - /usr/sbin/inetd
>> 3277 S - /usr/lib/postfix/master
>> 3284 S - qmgr -l -t fifo -u
>> 3287 S - /usr/sbin/snmpd -Lsd -Lf /dev/null -u snmp -p /var/run/snmpd.pid 127.0.0.1
>> 3293 S - /usr/sbin/sshd
>> 3330 S - /sbin/rpc.statd
>> 3475 S stext /opt/SMAW/RAID/amDaemon
>> 3490 S - [drbd0_worker]
>> 3505 R - [drbd1_worker]
>> 3531 R - [drbd1_receiver]
>> 3542 R - [drbd1_asender]
>> 3555 S - /opt/SMAW/RAID/amDaemon
>> 3560 S - /usr/sbin/atd
>> 3568 S - /usr/sbin/cron
>> 4197 S - /sbin/getty 38400 tty2
>> 4198 S - /sbin/getty 38400 tty3
>> 4199 S - /sbin/getty 38400 tty4
>> 4200 S - /sbin/getty 38400 tty5
>> 4201 S - /sbin/getty 38400 tty6
>> 5756 S - /sbin/getty 38400 tty1
>> 7396 S worker_thread [xfslogd/0]
>> 7397 S worker_thread [xfslogd/1]
>> 7398 S worker_thread [xfslogd/2]
>> 7399 S worker_thread [xfslogd/3]
>> 7400 S worker_thread [xfslogd/4]
>> 7401 R - [xfslogd/5]
>> 7402 S worker_thread [xfslogd/6]
>> 7403 S worker_thread [xfslogd/7]
>> 7404 S worker_thread [xfsdatad/0]
>> 7405 S worker_thread [xfsdatad/1]
>> 7406 S worker_thread [xfsdatad/2]
>> 7407 S worker_thread [xfsdatad/3]
>> 7408 S worker_thread [xfsdatad/4]
>> 7409 S worker_thread [xfsdatad/5]
>> 7410 S worker_thread [xfsdatad/6]
>> 7411 S worker_thread [xfsdatad/7]
>> 7454 S - ha_logd: read process
>> 7462 S - ha_logd: write process
>> 7581 S 562640683272 heartbeat: master control process
>> 7584 S pipe_wait heartbeat: FIFO reader
>> 7585 S 279172841735 heartbeat: write: bcast eth1
>> 7586 S - heartbeat: read: bcast eth1
>> 7587 S 279172841735 heartbeat: write: mcast eth0
>> 7588 S - heartbeat: read: mcast eth0
>> 7589 S 279172874239 heartbeat: write: serial /dev/ttyS0
>> 7590 S - heartbeat: read: serial /dev/ttyS0
>> 7609 S - /usr/lib64/heartbeat/dopd
>> 1514 S stext /usr/sbin/eecd
>> 1518 S - /etc/srvmagt/SCS/SVRemoteConnector -ci/etc/srvmagt/SCS/Provider/xml -ssl_servcert=/etc/srvmagt/SCS/ssl/server_key.crt -ssl_capath=/etc/srvmagt/SCS/ssl
>> 1521 S - /usr/sbin/scagt
>> 1523 S stext /usr/sbin/sc2agt
>> 1525 S - /usr/sbin/busagt
>> 1527 D blk_execute_rq /usr/sbin/hdagt
>> 1530 S - /usr/sbin/unixagt
>> 1532 S - /usr/sbin/etheragt
>> 1534 S - /usr/sbin/biosagt
>> 1536 S - /usr/sbin/securagt
>> 1538 S stext /usr/sbin/statusagt
>> 1540 S - /usr/sbin/invagt
>> 1542 S stext /usr/sbin/vvagt
>> 1786 S - /usr/sbin/openvpn --writepid /var/run/openvpn.client.pid --daemon ovpn-client --status /var/run/openvpn.client.status 10 --cd /etc/openvpn --config /etc/openvpn/client.conf
>> 7088 S - [xfsbufd]
>> 7094 D drbd_al_begin_io [xfsbufd]
>> 7095 D - [xfssyncd]
>> 8333 S - /usr/bin/vmnet-bridge -d /var/run/vmnet-bridge-0.pid -n 0 -i eth0
>> 8598 S 11371755398898876679 /usr/sbin/vmware-authdlauncher
>> 8608 S wait /bin/sh /usr/bin/vmware-watchdog -s webAccess -u 30 -q 5 /usr/lib/vmware/webAccess/java/jre1.5.0_15/bin/webAccess -client -Xmx64m -XX:MinHeapFreeRatio=30 -XX:MaxHeapFreeRatio=30 -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager -Djava.endorsed.dirs=/usr/lib/vmware/webAccess/tomcat/apache-tomcat-6.0.16/common/endorsed -classpath /usr/lib/vmware/webAccess/tomcat/apache-tomcat-6.0.16/bin/bootstrap.jar:/usr/lib/vmware/webAccess/tomcat/apache-tomcat-6.0.16/bin/commons-logging-api.jar -Dcatalina.base=/usr/lib/vmware/webAccess/tomcat/apache-tomcat-6.0.16 -Dcatalina.home=/usr/lib/vmware/webAccess/tomcat/apache-tomcat-6.0.16 -Djava.io.tmpdir=/usr/lib/vmware/webAccess/tomcat/apache-tomcat-6.0.16/temp org.apache.catalina.startup.Bootstrap start
>> 8617 S stext /usr/lib/vmware/webAccess/java/jre1.5.0_15/bin/webAccess -client -Xmx64m -XX:MinHeapFreeRatio=30 -XX:MaxHeapFreeRatio=30 -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager -Djava.endorsed.dirs=/usr/lib/vmware/webAccess/tomcat/apache-tomcat-6.0.16/common/endorsed -classpath /usr/lib/vmware/webAccess/tomcat/apache-tomcat-6.0.16/bin/bootstrap.jar:/usr/lib/vmware/webAccess/tomcat/apache-tomcat-6.0.16/bin/commons-logging-api.jar -Dcatalina.base=/usr/lib/vmware/webAccess/tomcat/apache-tomcat-6.0.16 -Dcatalina.home=/usr/lib/vmware/webAccess/tomcat/apache-tomcat-6.0.16 -Djava.io.tmpdir=/usr/lib/vmware/webAccess/tomcat/apache-tomcat-6.0.16/temp org.apache.catalina.startup.Bootstrap start
>> 8782 Z exit [vmware-vmx] <defunct>
>> 8801 S rtc_read [vmware-rtc]
>> 16026 D - dbench -t 25000 20
>> 16027 D - dbench -t 25000 20
>> 16028 D - dbench -t 25000 20
>> 16029 R - dbench -t 25000 20
>> 16030 D - dbench -t 25000 20
>> 16031 D - dbench -t 25000 20
>> 16032 D - dbench -t 25000 20
>> 16033 D - dbench -t 25000 20
>> 16034 D - dbench -t 25000 20
>> 16035 D - dbench -t 25000 20
>> 16036 D - dbench -t 25000 20
>> 16037 D - dbench -t 25000 20
>> 16038 D - dbench -t 25000 20
>> 16039 D - dbench -t 25000 20
>> 16040 D - dbench -t 25000 20
>> 16041 D - dbench -t 25000 20
>> 16042 D - dbench -t 25000 20
>> 16043 D - dbench -t 25000 20
>> 16044 D - dbench -t 25000 20
>> 16045 D - dbench -t 25000 20
>> 25562 D - ls --color=auto -la
>> 25711 D flush_cpu_workqueue vmrun -T server -h https://127.0.0.1:8333/sdk -u "" -p "" stop [standard] testvm2/testvm.vmx soft
>> 25712 D - sshd: root@pts/3
>> 26119 D - sshd: root@pts/4
>> 26372 D - sshd: root@pts/5
>> 26498 D - sshd: root@pts/6
>> 26678 D - sshd: root@pts/7
>> 27455 D - sshd: root@pts/8
>> 28562 D - [bash]
>> 28691 D - shutdown -r 0 w
>> 29107 D - umount /srv/vmware/virtual_machines
>> 29109 D - sshd: root@pts/12
>> 29264 D - sshd: root@pts/13
>> 29580 D - sshd: root@pts/14
>> 30031 D - sshd: root@pts/15
>> 30366 D - sshd: root@pts/16
>> 30555 D - sshd: root@pts/17
>> 30825 D - [drbdsetup]
>> 31001 D - sshd: root@pts/19
>> 31575 D - sshd: root@pts/20
>> 31815 D - sshd: root@pts/21
>> 32168 D - sshd: root@pts/22
>> 2473 S pipe_wait /USR/SBIN/CRON
>> 2474 S wait /bin/sh -c test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily )
>> 2477 S wait /bin/sh -c test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily )
>> 2478 S - run-parts --report /etc/cron.daily
>> 2507 S wait /bin/sh /etc/cron.daily/find
>> 2509 S wait /bin/sh /usr/bin/updatedb
>> 2525 S wait /bin/sh /usr/bin/updatedb
>> 2528 S pipe_wait /usr/bin/sort -z -f
>> 2529 S pipe_wait /usr/lib/locate/frcode -0
>> 2533 S wait su nobody -s /bin/sh -c /usr/bin/find / -ignore_readdir_race \( -fstype NFS -o -fstype nfs -o -fstype nfs4 -o -fstype afs -o -fstype binfmt_misc -o -fstype proc -o -fstype smbfs -o -fstype autofs -o -fstype iso9660 -o -fstype ncpfs -o -fstype coda -o -fstype devpts -o -fstype ftpfs -o -fstype devfs -o -fstype mfs -o -fstype shfs -o -fstype sysfs -o -fstype cifs -o -fstype lustre_lite -o -fstype tmpfs -o -fstype usbfs -o -fstype udf -o -type d -regex '\(^/tmp$\)\|\(^/usr/tmp$\)\|\(^/var/tmp$\)\|\(^/afs$\)\|\(^/amd$\)\|\(^/alex$\)\|\(^/var/spool$\)\|\(^/sfs$\)\|\(^/media$\)' \) -prune -o -print0
>> 2534 D - /usr/bin/find / -ignore_readdir_race ( -fstype NFS -o -fstype nfs -o -fstype nfs4 -o -fstype afs -o -fstype binfmt_misc -o -fstype proc -o -fstype smbfs -o -fstype autofs -o -fstype iso9660 -o -fstype ncpfs -o -fstype coda -o -fstype devpts -o -fstype ftpfs -o -fstype devfs -o -fstype mfs -o -fstype shfs -o -fstype sysfs -o -fstype cifs -o -fstype lustre_lite -o -fstype tmpfs -o -fstype usbfs -o -fstype udf -o -type d -regex \(^/tmp$\)\|\(^/usr/tmp$\)\|\(^/var/tmp$\)\|\(^/afs$\)\|\(^/amd$\)\|\(^/alex$\)\|\(^/var/spool$\)\|\(^/sfs$\)\|\(^/media$\) ) -prune -o -print0
>> 2712 D - sshd: root@pts/23
>> 2960 D - sshd: root@pts/24
>> 3194 D - sshd: root@pts/25
>> 3252 D - sshd: root@pts/26
>> 4097 D - sshd: root@pts/27
>> 4750 D - sshd: root@pts/28
>> 4965 D - sshd: root@pts/29
>> 5274 D - sshd: root@pts/30
>> 5685 D - sshd: root@pts/31
>> 6150 D - sshd: root@pts/32
>> 6695 D - sshd: root@pts/33
>> 7601 S - pickup -l -t fifo -u -c
>> 8248 S - sshd: root@pts/34
>> 8257 S wait -bash
>> 8300 R - ps -eo pid,state,wchan:40,cmd
>>
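
To pull the interesting part out of a listing like the one above, it helps to filter for the tasks in state D (uninterruptible sleep) together with the kernel function they are waiting in. A rough Python sketch, assuming the output of `ps -eo pid,state,wchan:40,cmd` has been pasted into a string; `blocked_tasks` and `SAMPLE` are made-up names, and the sample rows are copied from the listing:

```python
from collections import Counter

# A few rows copied from the listing, including one wrapped line.
SAMPLE = """\
>> PID S WCHAN CMD
>> 292 D - [pdflush]
>> 2516 D drbd_req_state [cqueue/1]
>> 3070 S - /usr/sbin/acpid -
>> c /etc/acpi/events -s /var/run/acpid.socket
>> 7094 D drbd_al_begin_io [xfsbufd]
>> 1527 D blk_execute_rq /usr/sbin/hdagt
"""


def blocked_tasks(ps_text):
    """Yield (pid, wchan, cmd) for every task reported in state D."""
    for line in ps_text.splitlines():
        # Drop mail quote markers and leading blanks, then split into
        # at most four columns so the command keeps its spaces.
        fields = line.lstrip("> ").split(None, 3)
        if len(fields) < 4 or not fields[0].isdigit():
            continue  # header, blank line, or wrapped continuation
        pid, state, wchan, cmd = fields
        if state == "D":
            yield int(pid), wchan, cmd


def wait_channel_counts(ps_text):
    """Count blocked tasks per wait channel ('-' means none reported)."""
    return Counter(wchan for _, wchan, _ in blocked_tasks(ps_text))
```

Running `wait_channel_counts` over the full listing groups the stuck tasks by wait channel, which makes entries such as drbd_req_state, drbd_al_begin_io and blk_execute_rq easier to spot among the many tasks that report no wait channel at all.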