Note: "permalinks" may not be as permanent as we would like;
direct links to old sources may well be a few messages off.
> Hi Lars,
>
> thanks for your reply.
>
> I have attached the requested information (I could not retrieve the
> output from /proc/sysrq-trigger, because I only have an SSH
> connection).
>
>
>>> The question is if that could be related to DRBD?
>>
>> hard to say yes or no without more information
>> about your setup and the nature of your stress tests.
>> you have to investigate that yourself, I guess.
>>
>> drbd status during such periods?
>>
>> drbd messages or other "interesting" messages?
>>
>> does it help to disconnect (physically, if necessary) drbd?
>>
>> does it also happen with disconnected drbd (StandAlone)?
>
>
> I was running dbench for a duration of 7 hours on the primary node.
> After about half of that time, the described soft lockups occurred. I
> managed to reboot the machine remotely using /proc/sysrq-trigger.
> While I was trying to check / mount the XFS filesystem on the
> secondary, I received similar soft lockup errors that rendered the
> machine unusable and required a hard reboot. There are no unusual
> DRBD messages in the logfiles.
>
> Unfortunately, I have neither physical access to the machines nor
> time for testing, as there is a rather short timeframe for going
> into production. So I updated the kernel to 2.6.24-6 ("Debian Etch
> and a half") and manually compiled a newer version of my RAID
> controller driver (megaraid_sas). Additionally, I updated to DRBD
> 8.0.14, as this version had become available in Debian backports.
> I have not been able to reproduce the error since then.
>
>
>>> I'm getting more and more convinced that this issue is due to the
>>> "certified" scsi driver not working properly,
>>
>> why is that?
>
> Well, I'm using a slightly altered version of Debian Etch, which has
> been released for the specific hardware I use (FSC Primergy RX300S4
> with LSI 1078 RAID controller). However, the megaraid_sas driver for
> that specific controller is the regular Etch version, which in turn
> is taken from vanilla 2.6.18-6 and apparently has not been changed
> since. This version is over two years old and is known to have
> sporadic problems under heavy I/O, causing all kinds of symptoms in
> layers on top of the block device driver (e.g. filesystem errors).
>
> To conclude: I hope to have solved this issue for now, and it
> appears NOT to be related to DRBD. If I get any contradictory
> information, I'll get back to you (hopefully not ;-) ).
>
> Thanks again!
>
> Thomas
>
>
> ------------------------------------------------------------
> ps -eo pid,state,wchan:40,cmd:
>
>>   PID S WCHAN                 CMD
>>     1 S -                     init [2]
>>     2 S migration_thread      [migration/0]
>>     3 S ksoftirqd             [ksoftirqd/0]
>>     4 S watchdog              [watchdog/0]
>>     5 S migration_thread      [migration/1]
>>     6 S ksoftirqd             [ksoftirqd/1]
>>     7 S watchdog              [watchdog/1]
>>     8 S migration_thread      [migration/2]
>>     9 S ksoftirqd             [ksoftirqd/2]
>>    10 S watchdog              [watchdog/2]
>>    11 S migration_thread      [migration/3]
>>    12 S ksoftirqd             [ksoftirqd/3]
>>    13 S watchdog              [watchdog/3]
>>    14 S migration_thread      [migration/4]
>>    15 S ksoftirqd             [ksoftirqd/4]
>>    16 S watchdog              [watchdog/4]
>>    17 S migration_thread      [migration/5]
>>    18 R -                     [ksoftirqd/5]
>>    19 R -                     [watchdog/5]
>>    20 R -                     [migration/6]
>>    21 R -                     [ksoftirqd/6]
>>    22 R -                     [watchdog/6]
>>    23 S migration_thread      [migration/7]
>>    24 S ksoftirqd             [ksoftirqd/7]
>>    25 S watchdog              [watchdog/7]
>>    26 S worker_thread         [events/0]
>>    27 S worker_thread         [events/1]
>>    28 S worker_thread         [events/2]
>>    29 S worker_thread         [events/3]
>>    30 S worker_thread         [events/4]
>>    31 R -                     [events/5]
>>    32 R -                     [events/6]
>>    33 S worker_thread         [events/7]
>>    34 S worker_thread         [khelper]
>>    35 S worker_thread         [kthread]
>>    46 S worker_thread         [kblockd/0]
>>    47 S worker_thread         [kblockd/1]
>>    48 S worker_thread         [kblockd/2]
>>    49 S worker_thread         [kblockd/3]
>>    50 S worker_thread         [kblockd/4]
>>    51 S worker_thread         [kblockd/5]
>>    52 S worker_thread         [kblockd/6]
>>    53 S worker_thread         [kblockd/7]
>>    54 S worker_thread         [kacpid]
>>   210 S hub_thread            [khubd]
>>   212 S serio_thread          [kseriod]
>>   292 D -                     [pdflush]
>>   293 D -                     [pdflush]
>>   294 S kswapd                [kswapd0]
>>   295 S worker_thread         [aio/0]
>>   296 S worker_thread         [aio/1]
>>   297 S worker_thread         [aio/2]
>>   298 S worker_thread         [aio/3]
>>   299 S worker_thread         [aio/4]
>>   300 S worker_thread         [aio/5]
>>   301 S worker_thread         [aio/6]
>>   302 S worker_thread         [aio/7]
>>   861 S scsi_error_handler    [scsi_eh_0]
>>  1147 S worker_thread         [ata/0]
>>  1148 S worker_thread         [ata/1]
>>  1149 S worker_thread         [ata/2]
>>  1150 S worker_thread         [ata/3]
>>  1151 S worker_thread         [ata/4]
>>  1152 S worker_thread         [ata/5]
>>  1153 S worker_thread         [ata/6]
>>  1154 S worker_thread         [ata/7]
>>  1155 S worker_thread         [ata_aux]
>>  1170 D -                     [scsi_eh_1]
>>  1171 S scsi_error_handler    [scsi_eh_2]
>>  1465 S kjournald             [kjournald]
>>  1703 S -                     udevd --daemon
>>  2143 S worker_thread         [kpsmoused]
>>  2515 S worker_thread         [cqueue/0]
>>  2516 D drbd_req_state        [cqueue/1]
>>  2517 S worker_thread         [cqueue/2]
>>  2518 S worker_thread         [cqueue/3]
>>  2519 S worker_thread         [cqueue/4]
>>  2520 S worker_thread         [cqueue/5]
>>  2521 S worker_thread         [cqueue/6]
>>  2522 S worker_thread         [cqueue/7]
>>  2558 S worker_thread         [kmirrord]
>>  2583 S worker_thread         [kcryptd/0]
>>  2584 S worker_thread         [kcryptd/1]
>>  2585 S worker_thread         [kcryptd/2]
>>  2586 S worker_thread         [kcryptd/3]
>>  2587 S worker_thread         [kcryptd/4]
>>  2588 S worker_thread         [kcryptd/5]
>>  2589 S worker_thread         [kcryptd/6]
>>  2590 S worker_thread         [kcryptd/7]
>>  2625 S kjournald             [kjournald]
>>  2758 S -                     /sbin/portmap
>>  2961 S -                     /sbin/syslogd
>>  2967 S syslog                /sbin/klogd -x
>>  3070 S -                     /usr/sbin/acpid -c /etc/acpi/events -s /var/run/acpid.socket
>>  3215 S -                     /usr/sbin/inetd
>>  3277 S -                     /usr/lib/postfix/master
>>  3284 S -                     qmgr -l -t fifo -u
>>  3287 S -                     /usr/sbin/snmpd -Lsd -Lf /dev/null -u snmp -p /var/run/snmpd.pid 127.0.0.1
>>  3293 S -                     /usr/sbin/sshd
>>  3330 S -                     /sbin/rpc.statd
>>  3475 S stext                 /opt/SMAW/RAID/amDaemon
>>  3490 S -                     [drbd0_worker]
>>  3505 R -                     [drbd1_worker]
>>  3531 R -                     [drbd1_receiver]
>>  3542 R -                     [drbd1_asender]
>>  3555 S -                     /opt/SMAW/RAID/amDaemon
>>  3560 S -                     /usr/sbin/atd
>>  3568 S -                     /usr/sbin/cron
>>  4197 S -                     /sbin/getty 38400 tty2
>>  4198 S -                     /sbin/getty 38400 tty3
>>  4199 S -                     /sbin/getty 38400 tty4
>>  4200 S -                     /sbin/getty 38400 tty5
>>  4201 S -                     /sbin/getty 38400 tty6
>>  5756 S -                     /sbin/getty 38400 tty1
>>  7396 S worker_thread         [xfslogd/0]
>>  7397 S worker_thread         [xfslogd/1]
>>  7398 S worker_thread         [xfslogd/2]
>>  7399 S worker_thread         [xfslogd/3]
>>  7400 S worker_thread         [xfslogd/4]
>>  7401 R -                     [xfslogd/5]
>>  7402 S worker_thread         [xfslogd/6]
>>  7403 S worker_thread         [xfslogd/7]
>>  7404 S worker_thread         [xfsdatad/0]
>>  7405 S worker_thread         [xfsdatad/1]
>>  7406 S worker_thread         [xfsdatad/2]
>>  7407 S worker_thread         [xfsdatad/3]
>>  7408 S worker_thread         [xfsdatad/4]
>>  7409 S worker_thread         [xfsdatad/5]
>>  7410 S worker_thread         [xfsdatad/6]
>>  7411 S worker_thread         [xfsdatad/7]
>>  7454 S -                     ha_logd: read process
>>  7462 S -                     ha_logd: write process
>>  7581 S 562640683272          heartbeat: master control process
>>  7584 S pipe_wait             heartbeat: FIFO reader
>>  7585 S 279172841735          heartbeat: write: bcast eth1
>>  7586 S -                     heartbeat: read: bcast eth1
>>  7587 S 279172841735          heartbeat: write: mcast eth0
>>  7588 S -                     heartbeat: read: mcast eth0
>>  7589 S 279172874239          heartbeat: write: serial /dev/ttyS0
>>  7590 S -                     heartbeat: read: serial /dev/ttyS0
>>  7609 S -                     /usr/lib64/heartbeat/dopd
>>  1514 S stext                 /usr/sbin/eecd
>>  1518 S -                     /etc/srvmagt/SCS/SVRemoteConnector -ci/etc/srvmagt/SCS/Provider/xml -ssl_servcert=/etc/srvmagt/SCS/ssl/server_key.crt -ssl_capath=/etc/srvmagt/SCS/ssl
>>  1521 S -                     /usr/sbin/scagt
>>  1523 S stext                 /usr/sbin/sc2agt
>>  1525 S -                     /usr/sbin/busagt
>>  1527 D blk_execute_rq        /usr/sbin/hdagt
>>  1530 S -                     /usr/sbin/unixagt
>>  1532 S -                     /usr/sbin/etheragt
>>  1534 S -                     /usr/sbin/biosagt
>>  1536 S -                     /usr/sbin/securagt
>>  1538 S stext                 /usr/sbin/statusagt
>>  1540 S -                     /usr/sbin/invagt
>>  1542 S stext                 /usr/sbin/vvagt
>>  1786 S -                     /usr/sbin/openvpn --writepid /var/run/openvpn.client.pid --daemon ovpn-client --status /var/run/openvpn.client.status 10 --cd /etc/openvpn --config /etc/openvpn/client.conf
>>  7088 S -                     [xfsbufd]
>>  7094 D drbd_al_begin_io      [xfsbufd]
>>  7095 D -                     [xfssyncd]
>>  8333 S -                     /usr/bin/vmnet-bridge -d /var/run/vmnet-bridge-0.pid -n 0 -i eth0
>>  8598 S 11371755398898876679  /usr/sbin/vmware-authdlauncher
>>  8608 S wait                  /bin/sh /usr/bin/vmware-watchdog -s webAccess -u 30 -q 5 /usr/lib/vmware/webAccess/java/jre1.5.0_15/bin/webAccess -client -Xmx64m -XX:MinHeapFreeRatio=30 -XX:MaxHeapFreeRatio=30 -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager -Djava.endorsed.dirs=/usr/lib/vmware/webAccess/tomcat/apache-tomcat-6.0.16/common/endorsed -classpath /usr/lib/vmware/webAccess/tomcat/apache-tomcat-6.0.16/bin/bootstrap.jar:/usr/lib/vmware/webAccess/tomcat/apache-tomcat-6.0.16/bin/commons-logging-api.jar -Dcatalina.base=/usr/lib/vmware/webAccess/tomcat/apache-tomcat-6.0.16 -Dcatalina.home=/usr/lib/vmware/webAccess/tomcat/apache-tomcat-6.0.16 -Djava.io.tmpdir=/usr/lib/vmware/webAccess/tomcat/apache-tomcat-6.0.16/temp org.apache.catalina.startup.Bootstrap start
>>  8617 S stext                 /usr/lib/vmware/webAccess/java/jre1.5.0_15/bin/webAccess -client -Xmx64m -XX:MinHeapFreeRatio=30 -XX:MaxHeapFreeRatio=30 -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager -Djava.endorsed.dirs=/usr/lib/vmware/webAccess/tomcat/apache-tomcat-6.0.16/common/endorsed -classpath /usr/lib/vmware/webAccess/tomcat/apache-tomcat-6.0.16/bin/bootstrap.jar:/usr/lib/vmware/webAccess/tomcat/apache-tomcat-6.0.16/bin/commons-logging-api.jar -Dcatalina.base=/usr/lib/vmware/webAccess/tomcat/apache-tomcat-6.0.16 -Dcatalina.home=/usr/lib/vmware/webAccess/tomcat/apache-tomcat-6.0.16 -Djava.io.tmpdir=/usr/lib/vmware/webAccess/tomcat/apache-tomcat-6.0.16/temp org.apache.catalina.startup.Bootstrap start
>>  8782 Z exit                  [vmware-vmx] <defunct>
>>  8801 S rtc_read              [vmware-rtc]
>> 16026 D -                     dbench -t 25000 20
>> 16027 D -                     dbench -t 25000 20
>> 16028 D -                     dbench -t 25000 20
>> 16029 R -                     dbench -t 25000 20
>> 16030 D -                     dbench -t 25000 20
>> 16031 D -                     dbench -t 25000 20
>> 16032 D -                     dbench -t 25000 20
>> 16033 D -                     dbench -t 25000 20
>> 16034 D -                     dbench -t 25000 20
>> 16035 D -                     dbench -t 25000 20
>> 16036 D -                     dbench -t 25000 20
>> 16037 D -                     dbench -t 25000 20
>> 16038 D -                     dbench -t 25000 20
>> 16039 D -                     dbench -t 25000 20
>> 16040 D -                     dbench -t 25000 20
>> 16041 D -                     dbench -t 25000 20
>> 16042 D -                     dbench -t 25000 20
>> 16043 D -                     dbench -t 25000 20
>> 16044 D -                     dbench -t 25000 20
>> 16045 D -                     dbench -t 25000 20
>> 25562 D -                     ls --color=auto -la
>> 25711 D flush_cpu_workqueue   vmrun -T server -h https://127.0.0.1:8333/sdk -u "" -p "" stop [standard] testvm2/testvm.vmx soft
>> 25712 D -                     sshd: root@pts/3
>> 26119 D -                     sshd: root@pts/4
>> 26372 D -                     sshd: root@pts/5
>> 26498 D -                     sshd: root@pts/6
>> 26678 D -                     sshd: root@pts/7
>> 27455 D -                     sshd: root@pts/8
>> 28562 D -                     [bash]
>> 28691 D -                     shutdown -r 0 w
>> 29107 D -                     umount /srv/vmware/virtual_machines
>> 29109 D -                     sshd: root@pts/12
>> 29264 D -                     sshd: root@pts/13
>> 29580 D -                     sshd: root@pts/14
>> 30031 D -                     sshd: root@pts/15
>> 30366 D -                     sshd: root@pts/16
>> 30555 D -                     sshd: root@pts/17
>> 30825 D -                     [drbdsetup]
>> 31001 D -                     sshd: root@pts/19
>> 31575 D -                     sshd: root@pts/20
>> 31815 D -                     sshd: root@pts/21
>> 32168 D -                     sshd: root@pts/22
>>  2473 S pipe_wait             /USR/SBIN/CRON
>>  2474 S wait                  /bin/sh -c test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily )
>>  2477 S wait                  /bin/sh -c test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily )
>>  2478 S -                     run-parts --report /etc/cron.daily
>>  2507 S wait                  /bin/sh /etc/cron.daily/find
>>  2509 S wait                  /bin/sh /usr/bin/updatedb
>>  2525 S wait                  /bin/sh /usr/bin/updatedb
>>  2528 S pipe_wait             /usr/bin/sort -z -f
>>  2529 S pipe_wait             /usr/lib/locate/frcode -0
>>  2533 S wait                  su nobody -s /bin/sh -c /usr/bin/find / -ignore_readdir_race \( -fstype NFS -o -fstype nfs -o -fstype nfs4 -o -fstype afs -o -fstype binfmt_misc -o -fstype proc -o -fstype smbfs -o -fstype autofs -o -fstype iso9660 -o -fstype ncpfs -o -fstype coda -o -fstype devpts -o -fstype ftpfs -o -fstype devfs -o -fstype mfs -o -fstype shfs -o -fstype sysfs -o -fstype cifs -o -fstype lustre_lite -o -fstype tmpfs -o -fstype usbfs -o -fstype udf -o -type d -regex '\(^/tmp$\)\|\(^/usr/tmp$\)\|\(^/var/tmp$\)\|\(^/afs$\)\|\(^/amd$\)\|\(^/alex$\)\|\(^/var/spool$\)\|\(^/sfs$\)\|\(^/media$\)' \) -prune -o -print0
>>  2534 D -                     /usr/bin/find / -ignore_readdir_race ( -fstype NFS -o -fstype nfs -o -fstype nfs4 -o -fstype afs -o -fstype binfmt_misc -o -fstype proc -o -fstype smbfs -o -fstype autofs -o -fstype iso9660 -o -fstype ncpfs -o -fstype coda -o -fstype devpts -o -fstype ftpfs -o -fstype devfs -o -fstype mfs -o -fstype shfs -o -fstype sysfs -o -fstype cifs -o -fstype lustre_lite -o -fstype tmpfs -o -fstype usbfs -o -fstype udf -o -type d -regex \(^/tmp$\)\|\(^/usr/tmp$\)\|\(^/var/tmp$\)\|\(^/afs$\)\|\(^/amd$\)\|\(^/alex$\)\|\(^/var/spool$\)\|\(^/sfs$\)\|\(^/media$\) ) -prune -o -print0
>>  2712 D -                     sshd: root@pts/23
>>  2960 D -                     sshd: root@pts/24
>>  3194 D -                     sshd: root@pts/25
>>  3252 D -                     sshd: root@pts/26
>>  4097 D -                     sshd: root@pts/27
>>  4750 D -                     sshd: root@pts/28
>>  4965 D -                     sshd: root@pts/29
>>  5274 D -                     sshd: root@pts/30
>>  5685 D -                     sshd: root@pts/31
>>  6150 D -                     sshd: root@pts/32
>>  6695 D -                     sshd: root@pts/33
>>  7601 S -                     pickup -l -t fifo -u -c
>>  8248 S -                     sshd: root@pts/34
>>  8257 S wait                  -bash
>>  8300 R -                     ps -eo pid,state,wchan:40,cmd
>>
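
With a listing this long, it can help to pull out only the processes in
uninterruptible sleep (state D) and the kernel function they are waiting
in. The following is a minimal sketch of that, not something from the
thread: it assumes the quote markers have already been stripped and that
each process occupies one line of `ps -eo pid,state,wchan:40,cmd` output.
The helper names (parse_ps_line, blocked_processes) and the SAMPLE rows,
which are copied from the listing above, are chosen here purely for
illustration.

```python
from collections import Counter


def parse_ps_line(line):
    """Split one `ps -eo pid,state,wchan:40,cmd` row into its four fields.

    Returns None for the header row or anything that is not a process row.
    """
    parts = line.split(None, 3)  # cmd may contain spaces, so split 3 times only
    if len(parts) < 4 or not parts[0].isdigit():
        return None
    pid, state, wchan, cmd = parts
    return int(pid), state, wchan, cmd


def blocked_processes(lines):
    """Return the rows whose state is D (uninterruptible sleep)."""
    rows = (parse_ps_line(line) for line in lines)
    return [row for row in rows if row is not None and row[1] == "D"]


# A few rows taken from the listing above (hypothetical input for the sketch).
SAMPLE = """\
  PID S WCHAN CMD
  292 D - [pdflush]
 1170 D - [scsi_eh_1]
 1527 D blk_execute_rq /usr/sbin/hdagt
 2516 D drbd_req_state [cqueue/1]
 3490 S - [drbd0_worker]
 7094 D drbd_al_begin_io [xfsbufd]
16029 R - dbench -t 25000 20
""".splitlines()

blocked = blocked_processes(SAMPLE)

# Count how many blocked processes wait in each kernel function.
by_wchan = Counter(wchan for _, _, wchan, _ in blocked)

for pid, state, wchan, cmd in blocked:
    print(pid, wchan, cmd)
```

On a live system the same functions could be fed the output of
`ps -eo pid,state,wchan:40,cmd` directly. A wchan of "-" only means ps
could not resolve a wait channel, so such rows say little about where
the process is stuck.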