Hello!

I'm testing DRBD behavior in different fail-over scenarios and have run into a very strange problem.

I have two servers running kernel 2.6.22.1 built from source, with a precompiled DRBD module (make patch-kernel). I tested with both DRBD-8.0.4 and DRBD-8.0.24. The DRBD configuration is identical on both nodes and follows below.

After a reboot both nodes are up and running, and the disks are in Secondary/Secondary and UpToDate/UpToDate state. Then I do the following:

    node1# drbdadm primary r0
    node1# mount /dev/drbd0 /mnt
    node1# dd if=/dev/urandom of=/mnt/randomfile bs=1M count=2048

In the middle of the dd I power node1 off and keep it off for the rest of the test. Then, on the second node:

    node2# drbdadm primary r0

(This works fine, because the state was cs:WFConnection st:Secondary/Unknown ds:UpToDate/DUnknown.)

And here is the problem:

    node2# fsck.ext3 /dev/drbd0
    e2fsck 1.40-WIP (14-Nov-2006)
    /dev/drbd0: recovering journal

At this point fsck freezes, with no disk activity. kill -9 doesn't work, reboot doesn't work, and any attempt to run "sync" hangs as well.

If I run

    dd if=/dev/drbd0 of=/dev/null bs=1M

before the fsck, it reads the whole disk without problems. But if I run it at the same time as the fsck, it freezes somewhere in the middle (I was able to read at least the first GB of the disk).

Then I reset node2 and repeat the same steps on it. The result is the same.

After a reset, running fsck directly on the low-level disk sdb1 works fine, without any delays. But if I then mount the file system through the drbd0 device, disk operations get stuck again at some point (I suspect when they touch a particular area of the disk).

I can reproduce the problem easily. If there is no load at the moment the Primary crashes, failover works as expected. I use matching versions of the module and the tools, of course.

I'm going to repeat the test with another kernel version, such as 2.6.18.8.

What am I doing wrong?
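For anyone repeating the test, here is a minimal sketch of how the state check before the promotion on node2 could be scripted. It only parses a status line in the cs:/st:/ds: format quoted above. The helper name drbd_field and the hard-coded sample line are assumptions made for illustration and are not part of the DRBD tools. On a real node the line would have to come from the DRBD status output instead.

```shell
#!/bin/sh
# Extract one field (cs, st or ds) from a DRBD 8.0-style status line.
# drbd_field is a hypothetical helper, not a DRBD command.
drbd_field() {
    # $1 = field name, $2 = full status line
    printf '%s\n' "$2" | tr ' ' '\n' | sed -n "s/^$1://p"
}

# Sample line copied from the report above (assumption: same format on disk).
line='cs:WFConnection st:Secondary/Unknown ds:UpToDate/DUnknown'

cs=$(drbd_field cs "$line")
ds=$(drbd_field ds "$line")
local_disk=${ds%%/*}    # part before the slash is the local disk state

if [ "$cs" = "WFConnection" ] && [ "$local_disk" = "UpToDate" ]; then
    echo "peer unreachable, local disk UpToDate: same state as in the report"
else
    echo "different state: cs=$cs ds=$ds"
fi
```

The check only confirms that node2 is in the same state as described before "drbdadm primary r0". It says nothing about why I/O hangs afterwards.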
#------------------------------------------------
resource r0 {
  protocol C;

  on node2 {
    device    /dev/drbd0;
    disk      /dev/sdb1;
    address   192.168.0.212:7788;
    meta-disk /dev/sdb2 [0];
  }

  on node1 {
    device    /dev/drbd0;
    disk      /dev/sdb1;
    address   192.168.0.211:7788;
    meta-disk /dev/sdb2 [0];
  }

  net {
    sndbuf-size   1m;
    ko-count      16;
    cram-hmac-alg sha1;
    shared-secret testtest;
    after-sb-0pri discard-older-primary;
    after-sb-1pri violently-as0p;
    after-sb-2pri violently-as0p;
    rr-conflict   violently;
  }

  disk {
    on-io-error pass_on;
  }

  syncer {
    rate       500M;
    al-extents 103;
  }

  startup {
    wfc-timeout      5;
    degr-wfc-timeout 5;
  }

  handlers {
    pri-on-incon-degr "echo DRDB pri-on-incon-degr | wall";
    pri-lost-after-sb "echo DRBD pri-lost-after-sb | wall";
    local-io-error    "echo DRDB local-io-error | wall";
  }
}
#------------------------------------------------

--
Igor