Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
I'll respond to my own post, in case it helps someone else.

While troubleshooting this mess, I tried running a dd on the drbd device on the secondary node. It told me that the device was unavailable. I then tried to down the resources on the secondary node ("drbdadm down all"), and it gave a message about the primary device refusing the action. Sorry, I did not turn on logging during this to capture the exact errors. Frustrated, I just rebooted the secondary box. When I did this, I lost access to all of my iSCSI and NFS LUNs. A few minutes later the primary drbd server crashed. Heartbeat then failed all of the resources over to the secondary box, and the secondary started trying to sync back to the primary. This is when I panicked, thinking that it was overwriting good data with bad. I immediately stopped the sync and unplugged the sync cable, then brought everything down on both nodes to start doing some forensics.

Before the event, /proc/drbd on the primary said it was Primary and UpToDate, with no oos. On the secondary it said Secondary with an oos count of 40000000. I then used drbdadm to make each side invalidate the other's copy in turn, and just brought its device online to inspect it. I looked at the primary first, since this is where everything was mounting from. It turns out its data was at least 4 days old. I shut it back down and looked at the secondary: it had the latest copy of the data. So I invalidated the primary's data and allowed the secondary to sync back to the primary. Everything is back to normal now, and read speeds from the drbd device are 200+ MB/s, as I expected.

I am not sure if this is possible, but my guess is that the primary node lost access to its local copy of the data and was only updating the remote copy, so all read traffic had to go to the remote server to be serviced. This makes sense in my head, but I am not sure how the drbd code is set up, or whether that is even a possibility. It looks like drbd was acting correctly, and had I let it fail over, it would have done the right thing; I just was not willing to take that chance.

Hope this helps someone.
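For reference, here is roughly the command sequence I ended up using. I'm reconstructing it from memory (no logs, as mentioned above), so treat it as a sketch rather than a transcript; the resource name NAS comes from the config quoted below:

    # On both nodes: check who claims what.
    cat /proc/drbd          # role (Primary/Secondary), disk state, oos count
    drbdadm dstate NAS      # disk state of local/peer backing devices

    # On the node whose copy turned out to be stale (the old primary here):
    drbdadm invalidate NAS  # discard the local copy; forces a full resync from the peer

    # Reconnect, watch it sync back, then promote once it is UpToDate.
    drbdadm connect NAS
    cat /proc/drbd          # watch resync progress
    drbdadm primary NAS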
> Ok, this is my first post to this list, so please be easy on me. I
> have set up two openfiler nodes, each with a 4-drive software SATA
> RAID0 array that mirrors from one node to the other. There are only
> the 4 drives in each host, so we partitioned off a RAID10 slice for
> the boot, then we created a RAID0 slice for the DRBD data. We then
> use that volume to create iSCSI and NFS shares to serve up to other
> hosts.
>
> I have been trying to track down why my performance seems to be so
> bad. I then ran across the following test, and it leaves me
> scratching my head.
>
> On the primary server, I run a dd against the drbd1 device to just
> read it in:
>
> dd if=/dev/drbd1 of=/dev/null bs=1M &
>
> I then run iostat -k 2 to check the performance. I see long periods
> (2-10 seconds) of NO activity, then brief periods of 25-30 MB/s. I
> tried disabling the remote node, and this does not improve
> performance.
>
> If I run the same command against the underlying md2 RAID disk, I get
> a consistent 200-240 MB/s. I expected there to be a write penalty,
> but I am scratching my head over the read penalty. By the time we get
> the iSCSI out to the clients, I am getting maybe 30 MB/s, and
> averaging about 15 MB/s.
>
> Here is my drbd.conf:
>
> global {
>   # minor-count 64;
>   # dialog-refresh 5; # 5 seconds
>   # disable-ip-verification;
>   usage-count ask;
> }
>
> common {
>   syncer {
>     rate 100M;
>     al-extents 257;
>   }
>
>   net {
>     unplug-watermark 128;
>   }
> }
>
> resource meta {
>
>   protocol C;
>
>   handlers {
>     pri-on-incon-degr "echo O > /proc/sysrq-trigger ; halt -f";
>     pri-lost-after-sb "echo O > /proc/sysrq-trigger ; halt -f";
>     local-io-error "echo O > /proc/sysrq-trigger ; halt -f";
>     outdate-peer "/usr/lib64/heartbeat/drbd-peer-outdater";
>   }
>
>   startup {
>     # wfc-timeout 0;
>     degr-wfc-timeout 120; # 2 minutes.
>   }
>
>   disk {
>     on-io-error detach;
>     fencing resource-only;
>   }
>
>   net {
>     after-sb-0pri disconnect;
>     after-sb-1pri disconnect;
>     after-sb-2pri disconnect;
>     rr-conflict disconnect;
>   }
>
>   syncer {
>     # rate 10M;
>     # after "r2";
>     al-extents 257;
>   }
>
>   device /dev/drbd0;
>   disk /dev/rootvg/meta;
>   meta-disk internal;
>
>   on stg1 {
>     address 1.2.5.80:7788;
>   }
>
>   on stg2 {
>     address 1.2.5.81:7788;
>   }
> }
>
> resource NAS {
>
>   protocol C;
>
>   handlers {
>     pri-on-incon-degr "echo O > /proc/sysrq-trigger ; halt -f";
>     pri-lost-after-sb "echo O > /proc/sysrq-trigger ; halt -f";
>     local-io-error "echo O > /proc/sysrq-trigger ; halt -f";
>     outdate-peer "/usr/lib64/heartbeat/drbd-peer-outdater";
>   }
>
>   startup {
>     wfc-timeout 0; ## Infinite!
>     degr-wfc-timeout 120; ## 2 minutes.
>   }
>
>   disk {
>     on-io-error detach;
>     fencing resource-only;
>   }
>
>   net {
>     # timeout 60;
>     # connect-int 10;
>     # ping-int 10;
>     # max-buffers 2048;
>     # max-epoch-size 2048;
>   }
>
>   syncer {
>     after "meta";
>   }
>
>   device /dev/drbd1;
>   disk /dev/md2;
>   meta-disk internal;
>
>   on stg1 {
>     address 1.2.5.80:7789;
>   }
>
>   on stg2 {
>     address 1.2.5.81:7789;
>   }
> }
>
> Any direction would be appreciated.
>
> Gary
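P.S. One more check that follows from my theory above: with "on-io-error detach" (as in the config), a primary that loses its backing disk keeps running Diskless and serves every read over the network, which would look exactly like this kind of read penalty. I have not confirmed that this is what happened here, but it is cheap to look for:

    # On the primary: Diskless here means no local copy is attached.
    drbdadm dstate NAS
    cat /proc/drbd

    # Repeat the read test and watch which devices are actually busy.
    dd if=/dev/drbd1 of=/dev/null bs=1M &
    iostat -k 2      # no md2 activity while drbd1 is read => data is coming from the peer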