On 6/26/06, Lars Ellenberg <Lars.Ellenberg at linbit.com> wrote:
> / 2006-06-24 12:33:38 +0200
> \ Andreas Schader:
> > I found out, that when I reboot both nodes with "shutdown -r now" at
> > the same time the syncing starts after both are up again and soon
> > after that secondary goes back to "Consistent" in /proc/drbd.
>
> is this a dedicated replication link?

yes, each machine has two NICs, one for the NFS lan and one connected
with a crossover cable with 1Gbit.

> as a workaround for whatever the real problem is with your setup,
> try to minimize nfs-activity, and do
<...>

I minimised io load and nfs traffic by turning off the nfs-kernel-server
and doing only some local testing on the primary node.

Here is what I did to get drbd to stop working:

# create a file on the primary drbd node
[root@nas1:/data1]# echo hello > /data1/testfile1

# unplug the network cable of the crossover link
# to simulate a network failure

# make changes to the primary file system while the secondary is not syncing
[root@nas1:/data1]# echo hello > /data1/testfile2

# reconnect the network cable

# after just a few bytes changed on disk secondary goes Inconsistent
[root@nas2:~]# cat /proc/drbd
version: 0.7.18 (api:78/proto:74)
SVN Revision: 2176 build by root@nas2, 2006-06-22 22:05:30
 0: cs:WFBitMapT st:Secondary/Primary ld:Inconsistent
    ns:0 nr:1623114 dw:1623114 dr:0 al:0 bm:572 lo:0 pe:0 ua:0 ap:0

# some more disk activity on primary while secondary is Inconsistent
[root@nas1:/data1]# echo hello > /data1/testfile3
[root@nas1:/data1]# echo hello > /data1/testfile4
# the testfile4 echo already hangs and never returns to the prompt

# to get primary working again I disconnect the resources
# this causes primary to finish the testfile4 echo and return to the prompt
[root@nas2:~]# drbdadm disconnect all

# now I try the suggested workaround
[root@nas1:/data1]# perl -e '$x = "X" x (500*1024*1024)'
[root@nas2:~]# perl -e '$x = "X" x (500*1024*1024)'
[root@nas2:~]# drbdadm connect all
[root@nas2:~]# cat /proc/drbd
version: 0.7.18 (api:78/proto:74)
SVN Revision: 2176 build by root@nas2, 2006-06-22 22:05:30
 0: cs:WFBitMapT st:Secondary/Primary ld:Inconsistent
    ns:0 nr:0 dw:1623114 dr:0 al:0 bm:572 lo:0 pe:0 ua:0 ap:0

but secondary remains inconsistent and still prints thousands of

drbd0: [drbd0_receiver/9922] sock_sendmsg time expired, ko = 4294967281

lines in syslog. After I rebooted both nodes it was working again.

> some technical background:

I skipped the io system analysis for now, because I don't think it is
causing the problems: they can be triggered with very small changes to
the filesystem, which shouldn't have an impact on the performance of the
disks. And to be honest I lack the experience with the tools you
suggested to know what I am looking for anyway ;-)

I will try to get hold of some other hardware and will try to test it
with smaller drbd devices. But in the meantime any more thoughts on this
would be really appreciated.

best regards,
Andreas
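
For anyone who wants to watch for this state from a script rather than by eye, here is a minimal sketch in Python that splits the device status line from the /proc/drbd output quoted above into its fields. The function name parse_drbd_status is hypothetical, and the parsing simply assumes the "key:value" layout visible in the 0.7.18 output in this mail (the two display lines joined into one); it is not taken from any DRBD tool and may not fit other versions.

```python
def parse_drbd_status(line):
    """Split a DRBD 0.7-style device status line into a dict of strings.

    Assumes the layout seen in the output quoted in this mail:
    a leading "<minor>:" followed by space-separated key:value tokens.
    """
    minor, _, rest = line.strip().partition(":")
    fields = {"minor": minor.strip()}
    for token in rest.split():
        key, sep, value = token.partition(":")
        if sep:  # skip anything that is not a key:value pair
            fields[key] = value
    return fields


# Status line copied from the second "cat /proc/drbd" above.
sample = ("0: cs:WFBitMapT st:Secondary/Primary ld:Inconsistent "
          "ns:0 nr:0 dw:1623114 dr:0 al:0 bm:572 lo:0 pe:0 ua:0 ap:0")

status = parse_drbd_status(sample)
print(status["cs"], status["st"], status["ld"])
# prints: WFBitMapT Secondary/Primary Inconsistent
```

Checking status["cs"] and status["ld"] in a loop would show whether the secondary ever leaves WFBitMapT / Inconsistent after the reconnect.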