On Thu, May 10, 2012 at 7:15 PM, Brian Chrisman <brchrisman at gmail.com> wrote:

> version: 8.3.11 (api:88/proto:86-96)
> May 10 21:11:55 scale-192-168-54-14 kernel: block drbd1: [drbd1_worker/8936] sock_sendmsg time expired, ko = 4294967295
> ... (long countdown that will never get anywhere)
>
> This is a "dual primary" setup (underneath GPFS) over a failover-bonded
> network interface.
> Everything works fine (read/write/reboot/etc) until I attempt a verify.
>
> My configuration has no reference to ko-count, which from the
> documentation suggests it should be 0 and be disabled. Does the
> documentation actually intend to say that the default is 2^32?
> I'm building/running this all on a clone of RHEL6.2.
>
> This is occurring during an attempt to 'verify' a dual primary DRBD
> device. Originally I received this message on every attempt at verify, but
> after I reduced syncer { rate }, this message only crops up after a few
> iterations. There is no network/connectivity problem during this time
> period, yet drbd commands hang such as:
>

Apologies for the repost, but I think I've cleared up my out-of-sync
messages. I'm bypassing the initial sync (throwing away data on a second
drive), and it looks like the preexisting-but-abandoned data is hosing up
the verify (I checked by zeroing all blocks before creating the device in
the same fashion, after which the verify works fine).

I bypass the initial sync because I'm using dual-primary, and to get into
dual-primary the sync must be finished (with 2TB SATA drives, that takes a
*long* time). I'm trying to think of ways to get around this issue but it's
a tough one.

-brian

> strace -f drbdsetup 1 disconnect --force
> ...
> stat("/proc/drbd", {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
> open("/var/lock/drbd-147-1", O_RDWR|O_CREAT, 0600) = 3
> rt_sigaction(SIGALRM, {0x406b30, [], SA_RESTORER, 0x3935232900}, {SIG_DFL, [], 0}, 8) = 0
> alarm(1) = 0
> fcntl(3, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=0, len=0}) = 0
> alarm(0) = 1
> rt_sigaction(SIGALRM, {SIG_DFL, [], SA_RESTORER, 0x3935232900}, NULL, 8) = 0
> socket(PF_NETLINK, SOCK_DGRAM, 11) = 4
> getpid() = 13360
> bind(4, {sa_family=AF_NETLINK, pid=13360, groups=ffffffff}, 12) = 0
> sendto(4, "9\0\0\0\3\0\0\0\1\0\0\00004\0\0\4\0\0\0\1\0\0\0\1\0\0\00004\0\0"..., 57, 0, NULL, 0) = 57
> poll([{fd=4, events=POLLIN}], 1, 120000
> << this is where it hangs and exits after a terminate (ctrl-c) >>
>
> All that's going on in the dmesg output is sock_sendmsg expiration reports.
>
> The documentation here also would be better if *count* and *number* were
> consistent (either 'count' or 'number').
>
>> ko-count *number*
>> In case the secondary node fails to complete a single write request for
>> *count* times the *timeout*, it is expelled from the cluster. (I.e. the
>> primary node goes into StandAlone mode.) The default value is 0, which
>> disables this feature.
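On the "is the default 2^32?" question: the logged value 4294967295 is 2^32 - 1, not 2^32, which is what an unsigned 32-bit counter holds after being decremented from 0. So one plausible reading (an assumption on my part, not something confirmed from the DRBD source in this thread) is that ko-count really is 0 as documented and the log line prints the counter after a decrement that wrapped around. A minimal Python sketch of that arithmetic; `as_u32` and `decrement_u32` are hypothetical helper names, not DRBD functions:

```python
# Sketch only: models unsigned 32-bit wrap-around, not actual DRBD code.
UINT32_MASK = 0xFFFFFFFF


def as_u32(value: int) -> int:
    """Reduce a Python int to what an unsigned 32-bit field would hold."""
    return value & UINT32_MASK


def decrement_u32(counter: int) -> int:
    """Decrement a counter the way an unsigned 32-bit variable would."""
    return as_u32(counter - 1)


logged_ko = 4294967295  # value from the kernel log line in the post

print(logged_ko == 2**32 - 1)         # the logged value is 2^32 - 1
print(decrement_u32(0) == logged_ko)  # 0 - 1 wraps to the same value
print(as_u32(-1) == logged_ko)        # so does -1 printed as unsigned
```

If that reading holds, the "long countdown that will never get anywhere" would be the wrapped counter ticking down from 2^32 - 1, which fits a default of 0 rather than a default of 2^32.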