Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Again me waking up an old thread ... But since I've ruled out networking
problems, things got a bit mysterious again and I'd like to gather more
information.

I run Virtuozzo on an active/passive drbd setup, so I'm tied to the drbd
module that comes with the Virtuozzo RHEL4-based kernel. There haven't been
any changes that would touch this problem up to 0.7.25 anyway. Running
Virtuozzo also means I'm running ext3 formatted with 1k block size. Hardware
is the same as elsewhere in this thread - HP DL385, bnx2 nics, Smart Array
raid (with battery).

As our VPSes get more and more load, I'm starting to notice a pattern in how
these disconnects occur. In our case they're always triggered by an IO spike,
like migration of a VPS, a mysql alter table over some huge table or some
such.

Until now I've had on-disconnect reconnect in drbd.conf. That meant that
everything went into D state when a failure occurred and the automatic
reconnect *never* happened. Now I have on-disconnect stand_alone, so at least
things keep working while I can poke at drbd on a production system ;)

Observations: drbd works fine (for weeks) *until* the first lockup. IO
freezes until ko_count counts down to 0, drbd disconnects to StandAlone and
the machine works on. After that, any attempt to 'drbdadm connect drbd0'
would start synchronisation, but that would rarely finish, because the lockup
would repeat very soon. Even in cases when the sync finishes, the lockup
comes back within minutes.

I wrote a quick script that reconnects drbd if it sees StandAlone in
/proc/drbd (a rough sketch of it follows the netstat output below), but that
led me to an even more interesting state: IO was frozen, processes were
piling up in D, but ko_count never decreased, as if everything were perfectly
normal.

Here is hopefully some meaningful info from one such case.

/proc/drbd from primary:

version: 0.7.22 (api:79/proto:74)
SVN Revision: 2554 build by phil at mescal, 2006-10-23 11:00:31
 0: cs:Connected st:Primary/Secondary ld:Consistent
    ns:257962 nr:0 dw:314094432 dr:1380965795 al:47755110 bm:701268 lo:0 pe:11 ua:0 ap:10

/proc/drbd from secondary:

version: 0.7.22 (api:79/proto:74)
SVN Revision: 2554 build by phil at mescal, 2006-10-23 11:00:31
 0: cs:Connected st:Secondary/Primary ld:Consistent
    ns:0 nr:257952 dw:281716413 dr:0 al:0 bm:44473 lo:0 pe:0 ua:0 ap:0

ps from primary:

# ps -eo pid,state,wchan:40,comm | grep -Ee " D |drbd"
  222 D -                pdflush
14783 D wait_on_buffer   kjournald
23254 D -                mysqld
25469 D -                syslogd
26924 D -                nginx
31281 D -                nginx
  753 D -                nginx
 4120 D -                syslogd
 4566 D -                nginx
15062 D -                syslogd
 6178 D -                syslogd
23532 D -                nginx
 2175 D -                syslogd
 2449 D -                php-cgi
 4630 D -                syslogd
10669 D -                syslogd
27630 D -                nginx
29669 D -                syslogd
31133 D -                syslogd
20906 D -                syslogd
21322 D -                nginx
17225 D -                httpd
25121 D -                sshd
25133 D -                bash
27003 D -                nginx
27333 D -                nginx
24983 D -                php-cgi
24992 D -                syslogd
15155 D -                php-cgi
16519 D -                php-cgi
 3059 D wait_on_buffer   nginx
10388 D -                php-cgi
 5012 D -                php-cgi
14661 D -                php-cgi
12616 D -                php-cgi
13294 D -                php-cgi
 3037 D -                php-cgi
 3040 D -                php-cgi
10396 D -                php-cgi
18457 D -                vmstat
19394 S -                drbd0_worker
20970 D -                php-cgi
 2470 D -                php-cgi
 4445 D -                php-cgi
 6155 D -                php
14480 D -                php
15121 S -                drbd0_receiver
18657 S 1104414293752    drbd0_asender
20959 D -                php-cgi
22648 D -                php
23212 D -                php
23355 D -                php
24585 D -                kill_dead_conne
27312 D -                grep
29512 D -                php
30881 D -                php
 4766 D -                php
 4933 D -                php
 6198 D -                kill_dead_conne

# netstat -tnp |grep 10.10.10
tcp        0      0 10.10.10.2:7789      10.10.10.1:33093     ESTABLISHED -
tcp        0  11568 10.10.10.2:41041     10.10.10.1:7789      ESTABLISHED -
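The reconnect script is basically just a polling loop; a minimal sketch (not
the exact script - the 30s interval and the logger call are made up) looks
like this:

#!/bin/sh
# Reconnect the drbd0 resource whenever /proc/drbd shows it has dropped to
# StandAlone (e.g. after ko_count expired with on-disconnect stand_alone).
# NB: the 30s interval and the log message are illustrative, not the
# original values.
while true; do
    if grep -q 'cs:StandAlone' /proc/drbd; then
        logger "drbd0 is StandAlone, trying drbdadm connect"
        drbdadm connect drbd0
    fi
    sleep 30
done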
<lars.ellenberg at linbit.com> wrote: > on a Secondary, > if ua stays != zero, and ns,nr,dw,dr do not increase during that time, > drbd has a problem. if those ns,nr,dw,dr still increase, or ua is zero, > all is fine. > > on a Primary, > if ap or pe stays != zero, and the nfs,nrw,dw,dr do not increase, > drbd has a problem, if those ns,nrw,dw,dr do still increase, > or pe is zero, all is fine. This makes me wonder... Maybe 0.1s resolution from watch is not enough or I can't follow two files that fast, but from my impression it seems that secondary would fall under your definition of "all is fine" while primary would fall under "drbd has a problem". Maybe it is so only after secondary stopped receiving all updates and its counters were 0, while primary was waiting for something else to finish ... I found something from 2006 that gave me further thoughts: http://www.gossamer-threads.com/lists/drbd/users/10132#10132 Could this be some not-that-obvious memory/vm/tcp/buffers kernel chichken-and-egg problem? That possibly only comes up under memory pressure? These two machines have 8GB with about 4.5GB in buffers and cache and rest for apps and near 0 swap usage. I often see "Out of socket memory", but I suspect this comes up because we have VPSes configured to use sockets wherever possible instead of tcp connections. How can I dig further about how much memory is allocated for what? I did sysrq-t (which cost me cluster reboot as HA panicked because dump took a few seconds too long ;) and am attaching it here. Hopefully it will shine some light on the detective work across kernel funcions ;) Because of the reboot I now cannot trigger the problem at will ... at least not until it happens again. -- Jure Pečar http://jure.pecar.org -------------- next part -------------- A non-text attachment was scrubbed... Name: trace.gz Type: application/octet-stream Size: 18457 bytes Desc: not available URL: <http://lists.linbit.com/pipermail/drbd-user/attachments/20080220/1b4116d3/attachment.obj>