(I apologize for the dupe, I just posted this on the devel list but I intended to post this here)

A weird problem hit me today: I changed the MTU on the standby node in an active/passive PG cluster based on drbd; this caused a freeze of exactly 15 min on the drbd device, during which all postgres threads couldn't commit.

Any idea why the timeout was so long? Note that the two nodes are in separate locations, linked by a (currently mostly idle) 100Mbps bridge.

This is the last message from postgres in /var/log/messages:

    Dec 2 10:20:18 alice postgres[28729]: [38440-1] 2008-12-02 10:20:18 GMT radiusdb 192.168.0.51 28729 48fff5bc.7039 SELECTLOG: duration: 0.183 ms

Nothing happens for 15 mins, until this shows up:

    Dec 2 10:35:12 alice kernel: drbd1: sock_recvmsg returned -110
    Dec 2 10:35:12 alice kernel: drbd1: peer( Secondary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown )
    Dec 2 10:35:12 alice kernel: drbd1: short read expecting header on sock: r=-110
    Dec 2 10:35:12 alice kernel: drbd1: asender terminated
    Dec 2 10:35:12 alice kernel: drbd1: Terminating asender thread
    Dec 2 10:35:12 alice kernel: drbd1: Creating new current UUID
    Dec 2 10:35:12 alice kernel: drbd1: Writing meta data super block now.
    Dec 2 10:35:12 alice kernel: drbd1: tl_clear()
    Dec 2 10:35:12 alice kernel: drbd1: Connection closed
    Dec 2 10:35:12 alice kernel: drbd1: conn( BrokenPipe -> Unconnected )
    Dec 2 10:35:12 alice kernel: drbd1: receiver terminated
    Dec 2 10:35:12 alice kernel: drbd1: receiver (re)started

This is:

    # uname -a
    Linux alkaid 2.6.18-92.1.13.el5 #1 SMP Thu Sep 4 03:51:21 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux
    # rpm -qa |grep drbd
    drbd-km-2.6.18_92.1.13.el5-8.2.6-3
    drbd-8.2.6-3

Here is my drbd.conf:

    common {
      protocol C;
      startup {
        wfc-timeout 10;
        degr-wfc-timeout 10;
      }
      disk {
        on-io-error detach;
      }
      net {
        cram-hmac-alg "sha1";
        shared-secret "xxxxxxx";
      }
      syncer {
        rate 10M;
        verify-alg md5;
      }
    }

    resource rb {
      on alice {
        device /dev/drbd1;
        disk /dev/System/data_share;
        address 192.168.5.21:7789;
        meta-disk internal;
      }
      on bob {
        device /dev/drbd1;
        disk /dev/System/data_share;
        address 192.168.5.22:7789;
        meta-disk internal;
      }
    }
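
For reference, the length of the stall can be measured from the two syslog timestamps quoted above (the last postgres line and the first drbd1 line). This is only a minimal sketch: the helper name stall_seconds is hypothetical, and the year is an assumption taken from the postgres log line, since the syslog prefix omits it.

```python
from datetime import datetime


def stall_seconds(last_ok: str, first_error: str, year: int = 2008) -> float:
    """Seconds between two syslog-style timestamps such as "Dec 2 10:20:18".

    The year is supplied separately (assumed 2008 here) because the
    syslog prefix does not carry it.
    """
    fmt = "%Y %b %d %H:%M:%S"
    start = datetime.strptime(f"{year} {last_ok}", fmt)
    end = datetime.strptime(f"{year} {first_error}", fmt)
    return (end - start).total_seconds()


# Last postgres message vs. first drbd1 kernel message from the logs above.
gap = stall_seconds("Dec 2 10:20:18", "Dec 2 10:35:12")
print(f"{gap:.0f} s = {gap / 60:.1f} min")
```

This only quantifies the gap between the two log entries; it says nothing about which timer caused it.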