Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hello, I run two node HP Proliant DL140 / Ubuntu 8.04 LTS / Heartbeat 2.1.3-2 cluster with bunch of KVM virtual servers. Images for virtual servers are stored in two DRBD 8.0.11 / OCFS2 1.3.9 dual-primary mode resources. Yesterday one node hung up and other one was killed probably via watchdog. After boot one resource succesfly went to " 0: cs:Connected st:Primary/Primary ds:UpToDate/UpToDate C r---" but other one remains in Partial synchronization state: " 1: cs:WFBitMapS st:Secondary/Secondary ds:UpToDate/Outdated C r-p-" " 1: cs:WFBitMapT st:Secondary/Secondary ds:Outdated/UpToDate C r---" Neither source nor target change to SyncSource or SyncTarget as expected. Strange messages appear in /var/log/kern.log on target: root at rrhb-gandalfb:~# grep drbd /var/log/kern.log|head -n 50 Mar 4 21:55:16 rrhb-gandalfb kernel: [ 111.710796] drbd: initialised. Version: 8.0.11 (api:86/proto:86) Mar 4 21:55:16 rrhb-gandalfb kernel: [ 111.710804] drbd: GIT-hash: b3fe2bdfd3b9f7c2f923186883eb9e2a0d3a5b1b build by phil at mescal, 2008-02-12 11:56:43 Mar 4 21:55:16 rrhb-gandalfb kernel: [ 111.710807] drbd: registered as block device major 147 Mar 4 21:55:16 rrhb-gandalfb kernel: [ 111.710808] drbd: minor_table @ 0xffff810222551900 Mar 4 21:55:16 rrhb-gandalfb kernel: [ 111.732238] drbd0: disk( Diskless -> Attaching ) Mar 4 21:55:16 rrhb-gandalfb kernel: [ 111.732245] drbd0: Starting worker thread (from cqueue/3 [5075]) Mar 4 21:55:16 rrhb-gandalfb kernel: [ 111.752807] drbd0: Found 6 transactions (324 active extents) in activity log. Mar 4 21:55:16 rrhb-gandalfb kernel: [ 111.752813] drbd0: max_segment_size ( = BIO size ) = 32768 Mar 4 21:55:16 rrhb-gandalfb kernel: [ 111.752818] drbd0: drbd_bm_resize called with capacity == 97656016 Mar 4 21:55:16 rrhb-gandalfb kernel: [ 111.753340] drbd0: resync bitmap: bits=12207002 words=190735 Mar 4 21:55:16 rrhb-gandalfb kernel: [ 111.753346] drbd0: size = 46 GB (48828008 KB) Mar 4 21:55:16 rrhb-gandalfb kernel: [ 111.783630] drbd0: reading of bitmap took 3 jiffies Mar 4 21:55:16 rrhb-gandalfb kernel: [ 111.784968] drbd0: recounting of set bits took additional 0 jiffies Mar 4 21:55:16 rrhb-gandalfb kernel: [ 111.784973] drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map. Mar 4 21:55:16 rrhb-gandalfb kernel: [ 111.785036] drbd0: Marked additional 1028 MB as out-of-sync based on AL. Mar 4 21:55:16 rrhb-gandalfb kernel: [ 111.915641] drbd0: disk( Attaching -> Consistent ) Mar 4 21:55:16 rrhb-gandalfb kernel: [ 111.915649] drbd0: Writing meta data super block now. Mar 4 21:55:16 rrhb-gandalfb kernel: [ 111.937355] drbd1: disk( Diskless -> Attaching ) Mar 4 21:55:16 rrhb-gandalfb kernel: [ 111.937365] drbd1: Starting worker thread (from cqueue/0 [5072]) Mar 4 21:55:16 rrhb-gandalfb kernel: [ 111.963938] drbd1: Found 6 transactions (154 active extents) in activity log. Mar 4 21:55:16 rrhb-gandalfb kernel: [ 111.963946] drbd1: max_segment_size ( = BIO size ) = 32768 Mar 4 21:55:16 rrhb-gandalfb kernel: [ 111.963950] drbd1: drbd_bm_resize called with capacity == 195312160 Mar 4 21:55:16 rrhb-gandalfb kernel: [ 111.965077] drbd1: resync bitmap: bits=24414020 words=381470 Mar 4 21:55:16 rrhb-gandalfb kernel: [ 111.965087] drbd1: size = 93 GB (97656080 KB) Mar 4 21:55:16 rrhb-gandalfb kernel: [ 112.005957] drbd1: reading of bitmap took 4 jiffies Mar 4 21:55:16 rrhb-gandalfb kernel: [ 112.008637] drbd1: recounting of set bits took additional 0 jiffies Mar 4 21:55:16 rrhb-gandalfb kernel: [ 112.008644] drbd1: 0 KB (0 bits) marked out-of-sync by on disk bit-map. Mar 4 21:55:16 rrhb-gandalfb kernel: [ 112.008695] drbd1: Marked additional 584 MB as out-of-sync based on AL. Mar 4 21:55:16 rrhb-gandalfb kernel: [ 112.095467] drbd1: disk( Attaching -> Consistent ) Mar 4 21:55:16 rrhb-gandalfb kernel: [ 112.095477] drbd1: Writing meta data super block now. Mar 4 21:55:16 rrhb-gandalfb kernel: [ 112.223170] drbd0: conn( StandAlone -> Unconnected ) Mar 4 21:55:16 rrhb-gandalfb kernel: [ 112.223193] drbd0: Starting receiver thread (from drbd0_worker [5086]) Mar 4 21:55:16 rrhb-gandalfb kernel: [ 112.223218] drbd0: receiver (re)started Mar 4 21:55:16 rrhb-gandalfb kernel: [ 112.223222] drbd0: conn( Unconnected -> WFConnection ) Mar 4 21:55:16 rrhb-gandalfb kernel: [ 112.224981] drbd1: conn( StandAlone -> Unconnected ) Mar 4 21:55:16 rrhb-gandalfb kernel: [ 112.225003] drbd1: Starting receiver thread (from drbd1_worker [5094]) Mar 4 21:55:16 rrhb-gandalfb kernel: [ 112.225029] drbd1: receiver (re)started Mar 4 21:55:16 rrhb-gandalfb kernel: [ 112.225034] drbd1: conn( Unconnected -> WFConnection ) Mar 4 21:56:49 rrhb-gandalfb kernel: [ 205.574743] drbd1: Handshake successful: DRBD Network Protocol version 86 Mar 4 21:56:49 rrhb-gandalfb kernel: [ 205.574749] drbd0: Handshake successful: DRBD Network Protocol version 86 Mar 4 21:56:49 rrhb-gandalfb kernel: [ 205.575131] drbd0: Peer authenticated using 20 bytes of 'sha1' HMAC Mar 4 21:56:49 rrhb-gandalfb kernel: [ 205.575142] drbd0: conn( WFConnection -> WFReportParams ) Mar 4 21:56:49 rrhb-gandalfb kernel: [ 205.575148] drbd0: Starting asender thread (from drbd0_receiver [5113]) Mar 4 21:56:49 rrhb-gandalfb kernel: [ 205.575174] drbd1: Peer authenticated using 20 bytes of 'sha1' HMAC Mar 4 21:56:49 rrhb-gandalfb kernel: [ 205.575184] drbd1: conn( WFConnection -> WFReportParams ) Mar 4 21:56:49 rrhb-gandalfb kernel: [ 205.575190] drbd1: Starting asender thread (from drbd1_receiver [5116]) Mar 4 21:56:50 rrhb-gandalfb kernel: [ 205.646147] drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) disk( Consistent -> UpToDate ) pdsk( DUnknown -> Outdated ) Mar 4 21:56:50 rrhb-gandalfb kernel: [ 205.646160] drbd0: Writing meta data super block now. Mar 4 21:56:50 rrhb-gandalfb kernel: [ 205.647489] drbd1: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapT ) disk( Consistent -> Outdated ) pdsk( DUnknown -> UpToDate ) Mar 4 21:56:50 rrhb-gandalfb kernel: [ 205.647501] drbd1: Writing meta data super block now. root at rrhb-gandalfb:~/tl# grep drbd /var/log/kern.log|head -n 60 Mar 4 21:55:16 rrhb-gandalfb kernel: [ 111.710796] drbd: initialised. Version: 8.0.11 (api:86/proto:86) Mar 4 21:55:16 rrhb-gandalfb kernel: [ 111.710804] drbd: GIT-hash: b3fe2bdfd3b9f7c2f923186883eb9e2a0d3a5b1b build by phil at mescal, 2008-02-12 11:56:43 Mar 4 21:55:16 rrhb-gandalfb kernel: [ 111.710807] drbd: registered as block device major 147 Mar 4 21:55:16 rrhb-gandalfb kernel: [ 111.710808] drbd: minor_table @ 0xffff810222551900 Mar 4 21:55:16 rrhb-gandalfb kernel: [ 111.732238] drbd0: disk( Diskless -> Attaching ) Mar 4 21:55:16 rrhb-gandalfb kernel: [ 111.732245] drbd0: Starting worker thread (from cqueue/3 [5075]) Mar 4 21:55:16 rrhb-gandalfb kernel: [ 111.752807] drbd0: Found 6 transactions (324 active extents) in activity log. Mar 4 21:55:16 rrhb-gandalfb kernel: [ 111.752813] drbd0: max_segment_size ( = BIO size ) = 32768 Mar 4 21:55:16 rrhb-gandalfb kernel: [ 111.752818] drbd0: drbd_bm_resize called with capacity == 97656016 Mar 4 21:55:16 rrhb-gandalfb kernel: [ 111.753340] drbd0: resync bitmap: bits=12207002 words=190735 Mar 4 21:55:16 rrhb-gandalfb kernel: [ 111.753346] drbd0: size = 46 GB (48828008 KB) Mar 4 21:55:16 rrhb-gandalfb kernel: [ 111.783630] drbd0: reading of bitmap took 3 jiffies Mar 4 21:55:16 rrhb-gandalfb kernel: [ 111.784968] drbd0: recounting of set bits took additional 0 jiffies Mar 4 21:55:16 rrhb-gandalfb kernel: [ 111.784973] drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map. Mar 4 21:55:16 rrhb-gandalfb kernel: [ 111.785036] drbd0: Marked additional 1028 MB as out-of-sync based on AL. Mar 4 21:55:16 rrhb-gandalfb kernel: [ 111.915641] drbd0: disk( Attaching -> Consistent ) Mar 4 21:55:16 rrhb-gandalfb kernel: [ 111.915649] drbd0: Writing meta data super block now. Mar 4 21:55:16 rrhb-gandalfb kernel: [ 111.937355] drbd1: disk( Diskless -> Attaching ) Mar 4 21:55:16 rrhb-gandalfb kernel: [ 111.937365] drbd1: Starting worker thread (from cqueue/0 [5072]) Mar 4 21:55:16 rrhb-gandalfb kernel: [ 111.963938] drbd1: Found 6 transactions (154 active extents) in activity log. Mar 4 21:55:16 rrhb-gandalfb kernel: [ 111.963946] drbd1: max_segment_size ( = BIO size ) = 32768 Mar 4 21:55:16 rrhb-gandalfb kernel: [ 111.963950] drbd1: drbd_bm_resize called with capacity == 195312160 Mar 4 21:55:16 rrhb-gandalfb kernel: [ 111.965077] drbd1: resync bitmap: bits=24414020 words=381470 Mar 4 21:55:16 rrhb-gandalfb kernel: [ 111.965087] drbd1: size = 93 GB (97656080 KB) Mar 4 21:55:16 rrhb-gandalfb kernel: [ 112.005957] drbd1: reading of bitmap took 4 jiffies Mar 4 21:55:16 rrhb-gandalfb kernel: [ 112.008637] drbd1: recounting of set bits took additional 0 jiffies Mar 4 21:55:16 rrhb-gandalfb kernel: [ 112.008644] drbd1: 0 KB (0 bits) marked out-of-sync by on disk bit-map. Mar 4 21:55:16 rrhb-gandalfb kernel: [ 112.008695] drbd1: Marked additional 584 MB as out-of-sync based on AL. Mar 4 21:55:16 rrhb-gandalfb kernel: [ 112.095467] drbd1: disk( Attaching -> Consistent ) Mar 4 21:55:16 rrhb-gandalfb kernel: [ 112.095477] drbd1: Writing meta data super block now. Mar 4 21:55:16 rrhb-gandalfb kernel: [ 112.223170] drbd0: conn( StandAlone -> Unconnected ) Mar 4 21:55:16 rrhb-gandalfb kernel: [ 112.223193] drbd0: Starting receiver thread (from drbd0_worker [5086]) Mar 4 21:55:16 rrhb-gandalfb kernel: [ 112.223218] drbd0: receiver (re)started Mar 4 21:55:16 rrhb-gandalfb kernel: [ 112.223222] drbd0: conn( Unconnected -> WFConnection ) Mar 4 21:55:16 rrhb-gandalfb kernel: [ 112.224981] drbd1: conn( StandAlone -> Unconnected ) Mar 4 21:55:16 rrhb-gandalfb kernel: [ 112.225003] drbd1: Starting receiver thread (from drbd1_worker [5094]) Mar 4 21:55:16 rrhb-gandalfb kernel: [ 112.225029] drbd1: receiver (re)started Mar 4 21:55:16 rrhb-gandalfb kernel: [ 112.225034] drbd1: conn( Unconnected -> WFConnection ) Mar 4 21:56:49 rrhb-gandalfb kernel: [ 205.574743] drbd1: Handshake successful: DRBD Network Protocol version 86 Mar 4 21:56:49 rrhb-gandalfb kernel: [ 205.574749] drbd0: Handshake successful: DRBD Network Protocol version 86 Mar 4 21:56:49 rrhb-gandalfb kernel: [ 205.575131] drbd0: Peer authenticated using 20 bytes of 'sha1' HMAC Mar 4 21:56:49 rrhb-gandalfb kernel: [ 205.575142] drbd0: conn( WFConnection -> WFReportParams ) Mar 4 21:56:49 rrhb-gandalfb kernel: [ 205.575148] drbd0: Starting asender thread (from drbd0_receiver [5113]) Mar 4 21:56:49 rrhb-gandalfb kernel: [ 205.575174] drbd1: Peer authenticated using 20 bytes of 'sha1' HMAC Mar 4 21:56:49 rrhb-gandalfb kernel: [ 205.575184] drbd1: conn( WFConnection -> WFReportParams ) Mar 4 21:56:49 rrhb-gandalfb kernel: [ 205.575190] drbd1: Starting asender thread (from drbd1_receiver [5116]) Mar 4 21:56:50 rrhb-gandalfb kernel: [ 205.646147] drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) disk( Consistent -> UpToDate ) pdsk( DUnknown -> Outdated ) Mar 4 21:56:50 rrhb-gandalfb kernel: [ 205.646160] drbd0: Writing meta data super block now. Mar 4 21:56:50 rrhb-gandalfb kernel: [ 205.647489] drbd1: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapT ) disk( Consistent -> Outdated ) pdsk( DUnknown -> UpToDate ) Mar 4 21:56:50 rrhb-gandalfb kernel: [ 205.647501] drbd1: Writing meta data super block now. Mar 4 21:56:50 rrhb-gandalfb kernel: [ 205.718045] drbd0: peer( Secondary -> Primary ) Mar 4 21:56:50 rrhb-gandalfb kernel: [ 205.756389] drbd0: conn( WFBitMapS -> SyncSource ) pdsk( Outdated -> Inconsistent ) Mar 4 21:56:50 rrhb-gandalfb kernel: [ 205.756407] drbd1: aftr_isp( 0 -> 1 ) Mar 4 21:56:50 rrhb-gandalfb kernel: [ 205.756412] drbd0: Began resync as SyncSource (will sync 2088960 KB [522240 bits set]). Mar 4 21:56:50 rrhb-gandalfb kernel: [ 205.756415] drbd0: Writing meta data super block now. Mar 4 21:56:50 rrhb-gandalfb kernel: [ 205.788553] drbd0: role( Secondary -> Primary ) Mar 4 21:57:08 rrhb-gandalfb kernel: [ 224.007713] drbd1: [drbd1_receiver/5116] sock_sendmsg time expired, ko = 4294967295 Mar 4 21:57:14 rrhb-gandalfb kernel: [ 230.000650] drbd1: [drbd1_receiver/5116] sock_sendmsg time expired, ko = 4294967294 Mar 4 21:57:20 rrhb-gandalfb kernel: [ 235.995426] drbd1: [drbd1_receiver/5116] sock_sendmsg time expired, ko = 4294967293 Mar 4 21:57:26 rrhb-gandalfb kernel: [ 241.989622] drbd1: [drbd1_receiver/5116] sock_sendmsg time expired, ko = 4294967292 I don't understand why ko starts with so high number. I suppose default ko-count 4 ??? I tried to move source to be primary but it failed: root at rrhb-gandalfs:~# drbdadm primary r1 No response from the DRBD driver! Is the module loaded? Module is loaded of course: root at rrhb-gandalfs:~/tl# /etc/init.d/drbd status drbd driver loaded OK; device status: version: 8.0.11 (api:86/proto:86) GIT-hash: b3fe2bdfd3b9f7c2f923186883eb9e2a0d3a5b1b build by phil at mescal, 2008-02-12 11:56:43 m:res cs st ds p mounted fstype 0:r0 Connected Primary/Primary UpToDate/UpToDate C /drbd0 ocfs2 1:r1 WFBitMapS Secondary/Secondary UpToDate/Outdated C Neither data discarding on target works: root at rrhb-gandalfb:~# drbdadm -- --discard-my-data connect r1 Child process does not terminate! Exiting. No response from the DRBD driver! Is the module loaded? How can I stop drbd1_receiver? I solved similar situation via reboots few months ago but I need to keep virtual servers running from drbd0 up now. Any idea, please? Thanks in advice Tomas