DRBD 8.2.6
Linux hp-tm-09 2.6.22.16-0.1-default #1 SMP 2008/01/23 14:28:52 UTC x86_64 x86_64 x86_64 GNU/Linux

Dear DRBD,

I have just suffered a kernel panic which I believe was triggered at the end of a 'verify', and I would like your opinion as to why this happened. I suspect it may be my own fault.

The primary log shows the start of the verify and a list of out-of-sync sectors found:

Oct 21 22:31:20 drbd0: conn( Connected -> VerifyS )
Oct 22 17:07:49 drbd0: Out of sync: start=1612817344, size=24 (sectors)
Oct 22 17:07:49 drbd0: Out of sync: start=1612818000, size=16 (sectors)
Oct 22 17:07:51 drbd0: Out of sync: start=1612867664, size=24 (sectors)
Oct 22 17:07:51 drbd0: Out of sync: start=1612869240, size=24 (sectors)
Oct 22 17:07:57 drbd0: Out of sync: start=1612989568, size=16 (sectors)
Oct 22 17:08:00 drbd0: Out of sync: start=1613046616, size=16 (sectors)
Oct 22 17:08:03 drbd0: Out of sync: start=1613103760, size=24 (sectors)

After this the event handler fired and sent the email. At this point the kernel panicked.

My secondary shows:

Oct 22 17:07:49 drbd0: Out of sync: start=1612817344, size=24 (sectors)
Oct 22 17:07:49 drbd0: Out of sync: start=1612818000, size=16 (sectors)
Oct 22 17:07:51 drbd0: Out of sync: start=1612867664, size=24 (sectors)
Oct 22 17:07:51 drbd0: Out of sync: start=1612869240, size=24 (sectors)
Oct 22 17:07:57 drbd0: Out of sync: start=1612989568, size=16 (sectors)
Oct 22 17:08:00 drbd0: Out of sync: start=1613046616, size=16 (sectors)
Oct 22 17:08:03 drbd0: Out of sync: start=1613103760, size=24 (sectors)
Oct 22 17:08:09 drbd0: Online verify done (total 67008 sec; paused 0 sec; 12036 K/sec)
Oct 22 17:08:09 drbd0: Online verify found 63225 4k block out of sync!
Oct 22 17:08:09 drbd0: helper command: /sbin/drbdadm out-of-sync
Oct 22 17:08:09 drbd0: conn( VerifyT -> Connected )
Oct 22 17:08:09 drbd0: Writing the whole bitmap, due to failed kmalloc
Oct 22 17:08:09 drbd0: writing of bitmap took 12 jiffies
Oct 22 17:08:09 drbd0: 247 MB (63225 bits) marked out-of-sync by on disk bit-map.
Oct 22 17:08:19 drbd0: PingAck did not arrive in time.
Oct 22 17:08:19 drbd0: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Oct 22 17:08:19 drbd0: asender terminated
Oct 22 17:08:19 drbd0: Terminating asender thread
Oct 22 17:08:19 drbd0: short read expecting header on sock: r=-512
Oct 22 17:08:19 drbd0: Writing meta data super block now.
Oct 22 17:08:19 drbd0: tl_clear()
Oct 22 17:08:19 drbd0: Connection closed
Oct 22 17:08:19 drbd0: conn( NetworkFailure -> Unconnected )
Oct 22 17:08:19 drbd0: receiver terminated
Oct 22 17:08:19 drbd0: receiver (re)started
Oct 22 17:08:19 drbd0: role( Secondary -> Primary )
Oct 22 17:08:19 drbd0: Writing meta data super block now.
Oct 22 17:08:19 drbd0: Creating new current UUID
Oct 22 17:08:19 drbd0: Writing meta data super block now.
Oct 22 17:08:19 drbd0: conn( Unconnected -> WFConnection )

After re-boot, the primary logs show:

Oct 22 17:18:35 drbd0: disk( Diskless -> Attaching )
Oct 22 17:18:35 drbd0: Starting worker thread (from cqueue/2 [205])
Oct 22 17:18:35 drbd0: Found 6 transactions (324 active extents) in activity log.
Oct 22 17:18:35 drbd0: max_segment_size ( = BIO size ) = 32768
Oct 22 17:18:35 drbd0: Adjusting my ra_pages to backing device's (32 -> 128)
Oct 22 17:18:35 drbd0: drbd_bm_resize called with capacity == 1613215170
Oct 22 17:18:35 drbd0: resync bitmap: bits=201651897 words=3150811
Oct 22 17:18:35 drbd0: size = 769 GB (806607585 KB)
Oct 22 17:18:35 drbd0: reading of bitmap took 33 jiffies
Oct 22 17:18:35 drbd0: recounting of set bits took additional 4 jiffies
Oct 22 17:18:35 drbd0: 247 MB (63225 bits) marked out-of-sync by on disk bit-map.
Oct 22 17:18:35 drbd0: Marked additional 1028 MB as out-of-sync based on AL.
Oct 22 17:18:35 drbd0: disk( Attaching -> UpToDate )
Oct 22 17:18:35 drbd0: Writing meta data super block now.
Oct 22 17:18:35 drbd0: Barriers not supported on meta data device - disabling
Oct 22 17:18:35 drbd0: conn( StandAlone -> Unconnected )
Oct 22 17:18:35 drbd0: Starting receiver thread (from drbd0_worker [5141])
Oct 22 17:18:35 drbd0: receiver (re)started
Oct 22 17:18:35 drbd0: conn( Unconnected -> WFConnection )
Oct 22 17:18:35 drbd0: Handshake successful: Agreed network protocol version 88
Oct 22 17:18:35 drbd0: conn( WFConnection -> WFReportParams )
Oct 22 17:18:35 drbd0: Starting asender thread (from drbd0_receiver [5174])
Oct 22 17:18:35 drbd0: data-integrity-alg: <not-used>
Oct 22 17:18:36 drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
Oct 22 17:18:36 drbd0: Writing meta data super block now.
Oct 22 17:18:36 drbd0: conn( WFBitMapT -> WFSyncUUID )
Oct 22 17:18:36 drbd0: helper command: /sbin/drbdadm before-resync-target
Oct 22 17:18:36 drbd0: conn( WFSyncUUID -> SyncTarget ) disk( UpToDate -> Inconsistent )
Oct 22 17:18:36 drbd0: Began resync as SyncTarget (will sync 1401952 KB [350488 bits set]).
Oct 22 17:18:36 drbd0: Writing meta data super block now.
Oct 22 17:20:53 drbd0: Resync done (total 136 sec; paused 0 sec; 10308 K/sec)
Oct 22 17:20:53 drbd0: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate )
Oct 22 17:20:53 drbd0: helper command: /sbin/drbdadm after-resync-target
Oct 22 17:20:53 drbd0: Writing meta data super block now.

There are no other logs reporting problems.

I believe, but would very much like an opinion, that this happened because I removed the setting:

disk { size 769G; }

I removed it because of a bug I could not work around, and following a posting from Lars, subject "out of range error", from 2008-10-17 at 11:54. (Copied at bottom of this email.)
My two disks are slightly different sizes; this number was the size of the smaller disk, rounded down. I could not find a procedure for removing this setting, so I simply removed it and restarted drbd. Everything seemed to be working perfectly. If this was wrong, I am the idiot and I attribute no blame to DRBD!

Then, at the end of the verify, the kernel panicked.

I am now worried that any time I run 'verify' this might happen again, which is why I am very interested in your opinion as to whether the kernel panicked because of what I did, or whether there might be a fairly serious bug in 8.2.6.

Many kind regards,

Ben Clewett

-----------------------------/proc/drbd-------------------------

version: 8.2.6 (api:88/proto:86-88)
GIT-hash: 3e69822d3bb4920a8c1bfdf7d647169eba7d2eb4 build by root at hp-tm-09, 2008-10-21 22:13:46
 0: cs:Connected st:Primary/Secondary ds:UpToDate/UpToDate C r---
    ns:5045036 nr:1401949 dw:6446985 dr:3951715 al:20356 bm:1594 lo:2 pe:0 ua:0 ap:2 oos:0

-----------------------------CONFIG------------------------------

common {
  net {
    max-buffers 40000;
    unplug-watermark 40000;
    max-epoch-size 16384;
    after-sb-0pri disconnect;
    after-sb-1pri disconnect;
    after-sb-2pri disconnect;
    rr-conflict disconnect;
  }
  syncer {
    rate 10M;
    al-extents 257;
    verify-alg crc32c;
    cpu-mask 1;
  }
  startup {
    degr-wfc-timeout 120;
  }
  handlers {
    pri-on-incon-degr "echo o > /proc/sysrq-trigger ; halt -f";
    pri-lost-after-sb "echo o > /proc/sysrq-trigger ; halt -f";
    local-io-error "echo o > /proc/sysrq-trigger ; halt -f";
    outdate-peer "/usr/lib/heartbeat/drbd-peer-outdater -t 5";
    pri-lost "echo dbms-04 pri-lost. Have a look at the log files. | mail -s 'DRBD Alert dbms-04' sysalerts at ........";
    split-brain "echo dbms-04 split-brain. drbdadm -- --discard-my-data connect $DRBD_RESOURCE ? | mail -s 'dbms-04 DRBD Alert' sysalerts at .........";
    out-of-sync "echo dbms-04 out-of-sync on resource $DRBD_RESOURCE. | mail -s 'dbms-04 DRBD Alert' sysalerts at ..........";
  }
}

resource dbms-04 {
  protocol C;
  on hp-tm-09 {
    device /dev/drbd0;
    disk /dev/cciss/c0d0p4;
    address 192.168.95.18:7789;
    meta-disk /dev/cciss/c0d0p3 [0];
  }
  on hp-tm-06 {
    device /dev/drbd0;
    disk /dev/cciss/c0d0p4;
    address 192.168.95.17:7788;
    meta-disk /dev/cciss/c0d0p3 [0];
  }
  disk {
    on-io-error detach;
  }
}

--------------------------------EMAIL---------------------------------

On Thu, Oct 16, 2008 at 12:04:19PM -0700, Jeffrey Froman wrote:
> >
> > I have a drbd resource that appears to be working correctly. However,
> > when I run "drbdadm -d adjust <resource>" (without changing the
> > working configuration), I get:
> >
> > 286744120s: out of range
> >
> > This number is the configured size of the drbd disk (see drbd.conf
> > below), but:
> >
> > [root ~]# blockdev --getsize /dev/sdc1
> > 286744122
> >
> > So why am I getting an "out of range" error for an otherwise seemingly
> > working resource?
> >
> > Thank you,
> > Jeffrey

you normally should not use that config parameter.
since we almost never do, we did not notice yet
that there is a bug in the 's' unit conversion.

simple fix: just DO NOT explicitly set a size.
it is only useful in very special cases if at all.
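
As a sanity check on the figures in the logs above, here is a minimal Python sketch that recomputes them. It assumes one bitmap bit covers one 4 KiB block, which is how I read "4k block" in the verify message; the function names are made up for illustration and are not part of DRBD or drbdadm.

```python
import re

# Assumption: one bitmap bit covers one 4 KiB block ("4k block" in the log).
BLOCK_KB = 4

# Matches log lines such as:
#   drbd0: Out of sync: start=1612817344, size=24 (sectors)
OOS_RE = re.compile(r"Out of sync: start=(\d+), size=(\d+) \(sectors\)")


def bits_to_kb(bits: int) -> int:
    """Convert a count of out-of-sync bitmap bits to KB."""
    return bits * BLOCK_KB


def bits_to_mb(bits: int) -> int:
    """Convert a count of out-of-sync bitmap bits to MB, rounded."""
    return round(bits * BLOCK_KB / 1024)


def out_of_sync_sectors(log_text: str) -> int:
    """Sum the 'size' field of every 'Out of sync' line in log_text."""
    return sum(int(size) for _start, size in OOS_RE.findall(log_text))


if __name__ == "__main__":
    # 63225 bits reported by the verify -> "247 MB" in the log
    print(bits_to_mb(63225), "MB")
    # 350488 bits set at resync start -> "will sync 1401952 KB"
    print(bits_to_kb(350488), "KB")
    # 1401952 KB in 136 sec -> "10308 K/sec"
    print(bits_to_kb(350488) // 136, "K/sec")
```

With these assumptions the numbers agree with each other: 63225 bits is about 247 MB, 350488 bits is 1401952 KB, and that amount resynced in 136 seconds gives the reported 10308 K/sec.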