[DRBD-user] Verify causing kernel panic.

Wed Oct 22 20:16:16 CEST 2008

DRBD 8.2.6
Linux hp-tm-09 2.6.22.16-0.1-default #1 SMP 2008/01/23 14:28:52 UTC x86_64 x86_64 x86_64 GNU/Linux

Dear DRBD,

I have just suffered a kernel panic which I believe was caused by the 
end of a 'verify', and I would like your opinion as to why this 
happened.  I do have a suspicion it may be my fault.

The primary log shows the start of the verify and a list of bad sectors 
found:

Oct 21 22:31:20  drbd0: conn( Connected -> VerifyS )
Oct 22 17:07:49  drbd0: Out of sync: start=1612817344, size=24 (sectors)
Oct 22 17:07:49  drbd0: Out of sync: start=1612818000, size=16 (sectors)
Oct 22 17:07:51  drbd0: Out of sync: start=1612867664, size=24 (sectors)
Oct 22 17:07:51  drbd0: Out of sync: start=1612869240, size=24 (sectors)
Oct 22 17:07:57  drbd0: Out of sync: start=1612989568, size=16 (sectors)
Oct 22 17:08:00  drbd0: Out of sync: start=1613046616, size=16 (sectors)
Oct 22 17:08:03  drbd0: Out of sync: start=1613103760, size=24 (sectors)

After this, the out-of-sync event handler fired and sent its alert 
email.  At that point the kernel panicked.

My secondary shows:

Oct 22 17:07:49  drbd0: Out of sync: start=1612817344, size=24 (sectors)
Oct 22 17:07:49  drbd0: Out of sync: start=1612818000, size=16 (sectors)
Oct 22 17:07:51  drbd0: Out of sync: start=1612867664, size=24 (sectors)
Oct 22 17:07:51  drbd0: Out of sync: start=1612869240, size=24 (sectors)
Oct 22 17:07:57  drbd0: Out of sync: start=1612989568, size=16 (sectors)
Oct 22 17:08:00  drbd0: Out of sync: start=1613046616, size=16 (sectors)
Oct 22 17:08:03  drbd0: Out of sync: start=1613103760, size=24 (sectors)
Oct 22 17:08:09  drbd0: Online verify  done (total 67008 sec; paused 0 sec; 12036 K/sec)
Oct 22 17:08:09  drbd0: Online verify found 63225 4k block out of sync!
Oct 22 17:08:09  drbd0: helper command: /sbin/drbdadm out-of-sync
Oct 22 17:08:09  drbd0: conn( VerifyT -> Connected )
Oct 22 17:08:09  drbd0: Writing the whole bitmap, due to failed kmalloc
Oct 22 17:08:09  drbd0: writing of bitmap took 12 jiffies
Oct 22 17:08:09  drbd0: 247 MB (63225 bits) marked out-of-sync by on disk bit-map.
Oct 22 17:08:19  drbd0: PingAck did not arrive in time.
Oct 22 17:08:19  drbd0: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Oct 22 17:08:19  drbd0: asender terminated
Oct 22 17:08:19  drbd0: Terminating asender thread
Oct 22 17:08:19  drbd0: short read expecting header on sock: r=-512
Oct 22 17:08:19  drbd0: Writing meta data super block now.
Oct 22 17:08:19  drbd0: tl_clear()
Oct 22 17:08:19  drbd0: Connection closed
Oct 22 17:08:19  drbd0: conn( NetworkFailure -> Unconnected )
Oct 22 17:08:19  drbd0: receiver terminated
Oct 22 17:08:19  drbd0: receiver (re)started
Oct 22 17:08:19  drbd0: role( Secondary -> Primary )
Oct 22 17:08:19  drbd0: Writing meta data super block now.
Oct 22 17:08:19  drbd0: Creating new current UUID
Oct 22 17:08:19  drbd0: Writing meta data super block now.
Oct 22 17:08:19  drbd0: conn( Unconnected -> WFConnection )
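
As a quick cross-check of the figures in the secondary's log, the 
following Python sketch parses the "Out of sync" lines quoted above and 
converts the verify summary into megabytes.  It assumes 512-byte sectors 
and that "MB" in the log means MiB; neither is stated in the log itself.

```python
import re

# "Out of sync" lines copied from the log excerpt above.
LOG = """\
Oct 22 17:07:49  drbd0: Out of sync: start=1612817344, size=24 (sectors)
Oct 22 17:07:49  drbd0: Out of sync: start=1612818000, size=16 (sectors)
Oct 22 17:07:51  drbd0: Out of sync: start=1612867664, size=24 (sectors)
Oct 22 17:07:51  drbd0: Out of sync: start=1612869240, size=24 (sectors)
Oct 22 17:07:57  drbd0: Out of sync: start=1612989568, size=16 (sectors)
Oct 22 17:08:00  drbd0: Out of sync: start=1613046616, size=16 (sectors)
Oct 22 17:08:03  drbd0: Out of sync: start=1613103760, size=24 (sectors)
"""

SECTOR_BYTES = 512   # assumption: log sectors are 512 bytes
BLOCK_BYTES = 4096   # "4k block" in the verify summary line

pattern = re.compile(r"Out of sync: start=(\d+), size=(\d+) \(sectors\)")
ranges = [(int(start), int(size)) for start, size in pattern.findall(LOG)]

# Total size of the ranges that were actually quoted in this email.
listed_sectors = sum(size for _, size in ranges)
listed_kib = listed_sectors * SECTOR_BYTES // 1024

# "Online verify found 63225 4k block out of sync!"
total_blocks = 63225
total_mib = total_blocks * BLOCK_BYTES / 2**20

print(f"{len(ranges)} ranges quoted, {listed_sectors} sectors, {listed_kib} KiB")
print(f"{total_blocks} blocks of 4 KiB is about {total_mib:.2f} MiB")
```

The quoted ranges cover only a small part of the total, so the excerpt 
is a sample of what verify found and not the complete list.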

After re-boot, the primary logs show:

Oct 22 17:18:35  drbd0: disk( Diskless -> Attaching )
Oct 22 17:18:35  drbd0: Starting worker thread (from cqueue/2 [205])
Oct 22 17:18:35  drbd0: Found 6 transactions (324 active extents) in activity log.
Oct 22 17:18:35  drbd0: max_segment_size ( = BIO size ) = 32768
Oct 22 17:18:35  drbd0: Adjusting my ra_pages to backing device's (32 -> 128)
Oct 22 17:18:35  drbd0: drbd_bm_resize called with capacity == 1613215170
Oct 22 17:18:35  drbd0: resync bitmap: bits=201651897 words=3150811
Oct 22 17:18:35  drbd0: size = 769 GB (806607585 KB)
Oct 22 17:18:35  drbd0: reading of bitmap took 33 jiffies
Oct 22 17:18:35  drbd0: recounting of set bits took additional 4 jiffies
Oct 22 17:18:35  drbd0: 247 MB (63225 bits) marked out-of-sync by on disk bit-map.
Oct 22 17:18:35  drbd0: Marked additional 1028 MB as out-of-sync based on AL.
Oct 22 17:18:35  drbd0: disk( Attaching -> UpToDate )
Oct 22 17:18:35  drbd0: Writing meta data super block now.
Oct 22 17:18:35  drbd0: Barriers not supported on meta data device - disabling
Oct 22 17:18:35  drbd0: conn( StandAlone -> Unconnected )
Oct 22 17:18:35  drbd0: Starting receiver thread (from drbd0_worker [5141])
Oct 22 17:18:35  drbd0: receiver (re)started
Oct 22 17:18:35  drbd0: conn( Unconnected -> WFConnection )
Oct 22 17:18:35  drbd0: Handshake successful: Agreed network protocol version 88
Oct 22 17:18:35  drbd0: conn( WFConnection -> WFReportParams )
Oct 22 17:18:35  drbd0: Starting asender thread (from drbd0_receiver [5174])
Oct 22 17:18:35  drbd0: data-integrity-alg: <not-used>
Oct 22 17:18:36  drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
Oct 22 17:18:36  drbd0: Writing meta data super block now.
Oct 22 17:18:36  drbd0: conn( WFBitMapT -> WFSyncUUID )
Oct 22 17:18:36  drbd0: helper command: /sbin/drbdadm before-resync-target
Oct 22 17:18:36  drbd0: conn( WFSyncUUID -> SyncTarget ) disk( UpToDate -> Inconsistent )
Oct 22 17:18:36  drbd0: Began resync as SyncTarget (will sync 1401952 KB [350488 bits set]).
Oct 22 17:18:36  drbd0: Writing meta data super block now.
Oct 22 17:20:53  drbd0: Resync done (total 136 sec; paused 0 sec; 10308 K/sec)
Oct 22 17:20:53  drbd0: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate )
Oct 22 17:20:53  drbd0: helper command: /sbin/drbdadm after-resync-target
Oct 22 17:20:53  drbd0: Writing meta data super block now.

There are no other logs reporting problems.
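
The sizes the primary reported after re-boot can be reproduced from the 
logged capacity.  The sketch below is only arithmetic on values taken 
from the log; it assumes the capacity is in 512-byte sectors, that each 
bitmap bit covers 4 KiB, and that bitmap words are 64 bits wide (the 
host is x86_64).  None of these are stated in the log.

```python
SECTOR_BYTES = 512   # assumption: capacity is counted in 512-byte sectors
BIT_BYTES = 4096     # assumption: one bitmap bit per 4 KiB block
WORD_BITS = 64       # assumption: 64-bit bitmap words on x86_64


def ceil_div(a: int, b: int) -> int:
    """Integer division rounding up, avoiding floating point."""
    return (a + b - 1) // b


# "drbd_bm_resize called with capacity == 1613215170"
capacity_sectors = 1613215170

size_kib = capacity_sectors * SECTOR_BYTES // 1024
size_gib = size_kib // 2**20
bits = ceil_div(capacity_sectors * SECTOR_BYTES, BIT_BYTES)
words = ceil_div(bits, WORD_BITS)

# "Began resync as SyncTarget (will sync 1401952 KB [350488 bits set])"
resync_bits = 350488
resync_kib = resync_bits * BIT_BYTES // 1024

# "Resync done (total 136 sec; ...)"
resync_rate = resync_kib // 136

print(f"size = {size_gib} GB ({size_kib} KB)")
print(f"bits={bits} words={words}")
print(f"resync: {resync_kib} KB at {resync_rate} K/sec")
```

Under those assumptions the computed values agree with the logged 
"size", "bits", "words", resync amount and resync rate.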

I believe, but would very much like an opinion, that this is because I 
removed the setting:

disk {
     size 769G;
}

I removed it because of a bug I could not work around, following a 
posting from Lars with the subject "out of range error", dated 
2008-10-17 at 11:54.  (Copied at the bottom of this email.)

My two disks are slightly different sizes; this number was the size of 
the smaller disk, rounded down.

I could not find a protocol for removing this setting, so simply removed 
it and restarted drbd.  Everything seemed to be working perfectly.  If 
this is wrong, I am the idiot and I attribute no blame to DRBD!
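
One thing the logged numbers allow checking is whether the out-of-sync 
sectors lie in the area that only became part of the device once the 
size limit was removed.  The Python sketch below assumes that "769G" in 
drbd.conf means 769 GiB, that sectors are 512 bytes, and that the 
capacity logged by the primary after re-boot was already in effect 
during the verify.  It says nothing about the panic itself, only about 
where the reported differences are located.

```python
SECTOR_BYTES = 512   # assumption: 512-byte sectors
BLOCK_BYTES = 4096   # verify reports differences in 4 KiB blocks

# Old limit from the removed "size 769G;" setting (assumption: G = GiB).
old_limit_sectors = 769 * 2**30 // SECTOR_BYTES

# "drbd_bm_resize called with capacity == 1613215170" (after re-boot).
new_capacity_sectors = 1613215170

# Start sectors of the "Out of sync" lines quoted in this email.
starts = [
    1612817344, 1612818000, 1612867664, 1612869240,
    1612989568, 1613046616, 1613103760,
]

# Region that exists only without the explicit size limit.
tail_sectors = new_capacity_sectors - old_limit_sectors
tail_blocks = tail_sectors * SECTOR_BYTES / BLOCK_BYTES

all_in_tail = all(start >= old_limit_sectors for start in starts)

print(f"old limit: sector {old_limit_sectors}")
print(f"tail beyond old limit: {tail_sectors} sectors, about {tail_blocks:.0f} blocks")
print(f"all quoted out-of-sync ranges beyond old limit: {all_in_tail}")
```

Under these assumptions every quoted range starts beyond the old 769G 
boundary, and the size of that tail is of the same order as the 63225 
blocks verify reported.  That would be consistent with the newly 
exposed area never having been synchronised, but it is only an 
inference from the numbers.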

Then, at the end of the verify, the kernel panicked.

I am now worried that any time I run 'verify' this might happen again, 
which is why I am very interested in the opinion of DRBD as to whether 
the kernel panicked because of what I did, or whether there might be a 
fairly serious bug in 8.2.6.

Many kind regards, Ben Clewett

-----------------------------/proc/drbd-------------------------

version: 8.2.6 (api:88/proto:86-88)
GIT-hash: 3e69822d3bb4920a8c1bfdf7d647169eba7d2eb4 build by root at hp-tm-09, 2008-10-21 22:13:46
  0: cs:Connected st:Primary/Secondary ds:UpToDate/UpToDate C r---
     ns:5045036 nr:1401949 dw:6446985 dr:3951715 al:20356 bm:1594 lo:2 pe:0 ua:0 ap:2 oos:0

-----------------------------CONFIG------------------------------

common {
     net {
         max-buffers      40000;
         unplug-watermark 40000;
         max-epoch-size   16384;
         after-sb-0pri    disconnect;
         after-sb-1pri    disconnect;
         after-sb-2pri    disconnect;
         rr-conflict      disconnect;
     }
     syncer {
         rate             10M;
         al-extents       257;
         verify-alg       crc32c;
         cpu-mask           1;
     }
     startup {
         degr-wfc-timeout 120;
     }
     handlers {
         pri-on-incon-degr "echo o > /proc/sysrq-trigger ; halt -f";
         pri-lost-after-sb "echo o > /proc/sysrq-trigger ; halt -f";
         local-io-error   "echo o > /proc/sysrq-trigger ; halt -f";
         outdate-peer     "/usr/lib/heartbeat/drbd-peer-outdater -t 5";
         pri-lost         "echo dbms-04 pri-lost. Have a look at the log files. | mail -s 'DRBD Alert dbms-04' sysalerts at ........";
         split-brain      "echo dbms-04 split-brain. drbdadm -- --discard-my-data connect $DRBD_RESOURCE ? | mail -s 'dbms-04 DRBD Alert' sysalerts at .........";
         out-of-sync      "echo dbms-04 out-of-sync on resource $DRBD_RESOURCE. | mail -s 'dbms-04 DRBD Alert' sysalerts at ..........";
     }
}

resource dbms-04 {
     protocol               C;
     on hp-tm-09 {
         device           /dev/drbd0;
         disk             /dev/cciss/c0d0p4;
         address          192.168.95.18:7789;
         meta-disk        /dev/cciss/c0d0p3 [0];
     }
     on hp-tm-06 {
         device           /dev/drbd0;
         disk             /dev/cciss/c0d0p4;
         address          192.168.95.17:7788;
         meta-disk        /dev/cciss/c0d0p3 [0];
     }
     disk {
         on-io-error      detach;
     }
}

--------------------------------EMAIL---------------------------------

On Thu, Oct 16, 2008 at 12:04:19PM -0700, Jeffrey Froman wrote:
 > >
 > > I have a drbd resource that appears to be working correctly. However,
 > > when I run "drbdadm -d adjust <resource>" (without changing the
 > > working configuration), I get:
 > >
 > >     286744120s: out of range
 > >
 > > This number is the configured size of the drbd disk (see drbd.conf
 > > below), but:
 > >
 > > [root ~]# blockdev --getsize /dev/sdc1
 > > 286744122
 > >
 > > So why am I getting an "out of range" error for an otherwise seemingly
 > > working resource?
 > >
 > > Thank you,
 > > Jeffrey

you normally should not use that config parameter.
since we almost never do, we did not notice yet that there is a bug in
the 's' unit conversion.

simple fix: just DO NOT explicitly set a size.
it is only useful in very special cases if at all.
