Hello,

I'm seeing strange behavior with drbd 8.4.5: after a restart, one of my two partitions becomes diskless.

Steps to reproduce:

> $ sudo service drbd status
> drbd driver loaded OK; device status:
> version: 8.4.5 (api:1/proto:86-101)
> GIT-hash: 1d360bde0e095d495786eaeb2a1ac76888e4db96 build by phil at Build64R6, 2014-10-28 10:32:53
> m:res     cs          ro                 ds                     p  mounted                      fstype
> 0:main    Connected   Primary/Secondary  UpToDate/UpToDate      C  /home/www/storage            ext4
> ...       sync'ed:    0.1%               (1048376/1048540)M
> 1:backup  SyncSource  Primary/Secondary  UpToDate/Inconsistent  C  /home/www/backup.looplr.com  ext4

OK, backup is syncing after the same failure earlier. Now trying to restart the drbd service:

> $ sudo service drbd restart
> Stopping all DRBD resources:
> .
> Starting DRBD resources: [
> create res: backup main
> prepare disk: backup main
> adjust disk: backup:failed(apply-al:255) main
> adjust net: backup main
> ]
> .

And bam, the disk is diskless:

> $ sudo service drbd status
> drbd driver loaded OK; device status:
> version: 8.4.5 (api:1/proto:86-101)
> GIT-hash: 1d360bde0e095d495786eaeb2a1ac76888e4db96 build by phil at Build64R6, 2014-10-28 10:32:53
> m:res     cs         ro                   ds                     p  mounted  fstype
> 0:main    Connected  Secondary/Secondary  UpToDate/UpToDate      C
> 1:backup  Connected  Secondary/Secondary  Diskless/Inconsistent  C

And now the log (backup is drbd1):

> Jan 24 14:42:39 de kernel: block drbd0: role( Primary -> Secondary )
> Jan 24 14:42:39 de kernel: block drbd0: bitmap WRITE of 0 pages took 0 jiffies
> Jan 24 14:42:39 de kernel: block drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
> Jan 24 14:42:39 de kernel: drbd main: peer( Secondary -> Unknown ) conn( Connected -> Disconnecting ) pdsk( UpToDate -> DUnknown )
> Jan 24 14:42:39 de kernel: drbd main: asender terminated
> Jan 24 14:42:39 de kernel: drbd main: Terminating drbd_a_main
> Jan 24 14:42:39 de kernel: drbd main: Connection closed
> Jan 24 14:42:39 de kernel: drbd main: conn( Disconnecting -> StandAlone )
> Jan 24 14:42:39 de kernel: drbd main: receiver terminated
> Jan 24 14:42:39 de kernel: drbd main: Terminating drbd_r_main
> Jan 24 14:42:39 de kernel: block drbd0: disk( UpToDate -> Failed )
> Jan 24 14:42:39 de kernel: block drbd0: bitmap WRITE of 0 pages took 0 jiffies
> Jan 24 14:42:39 de kernel: block drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
> Jan 24 14:42:39 de kernel: block drbd0: disk( Failed -> Diskless )
> Jan 24 14:42:39 de kernel: block drbd0: drbd_bm_resize called with capacity == 0
> Jan 24 14:42:39 de kernel: drbd main: Terminating drbd_w_main
> Jan 24 14:42:39 de kernel: block drbd1: role( Primary -> Secondary )
> Jan 24 14:42:39 de kernel: drbd backup: peer( Secondary -> Unknown ) conn( SyncSource -> Disconnecting )
> Jan 24 14:42:39 de kernel: drbd backup: asender terminated
> Jan 24 14:42:39 de kernel: drbd backup: Terminating drbd_a_backup
> Jan 24 14:42:39 de kernel: drbd backup: Connection closed
> Jan 24 14:42:39 de kernel: drbd backup: conn( Disconnecting -> StandAlone )
> Jan 24 14:42:39 de kernel: drbd backup: receiver terminated
> Jan 24 14:42:39 de kernel: drbd backup: Terminating drbd_r_backup
> Jan 24 14:42:39 de kernel: block drbd1: disk( UpToDate -> Failed )
> Jan 24 14:42:39 de kernel: block drbd1: bitmap WRITE of 0 pages took 1 jiffies
> Jan 24 14:42:39 de kernel: block drbd1: 1024 GB (268383732 bits) marked out-of-sync by on disk bit-map.
> Jan 24 14:42:39 de kernel: block drbd1: disk( Failed -> Diskless )
> Jan 24 14:42:39 de kernel: block drbd1: drbd_bm_resize called with capacity == 0
> Jan 24 14:42:39 de kernel: drbd backup: Terminating drbd_w_backup
> Jan 24 14:42:39 de kernel: drbd: module cleanup done.
> Jan 24 14:42:39 de kernel: drbd: events: mcg drbd: 2
> Jan 24 14:42:39 de kernel: drbd: initialized. Version: 8.4.5 (api:1/proto:86-101)
> Jan 24 14:42:39 de kernel: drbd: GIT-hash: 1d360bde0e095d495786eaeb2a1ac76888e4db96 build by phil at Build64R6, 2014-10-28 10:32:53
> Jan 24 14:42:39 de kernel: drbd: registered as block device major 147
> Jan 24 14:42:39 de kernel: drbd main: Starting worker thread (from drbdsetup-84 [29262])
> Jan 24 14:42:39 de kernel: block drbd0: disk( Diskless -> Attaching )
> Jan 24 14:42:39 de kernel: drbd main: Method to ensure write ordering: flush
> Jan 24 14:42:39 de kernel: block drbd0: max BIO size = 327680
> Jan 24 14:42:39 de kernel: block drbd0: drbd_bm_resize called with capacity == 1825502744
> Jan 24 14:42:39 de kernel: block drbd0: resync bitmap: bits=228187843 words=3565436 pages=6964
> Jan 24 14:42:39 de kernel: block drbd0: size = 870 GB (912751372 KB)
> Jan 24 14:42:40 de kernel: block drbd0: recounting of set bits took additional 12 jiffies
> Jan 24 14:42:40 de kernel: block drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
> Jan 24 14:42:40 de kernel: block drbd0: disk( Attaching -> UpToDate )
> Jan 24 14:42:40 de kernel: block drbd0: attached to UUIDs C0027C4A2E905388:0000000000000000:2CE80910F236732B:2CE70910F236732B
> Jan 24 14:42:40 de kernel: drbd backup: Starting worker thread (from drbdsetup-84 [29265])
> Jan 24 14:42:40 de kernel: drbd backup: conn( StandAlone -> Unconnected )
> Jan 24 14:42:40 de kernel: drbd backup: Starting receiver thread (from drbd_w_backup [29267])
> Jan 24 14:42:40 de kernel: drbd backup: receiver (re)started
> Jan 24 14:42:40 de kernel: drbd backup: conn( Unconnected -> WFConnection )
> Jan 24 14:42:40 de kernel: drbd main: conn( StandAlone -> Unconnected )
> Jan 24 14:42:40 de kernel: drbd main: Starting receiver thread (from drbd_w_main [29264])
> Jan 24 14:42:40 de kernel: drbd main: receiver (re)started
> Jan 24 14:42:40 de kernel: drbd main: conn( Unconnected -> WFConnection )
> Jan 24 14:42:40 de kernel: drbd main: Handshake successful: Agreed network protocol version 101
> Jan 24 14:42:40 de kernel: drbd main: Agreed to support TRIM on protocol level
> Jan 24 14:42:40 de kernel: drbd main: conn( WFConnection -> WFReportParams )
> Jan 24 14:42:40 de kernel: drbd main: Starting asender thread (from drbd_r_main [29272])
> Jan 24 14:42:40 de kernel: drbd backup: Handshake successful: Agreed network protocol version 101
> Jan 24 14:42:40 de kernel: drbd backup: Agreed to support TRIM on protocol level
> Jan 24 14:42:40 de kernel: drbd backup: conn( WFConnection -> WFReportParams )
> Jan 24 14:42:40 de kernel: drbd backup: Starting asender thread (from drbd_r_backup [29269])
> Jan 24 14:42:40 de kernel: block drbd1: max BIO size = 4096
> Jan 24 14:42:40 de kernel: block drbd0: drbd_sync_handshake:
> Jan 24 14:42:40 de kernel: block drbd1: peer( Unknown -> Secondary ) conn( WFReportParams -> Connected ) pdsk( DUnknown -> Inconsistent )
> Jan 24 14:42:40 de kernel: block drbd0: self C0027C4A2E905388:0000000000000000:2CE80910F236732B:2CE70910F236732B bits:0 flags:0
> Jan 24 14:42:40 de kernel: block drbd0: peer C0027C4A2E905388:0000000000000000:2CE80910F236732A:2CE70910F236732B bits:0 flags:0
> Jan 24 14:42:40 de kernel: block drbd0: uuid_compare()=0 by rule 40
> Jan 24 14:42:40 de kernel: block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> Connected ) pdsk( DUnknown -> UpToDate )

From this log I don't understand why it fails to restart.

Now trying to restart the backup partition:

> $ sudo drbdadm down backup
> $ sudo drbdadm up backup
> strange bm_offset -65608 (expected: -65600)
> No valid meta data found
> Command 'drbdmeta 1 v08 /dev/sda3 internal apply-al' terminated with exit code 255

Somehow the metadata is not there anymore (?!). Note the warning saying "strange bm_offset -65608 (expected: -65600)".

OK, how about recreating the metadata and resyncing the full partition? Trying to wipe the old metadata shows that no metadata is present. OK, let's pretend it's not there:

> $ sudo drbdadm create-md backup
> strange bm_offset -65608 (expected: -65600)
> strange bm_offset -65608 (expected: -65600)
> md_offset 1099511623680
> al_offset 1099511590912
> bm_offset 1099478036480
>
> Found ext3 filesystem
> 1073709016 kB data area apparently used
> 1073709020 kB left usable by current configuration
>
> Even though it looks like this would place the new meta data into
> unused space, you still need to confirm, as this is only a guess.
>
> Do you want to proceed?
> [need to type 'yes' to confirm] yes
>
> initializing activity log
> NOT initializing bitmap
> Writing meta data...
> New drbd meta data block successfully created.
> success

The same warning appears twice, plus a possibly wrong filesystem detection (it is actually ext4) and an incorrect data usage detection (the data actually takes 533 GB of the 1 TB partition).

Wiping and creating the metadata again fixes the bm_offset warning until the next drbd service restart.

I'm running the latest CentOS 6.5, kernel 2.6.32-504.1.3.el6.x86_64. The disks are on hardware RAID 1.
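For reference, the offsets printed by create-md above can be cross-checked with a little arithmetic. The sketch below is only a sanity check, not anything drbdmeta itself runs: it assumes that the number in the "strange bm_offset" warning is a distance from md_offset counted in 512-byte sectors, and that the "Blocks" figure fdisk reports for /dev/sda3 is in 1 KiB units (ignoring the trailing "+"). The helper name rel_sectors is made up for illustration.

```python
SECTOR = 512  # bytes per sector (assumed unit of the bm_offset warning)

# Byte offsets copied from the create-md output.
md_offset = 1099511623680
al_offset = 1099511590912
bm_offset = 1099478036480


def rel_sectors(offset, base=md_offset):
    """Distance from the metadata block to `offset`, in 512-byte sectors."""
    return (offset - base) // SECTOR


al_rel = rel_sectors(al_offset)   # activity log, relative to md_offset
bm_rel = rel_sectors(bm_offset)   # bitmap, relative to md_offset
usable_kib = bm_offset // 1024    # space below the bitmap, in KiB

# /dev/sda3 per fdisk: 1073741824 blocks, assumed to be 1 KiB each.
partition_bytes = 1073741824 * 1024
md_from_end = partition_bytes - md_offset

print("al_offset relative to md_offset:", al_rel, "sectors")
print("bm_offset relative to md_offset:", bm_rel, "sectors")
print("usable below bitmap:", usable_kib, "KiB")
print("md_offset distance from partition end:", md_from_end, "bytes")
```

Under those assumptions the freshly created layout gives a relative bitmap offset of -65600 sectors, which is the "expected" value in the warning, and the usable size equals the 1073709020 kB that create-md reports. The rejected on-disk value of -65608 differs from that by 8 sectors, i.e. 4 KiB.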
> $ sudo fdisk -l
> Disk /dev/sda: 3000.0 GB, 3000034656256 bytes
> 255 heads, 63 sectors/track, 364733 cylinders
> Units = cylinders of 16065 * 512 = 8225280 bytes
> Sector size (logical/physical): 512 bytes / 512 bytes
> I/O size (minimum/optimal): 512 bytes / 512 bytes
> Disk identifier: 0x0005a176
>
>    Device Boot      Start         End      Blocks   Id  System
> /dev/sda1               1        1567    12582912+  83  Linux
> /dev/sda2            1567        1633      524288+  83  Linux
> /dev/sda3            1633      135307  1073741824+  83  Linux
> /dev/sda4          135307      364734  1842868224    f  W95 Ext'd (LBA)
> /dev/sda5          135308      251098   930086912+  83  Linux
> /dev/sda6          251098      364734   912779264   83  Linux

backup uses /dev/sda3 and main uses /dev/sda6.

Can someone give me a hint about what I'm doing wrong? I will provide any additional info on request.

Thanks,
Alex