[DRBD-user] On-line verification problem

Marco Gaertner marco at gaertners.org
Thu May 21 23:30:37 CEST 2009

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


David,

	Your analysis looks really impressive.  I have an issue that you may
have an insight on.

	I am running 8.3.1 on a geographically dispersed cluster.  In order
to compress the data exchange, and limit the bandwidth, I am using OpenVPN
to connect both machines.  The nature of OpenVPN in using UDP makes it so
the connection is dropped once in a while, though recovered quickly.  The
usage of the cluster is relatively sparse, so it affords me to do the above
with no issues.

	When I am doing a drbdadm verify all and the connection drops, the
verification also stops.  I wish it would continue where it last was.  So,
effectively, I can never verify the whole disk.  Now, I am also using
data-integrity-alg crc32c, so I am not so concerned.  However, do you know
if this is meant to be this way?  Do you have any suggestions?


Regards,
-Marco Gaertner

----------------------------------------------------------------------
Date: Thu, 21 May 2009 09:33:43 -0600
From: David.Livingstone at cn.ca
Subject: Re: [DRBD-user] On-line verification problem
To: drbd-user at lists.linbit.com

It seems like the "out-of-sync" log messages are output because the data
in those sectors was changing during the verification.

Here's how I came up with that...

For a /var/log/messages statement like this (notice the time stamp):
May 20 14:30:19 wimpas2 kernel: drbd0: Out of sync: start=137739248,
size=8 (sectors)

We can run a "dd" command to peek at what data it's talking about on
both servers:
(on wimpas1):  sudo dd if=/dev/mapper/VolGroup01-LogVol00 iflag=direct
bs=512 skip=137739248 count=8 of=/tmp/wimpas1-drbd-oos

(on wimpas2):  sudo dd if=/dev/mapper/VolGroup01-LogVol00 iflag=direct
bs=512 skip=137739248 count=8 of=/tmp/wimpas2-drbd-oos

Comparing the two output files using "diff" showed they were the same,
so that indicates replication worked properly.

Looking inside the files showed they were polling logs with timestamps
from the same time that the /var/log/messages statement was output:

eg) (snipped for brevity, notice the time stamps 20th day, 14:30:16 -
14:30:22)
time:20143016 REC fd:21 ff1216060100ef57000000000000f78f
time:20143016 TRA fd:21 12ff14000100e0a6 size:8 dur:0 OK
10ff803301001d190000030006001805416828285ac3075014070a3764464646000000191904
0505050505005701356014010200000000000a00015eba 
size:61 dur:0 OK
time:20143017 REC fd:21 ff0f160601002763000000000000f78f
time:20143017 TRA fd:21 0fff1400010097c9 size:8 dur:0 OK
12ff80330100154f0001030006001805416828285ac3075014070a3764464646000000191904
0505050505005300975014010200000000000a00048d21 
size:61 dur:0 OK
time:20143017 REC fd:21 ff1016060100f9df000000000000f78f
time:20143017 TRA fd:21 10ff14000100e8f0 size:8 dur:0 OK
0fff8033010062200000030006001805416828285ac3075014070a3764464646000000191904
0505050505005700974014010200000000000a000173b8 
size:61 dur:0 OK
time:20143019 REC fd:21 ff1216060100ef57000000000000f78f
time:20143019 TRA fd:21 12ff14000100e0a6 size:8 dur:0 OK
10ff803301001d190000030006001805416828285ac3075014070a3764464646000000191904
0505050505005701356014010200000000000a00015eba 
size:61 dur:0 OK
time:20143019 REC fd:21 ff0f160601002763000000000000f78f
time:20143019 TRA fd:21 0fff1400010097c9 size:8 dur:0 OK
12ff80330100154f0001030006001805416828285ac3075014070a3764464646000000191904
0505050505005300975014010200000000000a00048d21 
size:61 dur:0 OK
time:20143020 REC fd:21 ff1016060100f9df000000000000f78f
time:20143020 TRA fd:21 10ff14000100e8f0 size:8 dur:0 OK
0fff8033010062200000030006001805416828285ac3075014070a3764464646000000191904
0505050505005700974014010200000000000a000173b8 
size:61 dur:0 OK
time:20143021 REC fd:21 ff1216060100ef57000000000000f78f
time:20143021 TRA fd:21 12ff14000100e0a6 size:8 dur:0 OK
10ff803301001d190000030006001805416828285ac3075014070a3764464646000000191904
0505050505005701356014010200000000000a00015eba 
size:61 dur:0 OK
time:20143022 REC fd:21 ff0f160601002763000000000000f78f
time:20143022 TRA fd:21 0fff1400010097c9 size:8 dur:0 OK


So, the theory right now is that the "out-of-sync" messages were because
the data in those sectors was changing during the verification and the
"0 KB (0 bits) marked out-of-sync" means DRBD realized that. 



David Livingstone/LIVING03/CNR/CA wrote on 2009/05/20 16:44:26:

> I have a two-node cluster(wimpas1/2) running drbd 8.3.1 which
> I just enabled for online verification(verify-alg crc32c).
> When I ran it for the first time today I received a number of
> "Out-of-sync" messages which I subsequently corrected by
> disconnecting and connecting the resource. After some
> successful failover tests(running latest heartbeat/pacemaker)
> I then ran "drbdadm verify r0" again and to my surprise found
> more "Out-of-sync" messages.
> 
> The drbd link is good with no reported errors.
> Ideas ? While the verify is running changes are
> being made to the disk - do I really have errors ?
> From the logs below does the "0 KB (0 bits) marked out-of-sync"
> really mean I do not have any errors ?
> 
> Thanks

> 
> Background :
> - 2 x Proliant DL380G5 with SAS drives and
> Raid 6 /dev/mapper/VolGroup01-LogVol00 ext3 drbd partition.
> - Drbd 8.3.1 built from source.
> - Latest RH 5.3 with kernel 2.6.18-128.1.6.el5PAE.
> - drbd link is over 10Gb nic : HP NC510C(NetXen) using
> nx_nic-3.4.337-1 and nx_lsa-3.4.337-1. The offload
> feature(nx_lsa) is not used.
> 
> Given below :
> 1. First test of verify.
> 2. Second test of verify.
> 3. drbd.conf
> 
> 1. First test of verify :

> - test verify :
> - [root at wimpas2 etc]# drbd-overview
> 0:r0  Connected Primary/Secondary UpToDate/UpToDate C r---- /drbd 
> ext3 270G 106G 151G 42%
> - Run the verify on resource r0 :
> - [root at wimpas2 etc]# drbdadm verify r0
> - idle goes from 88% to 79%
> - drbd-overview shows verify is ongoing however it does not show the
> progress of the verify :
> - [root at wimpas2 etc]# drbd-overview
> 0:r0  VerifyS Primary/Secondary UpToDate/UpToDate C r---- /drbd ext3
> 270G 106G 151G 42%
> - To show the progress we need to "cat /proc/drbd" :
> [root at wimpas2 etc]# cat /proc/drbd
> version: 8.3.1 (api:88/proto:86-89)
> GIT-hash: fd40f4a8f9104941537d1afc8521e584a6d3003c build by 
> root at wimpas2, 2009-04-16 11:32:59
> 0: cs:VerifyS ro:Primary/Secondary ds:UpToDate/UpToDate C r----
> ns:1602753144 nr:408740 dw:1603218784 dr:204343509 al:14977667 bm:
> 1160 lo:139 pe:126 ua:638 ap:25 ep:1 wo:d oos:0
> 42%      30793593/71661420
> [root at wimpas2 etc]# cat /proc/drbd
> version: 8.3.1 (api:88/proto:86-89)
> GIT-hash: fd40f4a8f9104941537d1afc8521e584a6d3003c build by 
> root at wimpas2, 2009-04-16 11:32:59
> 0: cs:VerifyS ro:Primary/Secondary ds:UpToDate/UpToDate C r----
> ns:1605444232 nr:408740 dw:1605909876 dr:263949733 al:15006416 bm:
> 1160 lo:215 pe:499 ua:213 ap:2 ep:1 wo:d oos:0
> 63%      45596210/71661420
> [root at wimpas2 etc]# cat /proc/drbd
> version: 8.3.1 (api:88/proto:86-89)
> GIT-hash: fd40f4a8f9104941537d1afc8521e584a6d3003c build by 
> root at wimpas2, 2009-04-16 11:32:59
> 0: cs:VerifyS ro:Primary/Secondary ds:UpToDate/UpToDate C r----
> ns:1609901224 nr:408740 dw:1610366868 dr:364980653 al:15058780 bm:
> 1160 lo:151 pe:64 ua:146 ap:5 ep:1 wo:d oos:0
> 98%      70637034/71661420
> [root at wimpas2 etc]# cat /proc/drbd
> version: 8.3.1 (api:88/proto:86-89)
> GIT-hash: fd40f4a8f9104941537d1afc8521e584a6d3003c build by 
> root at wimpas2, 2009-04-16 11:32:59
> 0: cs:VerifyS ro:Primary/Secondary ds:UpToDate/UpToDate C r----
> ns:1610031364 nr:408740 dw:1610497008 dr:368031373 al:15060068 bm:
> 1160 lo:160 pe:112 ua:153 ap:7 ep:1 wo:d oos:0
> 99%      71392390/71661420
> [root at wimpas2 etc]# cat /proc/drbd
> version: 8.3.1 (api:88/proto:86-89)
> GIT-hash: fd40f4a8f9104941537d1afc8521e584a6d3003c build by 
> root at wimpas2, 2009-04-16 11:32:59
> 0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r----
> ns:1610156492 nr:408740 dw:1610622136 dr:369146845 al:15061320 bm:
> 1160 lo:4 pe:0 ua:0 ap:4 ep:1 wo:d oos:0
> [root at wimpas2 etc]# drbd-overview
> 0:r0  Connected Primary/Secondary UpToDate/UpToDate C r---- /drbd 
> ext3 270G 108G 148G 43%
> [root at wimpas2 etc]#

> - from messages log :

> May 20 08:36:20 wimpas1 kernel: drbd0: conn( Connected -> VerifyT )
> May 20 08:51:42 wimpas1 kernel: drbd0: Out of sync: start=53862808, 
> size=8 (sectors)
> May 20 08:51:47 wimpas1 kernel: drbd0: Out of sync: start=54175416, 
> size=8 (sectors)
> May 20 08:51:50 wimpas1 kernel: drbd0: Out of sync: start=54300672, 
> size=8 (sectors)
> May 20 08:52:05 wimpas1 kernel: drbd0: Out of sync: start=55233848, 
> size=8 (sectors)
> May 20 09:16:22 wimpas1 kernel: drbd0: Out of sync: start=140359968,
> size=8 (sectors)
> May 20 11:22:41 wimpas1 kernel: drbd0: Online verify  done (total 
> 9981 sec; paused 0 sec; 28716 K/sec)
> May 20 11:22:41 wimpas1 kernel: drbd0: conn( VerifyT -> Connected )
> May 20 11:22:41 wimpas1 kernel: drbd0: Writing the whole bitmap, due
> to failed kmalloc
> May 20 11:22:41 wimpas1 kernel: drbd0: 0 KB (0 bits) marked out-of-
> sync by on disk bit-map.

> - now on wimpas2 do a :
> - drbdadm disconnect r0
> - drbdadm connect r0
> This should correct out-of-sync blocks :
> 
> - on wimpas2 :

> [root at wimpas2 etc]# drbdadm disconnect r0
> [root at wimpas2 etc]# drbdadm connect r0

> - From messages log on wimpas1 :

> May 20 11:25:49 wimpas1 kernel: drbd0: peer( Primary -> Unknown ) 
> conn( Connected -> TearDown ) pdsk( UpToDate -> DUnknown )
> May 20 11:25:49 wimpas1 kernel: drbd0: asender terminated
> May 20 11:25:49 wimpas1 kernel: drbd0: Terminating asender thread
> May 20 11:25:49 wimpas1 kernel: drbd0: Connection closed
> May 20 11:25:49 wimpas1 kernel: drbd0: conn( TearDown -> Unconnected )
> May 20 11:25:49 wimpas1 kernel: drbd0: receiver terminated
> May 20 11:25:49 wimpas1 kernel: drbd0: Restarting receiver thread
> May 20 11:25:49 wimpas1 kernel: drbd0: receiver (re)started
> May 20 11:25:49 wimpas1 kernel: drbd0: conn( Unconnected -> WFConnection 
)
> May 20 11:25:56 wimpas1 kernel: drbd0: Handshake successful: Agreed 
> network protocol version 89
> May 20 11:25:56 wimpas1 kernel: drbd0: conn( WFConnection -> 
WFReportParams )
> May 20 11:25:56 wimpas1 kernel: drbd0: Starting asender thread (from
> drbd0_receiver [4132])
> May 20 11:25:56 wimpas1 kernel: drbd0: data-integrity-alg: <not-used>
> May 20 11:25:56 wimpas1 kernel: drbd0: drbd_sync_handshake:
> May 20 11:25:56 wimpas1 kernel: drbd0: self A33677DBB2985460:
> 0000000000000000:6D2B4B9CDDBDEFD6:37F8B9BC0B605BA7 bits:0 flags:0
> May 20 11:25:56 wimpas1 kernel: drbd0: peer 243B1079C9753DEB:
> A33677DBB2985461:6D2B4B9CDDBDEFD6:37F8B9BC0B605BA7 bits:1319 flags:0
> May 20 11:25:56 wimpas1 kernel: drbd0: uuid_compare()=-1 by rule 5
> May 20 11:25:56 wimpas1 kernel: drbd0: peer( Unknown -> Primary ) 
> conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
> May 20 11:25:56 wimpas1 kernel: drbd0: conn( WFBitMapT -> WFSyncUUID )
> May 20 11:25:56 wimpas1 kernel: drbd0: helper command: /sbin/drbdadm
> before-resync-target minor-0
> May 20 11:25:56 wimpas1 kernel: drbd0: helper command: /sbin/drbdadm
> before-resync-target minor-0 exit code 0 (0x0)
> May 20 11:25:56 wimpas1 kernel: drbd0: conn( WFSyncUUID -> 
> SyncTarget ) disk( UpToDate -> Inconsistent )
> May 20 11:25:56 wimpas1 kernel: drbd0: Began resync as SyncTarget 
> (will sync 5276 KB [1319 bits set]).
> May 20 11:25:56 wimpas1 kernel: drbd0: Resync done (total 1 sec; 
> paused 0 sec; 5276 K/sec)
> May 20 11:25:56 wimpas1 kernel: drbd0: conn( SyncTarget -> Connected
> ) disk( Inconsistent -> UpToDate )
> May 20 11:25:56 wimpas1 kernel: drbd0: helper command: /sbin/drbdadm
> after-resync-target minor-0
> May 20 11:25:56 wimpas1 kernel: drbd0: helper command: /sbin/drbdadm
> after-resync-target minor-0 exit code 0 (0x0)

> - From messages log on wimpas2 :

> May 20 11:25:56 wimpas2 kernel: drbd0: conn( StandAlone -> Unconnected )
> May 20 11:25:56 wimpas2 kernel: drbd0: Starting receiver thread 
> (from drbd0_worker [4143])
> May 20 11:25:56 wimpas2 kernel: drbd0: receiver (re)started
> May 20 11:25:56 wimpas2 kernel: drbd0: conn( Unconnected -> WFConnection 
)
> May 20 11:25:56 wimpas2 kernel: drbd0: Handshake successful: Agreed 
> network protocol version 89
> May 20 11:25:56 wimpas2 kernel: drbd0: conn( WFConnection -> 
WFReportParams )
> May 20 11:25:56 wimpas2 kernel: drbd0: Starting asender thread (from
> drbd0_receiver [384])
> May 20 11:25:56 wimpas2 kernel: drbd0: data-integrity-alg: <not-used>
> May 20 11:25:56 wimpas2 kernel: drbd0: drbd_sync_handshake:
> May 20 11:25:56 wimpas2 kernel: drbd0: self 243B1079C9753DEB:
> A33677DBB2985461:6D2B4B9CDDBDEFD6:37F8B9BC0B605BA7 bits:1319 flags:0
> May 20 11:25:56 wimpas2 kernel: drbd0: peer A33677DBB2985460:
> 0000000000000000:6D2B4B9CDDBDEFD6:37F8B9BC0B605BA7 bits:0 flags:0
> May 20 11:25:56 wimpas2 kernel: drbd0: uuid_compare()=1 by rule 7
> May 20 11:25:56 wimpas2 kernel: drbd0: peer( Unknown -> Secondary ) 
> conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> UpToDate )
> May 20 11:25:56 wimpas2 kernel: drbd0: conn( WFBitMapS -> SyncSource
> ) pdsk( UpToDate -> Inconsistent )
> May 20 11:25:56 wimpas2 kernel: drbd0: Began resync as SyncSource 
> (will sync 5276 KB [1319 bits set]).
> May 20 11:25:56 wimpas2 kernel: drbd0: Resync done (total 1 sec; 
> paused 0 sec; 5276 K/sec)
> May 20 11:25:56 wimpas2 kernel: drbd0: conn( SyncSource -> Connected
> ) pdsk( Inconsistent -> UpToDate )

> 
> 2. Second test of verify.

> - [root at wimpas2]# drbdadm verify r0

> - on wimpas we are still getting Out of sync :
> [root at wimpas2 ~]# cat /proc/drbd
> version: 8.3.1 (api:88/proto:86-89)
> GIT-hash: fd40f4a8f9104941537d1afc8521e584a6d3003c build by 
> root at wimpas2, 2009-04-16 11:32:59
> 0: cs:VerifyS ro:Primary/Secondary ds:UpToDate/UpToDate C r---b
> ns:8823180 nr:520920 dw:9344100 dr:213062745 al:100758 bm:591 lo:136
> pe:102 ua:136 ap:0 ep:1 wo:d oos:0
> 74%      53161685/71661420
> [root at wimpas2 ~]# cat /proc/drbd
> version: 8.3.1 (api:88/proto:86-89)
> GIT-hash: fd40f4a8f9104941537d1afc8521e584a6d3003c build by 
> root at wimpas2, 2009-04-16 11:32:59
> 0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r----
> ns:11781808 nr:520920 dw:12302728 dr:287072981 al:135343 bm:591 lo:0
> pe:0 ua:0 ap:0 ep:1 wo:d oos:0
> 
> - from messages log :

> May 20 13:50:53 wimpas1 kernel: drbd0: conn( Connected -> VerifyT )
> May 20 14:30:19 wimpas1 kernel: drbd0: Out of sync: start=137739248,
> size=8 (sectors)
> May 20 14:35:37 wimpas1 kernel: drbd0: Out of sync: start=156313064,
> size=8 (sectors)
> May 20 14:39:00 wimpas1 kernel: drbd0: Out of sync: start=168175112,
> size=8 (sectors)
> May 20 14:39:32 wimpas1 kernel: drbd0: Out of sync: start=170062360,
> size=8 (sectors)
> May 20 14:45:31 wimpas1 kernel: drbd0: Out of sync: start=190982904,
> size=8 (sectors)
> May 20 14:45:45 wimpas1 kernel: drbd0: Out of sync: start=191782000,
> size=8 (sectors)
> May 20 14:45:50 wimpas1 kernel: drbd0: Out of sync: start=192029480,
> size=8 (sectors)
> May 20 14:47:07 wimpas1 kernel: drbd0: Out of sync: start=196547400,
> size=8 (sectors)
> May 20 14:56:18 wimpas1 kernel: drbd0: Out of sync: start=228776888,
> size=8 (sectors)
> May 20 14:56:22 wimpas1 kernel: drbd0: Out of sync: start=229022968,
> size=8 (sectors)
> May 20 14:56:22 wimpas1 kernel: drbd0: Out of sync: start=229047480,
> size=8 (sectors)
> May 20 14:56:24 wimpas1 kernel: drbd0: Out of sync: start=229145752,
> size=8 (sectors)
> May 20 14:56:25 wimpas1 kernel: drbd0: Out of sync: start=229198968,
> size=8 (sectors)
> May 20 14:56:26 wimpas1 kernel: drbd0: Out of sync: start=229248104,
> size=8 (sectors)
> May 20 14:56:27 wimpas1 kernel: drbd0: Out of sync: start=229289008,
> size=8 (sectors)
> May 20 16:37:54 wimpas1 kernel: drbd0: Online verify  done (total 
> 10021 sec; paused 0 sec; 28604 K/sec)
> May 20 16:37:54 wimpas1 kernel: drbd0: conn( VerifyT -> Connected )
> May 20 16:37:54 wimpas1 kernel: drbd0: Writing the whole bitmap, due
> to failed kmalloc
> May 20 16:37:54 wimpas1 kernel: drbd0: 0 KB (0 bits) marked out-of-
> sync by on disk bit-map.

> 
> 3. drbd.conf

> global {
> minor-count 1;
> }

> resource r0 {
> protocol C;

> on wimpas1 {
> device /dev/drbd0; # The name of our drbd device.
> disk /dev/mapper/VolGroup01-LogVol00;    # Partition we wish drbd to 
use.
> address 192.168.36.129:7788; # node0 IP address and port number.
> meta-disk internal; # Stores meta-data in lower portion of hda5.
> }

> on wimpas2 {
> device /dev/drbd0; # Our drbd device, must match node0.
> disk /dev/mapper/VolGroup01-LogVol00;    # Partition we wish drbd to 
use.
> address 192.168.36.130:7788; # node0 IP address and port number.
> meta-disk internal; # Stores meta-data in lower portion of hda5.
> }

> disk {
> on-io-error detach; # What to do when the lower level device errors.
> }

> net {
> max-buffers 2048; #datablock buffers used before writing to disk.
> ko-count 4; # Peer is dead if this count is exceeded.
> #on-disconnect reconnect; # Peer disconnected, try to reconnect.
> }

> syncer {
> rate 29M;
> #rate 143M; # Used for first sync
> #group 1;  # Used for grouping resources, parallel sync.
> al-extents 257; # Must be prime, number of active sets.
> verify-alg crc32c;
> }

> startup {
> wfc-timeout 120; # drbd init script will wait 2 minutes - 0 is 
indefinite.
> degr-wfc-timeout 120; # 2 minutes.
> }
> } # End of resource
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
http://lists.linbit.com/pipermail/drbd-user/attachments/20090521/b8b904ec/at
tachment.htm 

------------------------------

_______________________________________________
drbd-user mailing list
drbd-user at lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-user


End of drbd-user Digest, Vol 58, Issue 26
*****************************************




More information about the drbd-user mailing list