[DRBD-user] files corrupted on secondary

Florin Cazacu florinc at reecemarketing.com
Fri Sep 1 14:43:13 CEST 2006


Hello, I run drbd-0.7.20 on two Dell PowerEdge 2650 servers with RAID 5,
Linux 2.4 and an ext3 fs (I use 2.4 because one year ago it looked like
the only driver/kernel combination that would work correctly with my
PERC 3/Di controller). Tonight I had a problem with an unavailable
service. The secondary tried to take over, but the filesystem on it was
corrupted. My failover script tries to make the primary secondary, then
the secondary is made primary. It looks like it did that, as I found the
primary/secondary roles switched, the state was Connected, and both
nodes were consistent. The primary machine was fine; the script was
triggered by an Apache restart (I'll work on it).
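
A role switch like the one described above could be guarded by a check
of /proc/drbd before it runs. The following is only a minimal sketch,
not the script used on these machines. The function name
drbd_ready_for_switch is hypothetical, and the status line format
(" 0: cs:Connected st:Primary/Secondary ld:Consistent") is an assumption
about what drbd 0.7 prints. It can only confirm what DRBD itself
reports, so it would not catch a case where the status looks clean but
the data differs.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the status text on stdin shows
# the device with minor number $1 as connected and locally consistent.
# Assumed line format: " 0: cs:Connected st:Primary/Secondary ld:Consistent"
drbd_ready_for_switch() {
    minor="$1"
    awk -v m="$minor:" '
        # The first field of a device line is the minor number plus ":".
        $1 == m {
            found = 1
            ok_cs = ok_ld = 0
            for (i = 2; i <= NF; i++) {
                if ($i == "cs:Connected")  ok_cs = 1
                if ($i == "ld:Consistent") ok_ld = 1
            }
        }
        # Exit status 0 only when the device was seen and both flags matched.
        END { exit !(found && ok_cs && ok_ld) }
    '
}

# Possible use at the top of a failover script:
#   drbd_ready_for_switch 0 < /proc/drbd || { echo "drbd0 not ready" >&2; exit 1; }
```

Reading the status from stdin keeps the check easy to try against a
saved copy of /proc/drbd without touching a live device.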


After I saw that the fs was impossible to use, I disconnected the
devices, made the broken node primary again and ran fsck. It found 4
broken files, but other than that, the filesystem could be used.

These nodes have been in service for about 2 years. I have had no
problems using drbd with ext3 since drbd-0.7 was released. I always
updated drbd by connecting 2 different versions of drbd (same protocol
though).

The last time I switched the primary/secondary was 2 weeks ago, when I
had a full rsync. After that full rsync I switched them one more time,
and they ended up in the configuration they were in before the crash.

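I configure the drbd device manually:
# Replication and resync settings
PROTOCOL="C"
RATE="400M"
# DRBD device, lower-level data partition and metadata partition
DEVICE="/dev/drbd0"
DISK="/dev/sda5"
META="/dev/sda6"
# Replication link addresses (this node, then the peer)
LOCAL_ADDRESS="192.168.3.2"
REMOTE_ADDRESS="192.168.3.1"
DRBDSETUP="/sbin/drbdsetup"

# Attach the data partition, with metadata on $META at index 0
"$DRBDSETUP" "$DEVICE" disk "$DISK" "$META" 0
# Configure the link to the peer
"$DRBDSETUP" "$DEVICE" net "$LOCAL_ADDRESS" "$REMOTE_ADDRESS" "$PROTOCOL"
# Set the resync rate
"$DRBDSETUP" "$DEVICE" syncer --rate "$RATE"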

On the machine with the unusable fs I ran all the checks available on
the Dell utility partition. Memory and disks are fine. I'm planning to
do a backup, then a full rsync, and then run full checks on the machine
that used to be primary. If that machine is good too, I have really run
out of options to identify the problem. I guess I will test the latest
drbd with the latest 2.6 kernel.
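
Before resyncing, it may help to find out where the two copies actually
differ. The following is a minimal sketch under stated assumptions: the
function name chunk_sums and the 1 MiB default chunk size are
illustrative choices, md5sum is assumed to be installed, and the
lower-level device is assumed not to be written to while it is read
(otherwise the sums are taken from a moving target). It prints one
checksum per chunk so that the output of both nodes can be compared
with diff.

```shell
#!/bin/sh
# Hypothetical helper: print "<chunk index> <md5>" for each chunk of $1.
# $2 is the chunk size in bytes (default 1 MiB, an arbitrary choice).
chunk_sums() {
    src="$1"
    bs="${2:-1048576}"
    tmp=$(mktemp) || return 1
    i=0
    while :; do
        # Copy chunk number i into the temp file; stop if dd fails.
        dd if="$src" of="$tmp" bs="$bs" skip="$i" count=1 2>/dev/null || break
        # An empty chunk means we have read past the end of the input.
        [ -s "$tmp" ] || break
        echo "$i $(md5sum < "$tmp" | cut -d' ' -f1)"
        i=$((i + 1))
    done
    rm -f "$tmp"
}

# Possible use, run on each node and then compared with diff:
#   chunk_sums /dev/sda5 > /tmp/sums.$(hostname)
```

Chunks whose sums differ between the nodes narrow the damage down to
specific regions of the partition, which can then be mapped back to
files or compared in more detail.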

This is what I have in the logs for this week, on the machine that used
to be primary. Please note that I concatenated the drbd messages from
syslog with the ones from messages. 19:35 is the time of the failover
switch; the last disconnect was done by me, manually.

Thank you in advance for any suggestion on what could possibly create
this situation (/proc/drbd reports that the machines are both
consistent/connected, but the filesystem is different).

Aug 25 07:07:33 dell1 kernel: drbd0: drbd0_receiver [32273]: cstate SyncSource --> Connected
Aug 25 07:07:33 dell1 kernel: drbd0: meta connection shut down by peer.
Aug 25 07:07:33 dell1 kernel: drbd0: short read expecting header on sock: r=0
Aug 25 07:07:38 dell1 kernel: drbd0: sock was shut down by peer
Aug 25 07:07:38 dell1 kernel: drbd0: drbd0_asender [24289]: cstate Connected --> NetworkFailure
Aug 25 07:07:38 dell1 kernel: drbd0: asender terminated
Aug 25 07:07:38 dell1 kernel: drbd0: drbd0_receiver [32273]: cstate NetworkFailure --> BrokenPipe
Aug 25 07:07:38 dell1 kernel: drbd0: worker terminated
Aug 25 07:07:38 dell1 kernel: drbd0: drbd0_receiver [32273]: cstate BrokenPipe --> Unconnected
Aug 25 07:07:38 dell1 kernel: drbd0: Connection lost.
Aug 25 07:07:38 dell1 kernel: drbd0: drbd0_receiver [32273]: cstate Unconnected --> WFConnection
Aug 25 07:07:38 dell1 kernel: drbd0: drbd0_receiver [32273]: cstate WFConnection --> WFReportParams
Aug 25 07:07:38 dell1 kernel: drbd0: meta connection shut down by peer.
Aug 25 07:07:38 dell1 kernel: drbd0: short read expecting header on sock: r=0
Aug 25 07:07:39 dell1 kernel: drbd0: Handshake successful: DRBD Network Protocol version 74
Aug 25 07:07:39 dell1 kernel: drbd0: Connection established.
Aug 25 07:07:39 dell1 kernel: drbd0: I am(P): 1:00000002:00000001:0000002d:00000012:10
Aug 25 07:07:39 dell1 kernel: drbd0: Peer(S): 1:00000002:00000001:0000002c:00000012:01
Aug 25 07:07:39 dell1 kernel: drbd0: drbd0_receiver [32273]: cstate WFReportParams --> WFBitMapS
Aug 25 07:07:39 dell1 kernel: drbd0: Primary/Unknown --> Primary/Secondary
Aug 25 07:07:39 dell1 kernel: drbd0: drbd0_receiver [32273]: cstate WFBitMapS --> SyncSource
Aug 25 07:07:39 dell1 kernel: drbd0: Resync started as SyncSource (need to sync 0 KB [0 bits set]).
Aug 25 07:07:39 dell1 kernel: drbd0: Resync done (total 1 sec; paused 0 sec; 0 K/sec)
Aug 25 07:07:39 dell1 kernel: drbd0: drbd0_receiver [32273]: cstate SyncSource --> Connected
Aug 25 07:07:48 dell1 kernel: drbd0: [kupdated/9] sock_sendmsg time expired, ko = 4294967295
Aug 25 07:07:51 dell1 kernel: drbd0: drbd0_asender [24292]: cstate Connected --> NetworkFailure
Aug 25 07:07:51 dell1 kernel: drbd0: asender terminated
Aug 25 07:07:51 dell1 kernel: drbd0: drbd0_receiver [32273]: cstate NetworkFailure --> BrokenPipe
Aug 25 07:07:51 dell1 kernel: drbd0: worker terminated
Aug 25 07:07:51 dell1 kernel: drbd0: drbd0_receiver [32273]: cstate BrokenPipe --> Unconnected
Aug 25 07:07:51 dell1 kernel: drbd0: Connection lost.
Aug 25 07:07:51 dell1 kernel: drbd0: drbd0_receiver [32273]: cstate Unconnected --> WFConnection
Aug 25 07:07:51 dell1 kernel: drbd0: [kupdated/9] sock_sendmsg time expired, ko = 4294967294
Aug 25 07:07:51 dell1 kernel: drbd0: PingAck did not arrive in time.
Aug 25 07:07:51 dell1 kernel: drbd0: short read expecting header on sock: r=-512
Aug 25 07:07:51 dell1 kernel: drbd0: _drbd_send_page: size=4096 len=240 sent=-4
Aug 25 07:07:51 dell1 kernel: drbd0: short sent UnplugRemote size=8 sent=-1001
Aug 25 07:23:02 dell1 kernel: drbd0: drbd0_receiver [32273]: cstate WFConnection --> WFReportParams
Aug 25 07:23:02 dell1 kernel: drbd0: Handshake successful: DRBD Network Protocol version 74
Aug 25 07:23:02 dell1 kernel: drbd0: Connection established.
Aug 25 07:23:02 dell1 kernel: drbd0: I am(P): 1:00000002:00000001:0000002e:00000012:10
Aug 25 07:23:02 dell1 kernel: drbd0: Peer(S): 1:00000002:00000001:0000002d:00000012:01
Aug 25 07:23:02 dell1 kernel: drbd0: drbd0_receiver [32273]: cstate WFReportParams --> WFBitMapS
Aug 25 07:23:02 dell1 kernel: drbd0: Primary/Unknown --> Primary/Secondary
Aug 25 07:23:02 dell1 kernel: drbd0: drbd0_receiver [32273]: cstate WFBitMapS --> SyncSource
Aug 25 07:23:02 dell1 kernel: drbd0: Resync started as SyncSource (need to sync 39944 KB [9986 bits set]).
Aug 25 07:23:07 dell1 kernel: drbd0: Resync done (total 4 sec; paused 0 sec; 9984 K/sec)
Aug 25 07:23:07 dell1 kernel: drbd0: drbd0_worker [24297]: cstate SyncSource --> Connected
Aug 31 19:35:49 dell1 kernel: drbd0: Primary/Secondary --> Secondary/Secondary
Aug 31 19:36:19 dell1 kernel: drbd0: Secondary/Secondary --> Secondary/Primary
Aug 31 19:48:11 dell1 kernel: drbd0: Secondary/Primary --> Secondary/Secondary
Aug 31 19:53:17 dell1 kernel: drbd0: Secondary/Secondary --> Secondary/Primary
Aug 31 19:56:26 dell1 kernel: drbd0: Not in Primary state, no IO requests allowed
Aug 31 20:38:07 dell1 kernel: drbd0: sock was shut down by peer
Aug 31 20:38:07 dell1 kernel: drbd0: drbd0_receiver [32273]: cstate Connected --> BrokenPipe
Aug 31 20:38:07 dell1 kernel: drbd0: worker terminated
Aug 31 20:38:07 dell1 kernel: drbd0: asender terminated
Aug 31 20:38:07 dell1 kernel: drbd0: drbd0_receiver [32273]: cstate BrokenPipe --> Unconnected
Aug 31 20:38:07 dell1 kernel: drbd0: Connection lost.
Aug 31 20:38:07 dell1 kernel: drbd0: drbd0_receiver [32273]: cstate Unconnected --> WFConnection
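
The state changes in a log like the one above are easier to follow when
they are filtered out from the rest. The following is a small sketch
that joins any lines a mail client may have wrapped and prints only the
connection-state and role transitions. The function name
drbd_transitions is hypothetical, and it assumes that every real log
line starts with a syslog timestamp such as "Aug 25 07:07:33".

```shell
#!/bin/sh
# Hypothetical helper: read a syslog excerpt on stdin, rejoin wrapped
# lines, and print only the lines that contain a "-->" transition.
drbd_transitions() {
    awk '
        function emit(r) {
            gsub(/ +/, " ", r)      # collapse spaces left over from joining
            sub(/ +$/, "", r)       # drop trailing spaces
            if (r ~ / --> /) print r
        }
        # Skip empty lines entirely.
        /^ *$/ { next }
        # A new record starts with a timestamp like "Aug 25 07:07:33 ".
        /^[A-Z][a-z][a-z] +[0-9]+ [0-9][0-9]:[0-9][0-9]:[0-9][0-9] / {
            if (rec != "") emit(rec)
            rec = $0
            next
        }
        # Anything else is treated as the wrapped tail of the previous line.
        {
            sub(/^ +/, "")
            if (rec != "") rec = rec " " $0
        }
        END { if (rec != "") emit(rec) }
    '
}

# Possible use on a saved copy of the excerpt:
#   drbd_transitions < drbd-excerpt.log
```

Run against the excerpt above, this would reduce the Aug 31 section to
the role switches and the final disconnect sequence, which is the part
that matters for working out what the failover script did.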




