[DRBD-user] Lost of connections using Gigabit card

Lars Ellenberg Lars.Ellenberg at linbit.com
Thu Jul 29 00:19:41 CEST 2004



/ 2004-07-28 18:38:52 -0300
\ swamaral at ig.com.br:
> I am experiencing dropped connections with drbd. The nodes have two
> Gigabit nics and are linked by a crossover Gigabit cable. The first node
> (master) boots and waits for the second (slave) to connect. When the slave
> starts the drbd service I get these messages:
> 
> Setting up 'drbd0' .. disk ok .. net .. OK 
> drbd0: magic?? m: 0 c: 0 l: 0 

each packet we send has a header.
the first thing we transmit is a magic number.
the first thing we do is: we compare it to the expected value.

since we use tcp, and each tcp segment is protected by a tcp checksum,
this is an interesting thing:
something corrupts memory, either on the sending box
before the tcp checksum is generated, or on the receiving box after the
checksum has been verified.
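
to illustrate why the checksum does not help here, a minimal sketch of a
16-bit one's-complement checksum in the style of RFC 1071. it is simplified:
the real tcp checksum also covers the tcp header and a pseudo header, and the
function names and example payload are made up for this illustration.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # add big-endian 16-bit words
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


def verifies(data: bytes, checksum: int) -> bool:
    """True if the checksum matches the data as the receiver sees it."""
    return internet_checksum(data) == checksum


good = bytes.fromhex("0001f203f4f5f6f7")  # arbitrary example payload
zeroed = bytes(len(good))                 # the same buffer after being wiped

# corruption in transit: the checksum was computed over the good data,
# so the receiver notices the mismatch.
caught_in_transit = not verifies(zeroed, internet_checksum(good))

# corruption before checksumming: the checksum is computed over the
# already-bad data, so it verifies and the garbage is delivered.
slips_through = verifies(zeroed, internet_checksum(zeroed))
```

the second case is the one that matters: data that is wrong before the
checksum is generated (or that gets damaged after it has been verified)
passes as valid.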

connection should be aborted and drbd should
try to reconnect on these events.

several such messages in a row without a reconnect in between show a
logic error in drbd code.

anyway:
 some of your hardware (or maybe the nic driver)
 is seriously _broken_.

> And sometimes a protection fault is raised, but the system doesn't hang.

any TX errors displayed with ifconfig?
bad ram?

> Both master and slave nodes use the TEG-PICTX2 (NWay Copper Gigabit
> Ethernet Adapter) nic. The kernel modules that I've tried were the ns83820
> (suggested by kudzu) and the dpm (supplied by the vendor). I downloaded the
> most recent driver version from the vendor's site (www.trendnet.com), but,
> to my great surprise, the module doesn't compile due to many compilation
> errors. I notified the vendor's support and asked for the new release, but
> they sent me the same old version that I already have.

in any case, this is not a drbd problem, but an interesting test
environment to see how drbd behaves in such scenarios :-]

> Below are the /var/log/messages from both nodes:
> 
> (on master) 
> .. 
> Jan 31 17:03:25 master kernel: drbd: initialised. Version: 0.6.10 

ok, 0.6.10...

> (api:64/proto:62) 
> Jan 31 17:04:18 master kernel: drbd0: Connection established. size=2040220 KB / blksize=4096 B 
> Jan 31 17:04:18 master kernel: klogd 1.4.1, ---------- state change ---------- 
> Jan 31 17:04:18 master kernel: drbd0: Synchronisation started blks=15 
> Jan 31 17:04:19 master kernel: drbd0: sock_sendmsg returned -32 
> Jan 31 17:04:19 master kernel: drbd0: syncer send failed!! 
> Jan 31 17:04:19 master kernel: drbd0: Syncer send failed. 
> Jan 31 17:04:19 master kernel: drbd0: Connection lost. 
> Jan 31 17:04:29 master kernel: drbd0: Connection established. size=2040220 KB / blksize=4096 B 

not interesting...

> (on slave) 
> .. 
> Jan 31 16:15:40 slave kernel: drbd: initialised. Version: 0.6.10  (api:64/proto:62) 
> Jan 31 16:15:40 slave kernel: drbd0: Connection established. size=2040220 KB / blksize=4096 B 
> Jan 31 16:15:40 slave kernel: klogd 1.4.1, ---------- state change ---------- 
> Jan 31 16:15:40 slave kernel: drbd0: magic?? m: 0 c: 0 l: 0 

you receive something that is two 32bit words of zero, where we expect
a 32bit magic number, a 16bit command number, and a 16bit payload length.
we cannot do anything about that except drop the connection.
we never send such a beast on purpose.
that it survives the tcp checksumming indicates serious bus or ram problems,
or a really buggy nic driver.
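
the header check described above can be sketched roughly like this. it is an
illustration only, not the drbd code: the magic value is a placeholder, the
network byte order is an assumption, and `parse_header` is a made-up name.
the error text just mimics the log line quoted above.

```python
import struct

# placeholder for illustration; NOT the magic value drbd actually uses
EXPECTED_MAGIC = 0x12345678

# assumed layout: 32-bit magic, 16-bit command, 16-bit payload length,
# all in network (big-endian) byte order -> two 32-bit words in total
HEADER = struct.Struct("!IHH")


def parse_header(buf: bytes):
    """Return (command, length), or raise ValueError on a bad header."""
    if len(buf) < HEADER.size:
        raise ValueError("short header")
    magic, command, length = HEADER.unpack(buf[:HEADER.size])
    if magic != EXPECTED_MAGIC:
        # nothing sensible can be done with the stream after this point;
        # the caller is expected to drop the connection and reconnect
        raise ValueError(f"magic?? m: {magic} c: {command} l: {length}")
    return command, length
```

feeding it eight zero bytes, i.e. `parse_header(bytes(8))`, raises a
ValueError with the text `magic?? m: 0 c: 0 l: 0`.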

> The kernel is 2.4.22 and the distro is RH9. 
> 
> 
> Any help will be very much appreciated.
> If this information is not enough, please ask me for more details.

what? even more details?

what you could try is a tcpdump -s0 -i ethX -w some-dump-file,
(for two such connection lost/connection established cycles),
compress it, and send it to me, just for the fun of it.

what you _should_ do is repeatedly transfer large known data files
independently of drbd via nfs/smb/ftp/netcat/scp/whatever on that link
(try more than one protocol), and verify their checksums... I bet they
differ on each attempt, maybe even your ftp clients will die ...
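
the verification step could be as simple as the following sketch; md5sum or
sha256sum on both ends would do just as well. the function names and the
choice of sha256 are assumptions made for the example.

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Hex SHA-256 digest of a file, read in chunks so large files fit."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(original_path, copy_path):
    """True if the transferred copy is bit-for-bit identical."""
    return file_sha256(original_path) == file_sha256(copy_path)
```

run `same_content(original, copy)` after every transfer; on a healthy link it
should return True every single time.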

but I may be wrong, of course

 ;)


	Lars Ellenberg

-- 
please use the "List-Reply" function of your email client.


