Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
hello!
yesterday i upgraded disks in our drbd cluster wich involved a full sync
between the nodes (thanks again lars for providing support).
i used the opportunity to upgrade drbd to latest cvs and to switch to
protocol C. during fullsync some logmessages showed up i want to share
(please excuse the lengthy mail).
kaelte was primary on all devices and holding all services of the
cluster. timeline of what i did on kaelte [0]
now during fullsync and afterwards i got a lot [1] of messages like:
----------->8------------------------------------------------------
Feb 3 21:01:56 kaelte kernel: drbd2: [drbd_syncer_2/27957] sock_sendmsg
returned -32
Feb 3 21:01:56 kaelte kernel: drbd2: Syncer send failed.
Feb 3 21:01:56 kaelte kernel: drbd2: Connection lost.
Feb 3 21:01:56 kaelte kernel: drbd2: Connection established.
size=19535008 KB / blksize=4096 B
Feb 3 21:01:56 kaelte kernel: drbd2: Synchronisation started blks=15
Feb 3 21:02:31 kaelte kernel: drbd4: [drbd_syncer_4/27961] sock_sendmsg
time expired, ko = 4294967295
Feb 3 21:02:36 kaelte kernel: drbd0: [drbd_syncer_0/27953] sock_sendmsg
time expired, ko = 4294967295
Feb 3 21:02:47 kaelte kernel: drbd4: [drbd_syncer_4/27961] sock_sendmsg
time expired, ko = 4294967295
Feb 3 21:03:06 kaelte kernel: drbd2: [drbd_syncer_2/30738] sock_sendmsg
time expired, ko = 4294967295
Feb 3 21:03:23 kaelte kernel: drbd4: Synchronisation done.
Feb 3 21:08:10 kaelte kernel: drbd0: [drbd_syncer_0/27953] sock_sendmsg
returned -32
Feb 3 21:08:10 kaelte kernel: drbd0: Syncer send failed.
Feb 3 21:08:10 kaelte kernel: drbd0: Connection lost.
Feb 3 21:08:11 kaelte kernel: drbd0: Connection established.
size=9767488 KB / blksize=4096 B
Feb 3 21:08:11 kaelte kernel: drbd0: Synchronisation started blks=15
Feb 3 21:09:20 kaelte kernel: drbd2: [drbd_syncer_2/30738] sock_sendmsg
time expired, ko = 4294967295
Feb 3 21:17:14 kaelte kernel: drbd2: [drbd_syncer_2/30738] sock_sendmsg
returned -32
Feb 3 21:17:14 kaelte kernel: drbd2: Syncer send failed.
Feb 3 21:17:14 kaelte kernel: drbd2: Connection lost.
Feb 3 21:17:14 kaelte kernel: drbd2: Connection established.
size=19535008 KB / blksize=4096 B
Feb 3 21:17:14 kaelte kernel: drbd2: Synchronisation started blks=15
Feb 3 21:18:50 kaelte kernel: drbd2: [drbd_syncer_2/2439] send timed
out!!
Feb 3 21:18:50 kaelte kernel: drbd2: Syncer send failed.
Feb 3 21:18:50 kaelte kernel: drbd2: Connection lost.
Feb 3 21:18:50 kaelte kernel: drbd2: Connection established.
size=19535008 KB / blksize=4096 B
----------->8--------------------------------------------------------
Feb 3 20:49:27 atem kernel: drbd5: Creating state file
Feb 3 20:49:27 atem kernel: "/var/lib/drbd/drbd5"
Feb 3 20:49:29 atem kernel: drbd5: Connection established. size=96358
KB / blksize=1024 B
Feb 3 21:01:56 atem kernel: drbd2: unknown packet type!
Feb 3 21:01:56 atem kernel: drbd2: Connection lost.
Feb 3 21:01:56 atem kernel: drbd2: Connection established.
size=19535008 KB / blksize=4096 B
Feb 3 21:08:10 atem kernel: drbd0: unknown packet type!
Feb 3 21:08:10 atem kernel: drbd0: Connection lost.
Feb 3 21:08:11 atem kernel: drbd0: Connection established. size=9767488
KB / blksize=4096 B
Feb 3 21:17:13 atem kernel: drbd2: unknown packet type!
Feb 3 21:17:14 atem kernel: drbd2: Connection lost.
Feb 3 21:17:14 atem kernel: drbd2: Connection established.
size=19535008 KB / blksize=4096 B
Feb 3 21:18:50 atem kernel: drbd2: unknown packet type!
Feb 3 21:18:50 atem kernel: drbd2: Connection lost.
Feb 3 21:18:50 atem kernel: drbd2: Connection established.
size=19535008 KB / blksize=4096 B
----------->8--------------------------------------------------------
i read the other thread about "sock_sendmsg time expired, ko = xxxxx"
but i didnt run into problems of lockups on neither primary nor
secondary like ward did.
i also stumbled across
http://thread.gmane.org/gmane.comp.linux.drbd/5571
and i understand i have to tune the net sections [2] with my drbd
devices. maybe sync-nice=0 is too harsh? i'm just not sure how, cos i am
not really satisfied with the sync speed i get.
i use intel gigabit nics on 2.4.24 with rx polling support (mtu 1500)
for drbd. the two nodes are connected via crossover cable and sync speed
maxes out at approx. 13MB/s while i know that the disks do better [3]
and i know that the network interfaces can do better.
i tested net troughput with netstrain and got nice figures [4]
another note on synching: during fullsync pe was >> 0 on the sending
side and ua >> 0 on the receiving side; is this normal? after sync both
coloumns went back to zero.
another oddity that showed up in the logs was:
----------->8--------------------------------------------------------
Feb 4 03:00:05 kaelte kernel: drbd2: pending_cnt <0 !!!
----------->8--------------------------------------------------------
didn't see that for quite some time; last time i saw it it came along
with a deadlock on primary :-/
regards m
[0] drbd disk-size upgrade
----------->8--------------------------------------------------------
# drop all connections to drbd services
iptables -A INPUT -i eth0 -p tcp --destination-port ! 22 -j DROP
cd /etc/init.d
for i in postfix spamd courier-* httpd mysqld proftpd slapd
do
./$i stop
done
ps ax # verify that services are all stopped
for i in $(seq 0 5)
do
umount /dev/nb$i
done
ssh atem "drbd stop"
cp /etc/drbd.conf.new /etc/drbd.conf
/etc/init.d/drbd restart
for i in 0 2 3
do
e2fsck -f /dev/nb$i
resize2fs /dev/nb$i
done
mount /dev/nb0
df # verify disksize
umount /dev/nb0
# disksize ok? allright then:
ssh atem "cp /etc/drbd.conf.new /etc/drbd.conf
rm /var/lib/drbd/*
/etc/init.d/drbd start"
----------->8--------------------------------------------------------
[1]
----------->8--------------------------------------------------------
kaelte:root# grep -c "sock_sendmsg time expired" \
/var/log/syslog.0 /var/log/syslog
/var/log/syslog.0:374
/var/log/syslog:30
kaelte:root#
----------->8--------------------------------------------------------
[2] drbd.conf excerpt
----------->8--------------------------------------------------------
net {
sync-rate=60000
sync-nice=0
tl-size=5000
timeout=10
connect-int=10
ping-int=10
}
----------->8--------------------------------------------------------
[3] disk performance local vs. drbd
all filesystems are ext3 in ordered data mode, additionally drbd devices
are mounted with "noatime"
disks are two scsi uw 10k rpm 73GB in a logical RAID 1 array on each
node with HP netraid 1M hardware raid controller
----------->8--------------------------------------------------------
kaelte:root# # local filesystem
kaelte:root# sync; \
time dd if=/dev/zero of=/opt/400M bs=4k count=100000; \
time sync
100000+0 Records ein
100000+0 Records aus
real 0m4.391s
user 0m0.100s
sys 0m3.750s
real 0m12.580s
user 0m0.000s
sys 0m0.170s
kaelte:root#
# drbd fs
kaelte:root# sync; \
time dd if=/dev/zero of=/var/spool/courier/400M bs=4k count=100000; \
time sync
100000+0 Records ein
100000+0 Records aus
real 0m4.470s
user 0m0.060s
sys 0m4.050s
real 0m36.050s
user 0m0.000s
sys 0m2.520s
kaelte:root#
----------->8--------------------------------------------------------
[4] net throughput
----------->8--------------------------------------------------------
atem:bin# netstrain kaelte_priv 3333 send
NetStrain 3.0 (c) 2002 Christoph Pfisterer <cp at chrisp.de>
Looking up hostname kaelte_priv...
Connecting to 192.168.0.3 port 3333 using IPv4...
Connected
sent: 2796M, 97707.8K/s total, 109141.1K/s current
recv'd: 0B, 0B/s total, 0B/s current
atem:bin# # wow!
atem:bin# netstrain kaelte_priv 3333 both
NetStrain 3.0 (c) 2002 Christoph Pfisterer <cp at chrisp.de>
Looking up hostname kaelte_priv...
Connecting to 192.168.0.3 port 3333 using IPv4...
Connected
sent: 805M, 34737.8K/s total, 40908.3K/s current
recv'd: 800M, 34484.7K/s total, 40712.1K/s current
atem:bin# # not bad...
----------->8--------------------------------------------------------
--
vi vi vi - editor of the beast
vim vim vim - editor of the *ALL* *NEW* improved beast