hello!

yesterday i upgraded the disks in our drbd cluster, which involved a full sync between the nodes (thanks again lars for providing support). i used the opportunity to upgrade drbd to the latest cvs and to switch to protocol C. during the full sync some log messages showed up that i want to share (please excuse the lengthy mail).

kaelte was primary on all devices and was holding all services of the cluster. the timeline of what i did on kaelte is in [0].

during the full sync and afterwards i got a lot [1] of messages like:

----------->8--------------------------------------------------------
Feb 3 21:01:56 kaelte kernel: drbd2: [drbd_syncer_2/27957] sock_sendmsg returned -32
Feb 3 21:01:56 kaelte kernel: drbd2: Syncer send failed.
Feb 3 21:01:56 kaelte kernel: drbd2: Connection lost.
Feb 3 21:01:56 kaelte kernel: drbd2: Connection established. size=19535008 KB / blksize=4096 B
Feb 3 21:01:56 kaelte kernel: drbd2: Synchronisation started blks=15
Feb 3 21:02:31 kaelte kernel: drbd4: [drbd_syncer_4/27961] sock_sendmsg time expired, ko = 4294967295
Feb 3 21:02:36 kaelte kernel: drbd0: [drbd_syncer_0/27953] sock_sendmsg time expired, ko = 4294967295
Feb 3 21:02:47 kaelte kernel: drbd4: [drbd_syncer_4/27961] sock_sendmsg time expired, ko = 4294967295
Feb 3 21:03:06 kaelte kernel: drbd2: [drbd_syncer_2/30738] sock_sendmsg time expired, ko = 4294967295
Feb 3 21:03:23 kaelte kernel: drbd4: Synchronisation done.
Feb 3 21:08:10 kaelte kernel: drbd0: [drbd_syncer_0/27953] sock_sendmsg returned -32
Feb 3 21:08:10 kaelte kernel: drbd0: Syncer send failed.
Feb 3 21:08:10 kaelte kernel: drbd0: Connection lost.
Feb 3 21:08:11 kaelte kernel: drbd0: Connection established. size=9767488 KB / blksize=4096 B
Feb 3 21:08:11 kaelte kernel: drbd0: Synchronisation started blks=15
Feb 3 21:09:20 kaelte kernel: drbd2: [drbd_syncer_2/30738] sock_sendmsg time expired, ko = 4294967295
Feb 3 21:17:14 kaelte kernel: drbd2: [drbd_syncer_2/30738] sock_sendmsg returned -32
Feb 3 21:17:14 kaelte kernel: drbd2: Syncer send failed.
Feb 3 21:17:14 kaelte kernel: drbd2: Connection lost.
Feb 3 21:17:14 kaelte kernel: drbd2: Connection established. size=19535008 KB / blksize=4096 B
Feb 3 21:17:14 kaelte kernel: drbd2: Synchronisation started blks=15
Feb 3 21:18:50 kaelte kernel: drbd2: [drbd_syncer_2/2439] send timed out!!
Feb 3 21:18:50 kaelte kernel: drbd2: Syncer send failed.
Feb 3 21:18:50 kaelte kernel: drbd2: Connection lost.
Feb 3 21:18:50 kaelte kernel: drbd2: Connection established. size=19535008 KB / blksize=4096 B
----------->8--------------------------------------------------------
Feb 3 20:49:27 atem kernel: drbd5: Creating state file
Feb 3 20:49:27 atem kernel: "/var/lib/drbd/drbd5"
Feb 3 20:49:29 atem kernel: drbd5: Connection established. size=96358 KB / blksize=1024 B
Feb 3 21:01:56 atem kernel: drbd2: unknown packet type!
Feb 3 21:01:56 atem kernel: drbd2: Connection lost.
Feb 3 21:01:56 atem kernel: drbd2: Connection established. size=19535008 KB / blksize=4096 B
Feb 3 21:08:10 atem kernel: drbd0: unknown packet type!
Feb 3 21:08:10 atem kernel: drbd0: Connection lost.
Feb 3 21:08:11 atem kernel: drbd0: Connection established. size=9767488 KB / blksize=4096 B
Feb 3 21:17:13 atem kernel: drbd2: unknown packet type!
Feb 3 21:17:14 atem kernel: drbd2: Connection lost.
Feb 3 21:17:14 atem kernel: drbd2: Connection established. size=19535008 KB / blksize=4096 B
Feb 3 21:18:50 atem kernel: drbd2: unknown packet type!
Feb 3 21:18:50 atem kernel: drbd2: Connection lost.
Feb 3 21:18:50 atem kernel: drbd2: Connection established. size=19535008 KB / blksize=4096 B
----------->8--------------------------------------------------------

i read the other thread about "sock_sendmsg time expired, ko = xxxxx", but i didn't run into lockups on either the primary or the secondary like ward did.

i also stumbled across http://thread.gmane.org/gmane.comp.linux.drbd/5571 and i understand that i have to tune the net sections [2] of my drbd devices. maybe sync-nice=0 is too harsh?
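
to break those messages down per device (rather than one total per log file as in [1]), a minimal awk sketch along these lines could be used. the function name drbd_tally and the category labels are assumptions made up for illustration; the patterns are just the message texts from the excerpts above.

```shell
#!/bin/sh
# Tally selected drbd kernel messages per device from syslog-style input.
# drbd_tally and the category labels are hypothetical names for this sketch.
drbd_tally() {
    awk '
        {
            # find the field that looks like "drbdN:" and strip the colon
            dev = ""
            for (i = 1; i <= NF; i++)
                if ($i ~ /^drbd[0-9]+:$/) {
                    dev = substr($i, 1, length($i) - 1)
                    break
                }
            # skip lines that do not name a drbd device
            if (dev == "") next
        }
        /sock_sendmsg time expired/ { n[dev " time_expired"]++ }
        /sock_sendmsg returned/     { n[dev " sendmsg_error"]++ }
        /send timed out/            { n[dev " send_timeout"]++ }
        /unknown packet type/       { n[dev " unknown_packet"]++ }
        /Connection lost/           { n[dev " connection_lost"]++ }
        END { for (k in n) print k, n[k] }
    ' "$@" | sort
}
```

it reads the files given as arguments (or stdin if there are none), e.g. `drbd_tally /var/log/syslog.0 /var/log/syslog`, and prints one "device category count" line per combination it found.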
i'm just not sure how to tune them, because i am not really satisfied with the sync speed i get as it is. i use intel gigabit nics on 2.4.24 with rx polling support (mtu 1500) for drbd. the two nodes are connected via a crossover cable, and sync speed maxes out at approx. 13MB/s, while i know that the disks can do better [3] and that the network interfaces can do better. i tested net throughput with netstrain and got nice figures [4].

another note on syncing: during the full sync pe was >> 0 on the sending side and ua >> 0 on the receiving side; is this normal? after the sync both columns went back to zero.

another oddity that showed up in the logs was:

----------->8--------------------------------------------------------
Feb 4 03:00:05 kaelte kernel: drbd2: pending_cnt <0 !!!
----------->8--------------------------------------------------------

i hadn't seen that for quite some time; the last time i saw it, it came along with a deadlock on the primary :-/

regards
m

[0] drbd disk-size upgrade
----------->8--------------------------------------------------------
# drop all connections to drbd services
iptables -A INPUT -i eth0 -p tcp --destination-port ! 22 -j DROP
cd /etc/init.d
for i in postfix spamd courier-* httpd mysqld proftpd slapd
do
  ./$i stop
done
ps ax   # verify that services are all stopped
for i in $(seq 0 5)
do
  umount /dev/nb$i
done
ssh atem "drbd stop"
cp /etc/drbd.conf.new /etc/drbd.conf
/etc/init.d/drbd restart
for i in 0 2 3
do
  e2fsck -f /dev/nb$i
  resize2fs /dev/nb$i
done
mount /dev/nb0
df   # verify disksize
umount /dev/nb0
# disksize ok? all right then:
ssh atem "cp /etc/drbd.conf.new /etc/drbd.conf
rm /var/lib/drbd/*
/etc/init.d/drbd start"
----------->8--------------------------------------------------------

[1]
----------->8--------------------------------------------------------
kaelte:root# grep -c "sock_sendmsg time expired" \
  /var/log/syslog.0 /var/log/syslog
/var/log/syslog.0:374
/var/log/syslog:30
kaelte:root#
----------->8--------------------------------------------------------

[2] drbd.conf excerpt
----------->8--------------------------------------------------------
net {
  sync-rate=60000
  sync-nice=0
  tl-size=5000
  timeout=10
  connect-int=10
  ping-int=10
}
----------->8--------------------------------------------------------

[3] disk performance local vs. drbd

all filesystems are ext3 in ordered data mode; additionally, the drbd devices are mounted with "noatime". the disks are two scsi uw 10k rpm 73GB drives in a logical RAID 1 array on each node, behind an HP netraid 1M hardware raid controller.

----------->8--------------------------------------------------------
kaelte:root# # local filesystem
kaelte:root# sync; \
  time dd if=/dev/zero of=/opt/400M bs=4k count=100000; \
  time sync
100000+0 Records ein
100000+0 Records aus

real 0m4.391s
user 0m0.100s
sys 0m3.750s

real 0m12.580s
user 0m0.000s
sys 0m0.170s
kaelte:root# # drbd fs
kaelte:root# sync; \
  time dd if=/dev/zero of=/var/spool/courier/400M bs=4k count=100000; \
  time sync
100000+0 Records ein
100000+0 Records aus

real 0m4.470s
user 0m0.060s
sys 0m4.050s

real 0m36.050s
user 0m0.000s
sys 0m2.520s
kaelte:root#
----------->8--------------------------------------------------------

[4] net throughput
----------->8--------------------------------------------------------
atem:bin# netstrain kaelte_priv 3333 send
NetStrain 3.0 (c) 2002 Christoph Pfisterer <cp at chrisp.de>
Looking up hostname kaelte_priv...
Connecting to 192.168.0.3 port 3333 using IPv4...
Connected
sent: 2796M, 97707.8K/s total, 109141.1K/s current
recv'd: 0B, 0B/s total, 0B/s current
atem:bin# # wow!
atem:bin# netstrain kaelte_priv 3333 both
NetStrain 3.0 (c) 2002 Christoph Pfisterer <cp at chrisp.de>
Looking up hostname kaelte_priv...
Connecting to 192.168.0.3 port 3333 using IPv4...
Connected
sent: 805M, 34737.8K/s total, 40908.3K/s current
recv'd: 800M, 34484.7K/s total, 40712.1K/s current
atem:bin# # not bad...
----------->8--------------------------------------------------------

--
vi vi vi - editor of the beast
vim vim vim - editor of the *ALL* *NEW* improved beast