Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hi,

I've been trying to Google for this but haven't found anything quite matching. Sorry if this is covered elsewhere.

Using: RHEL 6 / DRBD 8.3.10-2 kmod from ELRepo / OCFS2 compiled from Red Hat's kernel source.

I have a Primary/Primary OCFS2 setup, which I built using the following instructions:
http://wiki.virtastic.com/display/howto/Clustered+Filesystem+with+DRBD+and+OCFS2+on+CentOS+5.5

Anyway, the two servers are linked through a pair of bonded GigE ports hooked up to a switch. I wanted this to be a dedicated network, but networking just gave me regular IPs on the same subnet as the primary interfaces. SELinux is permissive.

Everything was working, but then I made a change to iptables, did a "service iptables restart", and the next thing I knew I had a split-brain. Worse, the OCFS2 filesystem somehow started giving errors. I don't know if it really was corruption, but the error messages came in pretty fast. I recovered from the split-brain manually, but that didn't stop the messages, and they didn't clear up even with fsck.ocfs2. I finally had to find out what that inode was pointing to and remove it before the messages stopped.

Testing later, it looks like I sometimes get a split-brain when I restart iptables, although I haven't been able to reproduce the OCFS2 corruption. I would have thought the short time an iptables restart takes wouldn't cause a split-brain, but I guess it sometimes does. I'm not sure why it splits sometimes and not others.

Is this normal? Should I use "iptables -A" to add rules instead of doing a restart?
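(To illustrate what I mean by adding rules live instead of restarting: something roughly like the lines below, where the peer address and port are just placeholders; the real values would come from the "address" lines in /etc/drbd.conf.)

  # Insert an ACCEPT rule for the DRBD replication link without restarting iptables.
  # 10.0.0.2 and 7789 are placeholders for the peer address and resource port
  # taken from /etc/drbd.conf.
  iptables -I INPUT -p tcp -s 10.0.0.2 --dport 7789 -j ACCEPT
  # Save the running rules to /etc/sysconfig/iptables so a later restart/reboot keeps them.
  service iptables save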
Would posting the /etc/drbd.conf and /etc/sysconfig/iptables help? Any other info?

I got the following in /var/log/messages after restarting iptables:

Apr 18 07:52:27 server-2 kernel: block drbd1: asender terminated
Apr 18 07:52:27 server-2 kernel: block drbd1: Terminating asender thread
Apr 18 07:52:27 server-2 kernel: block drbd1: sock_sendmsg returned -32
Apr 18 07:52:27 server-2 kernel: block drbd1: short sent ReportUUIDs size=56 sent=0
Apr 18 07:52:27 server-2 kernel: block drbd1: Connection closed
Apr 18 07:52:28 server-2 kernel: block drbd1: conn( NetworkFailure -> Unconnected )
Apr 18 07:52:28 server-2 kernel: block drbd1: receiver terminated
Apr 18 07:52:28 server-2 kernel: block drbd1: Restarting receiver thread
Apr 18 07:52:28 server-2 kernel: block drbd1: receiver (re)started
Apr 18 07:52:28 server-2 kernel: block drbd1: conn( Unconnected -> WFConnection )
Apr 18 07:52:28 server-2 kernel: block drbd1: Handshake successful: Agreed network protocol version 96
Apr 18 07:52:28 server-2 kernel: block drbd1: conn( WFConnection -> WFReportParams )
Apr 18 07:52:28 server-2 kernel: block drbd1: Starting asender thread (from drbd1_receiver [5944])
Apr 18 07:52:28 server-2 kernel: block drbd1: data-integrity-alg: <not-used>
Apr 18 07:52:28 server-2 kernel: block drbd1: drbd_sync_handshake:
Apr 18 07:52:28 server-2 kernel: block drbd1: self 7891B6FC1469AE31:F7F25E6B00607741:571973CB1489F5B9:571873CB1489F5B9 bits:1 flags:0
Apr 18 07:52:28 server-2 kernel: block drbd1: peer AA6330CFB23C2663:F7F25E6B00607741:571973CB1489F5B9:571873CB1489F5B9 bits:73 flags:0
Apr 18 07:52:28 server-2 kernel: block drbd1: uuid_compare()=100 by rule 90
Apr 18 07:52:28 server-2 kernel: block drbd1: helper command: /sbin/drbdadm initial-split-brain minor-1
Apr 18 07:52:29 server-2 kernel: block drbd1: helper command: /sbin/drbdadm initial-split-brain minor-1 exit code 0 (0x0)
Apr 18 07:52:29 server-2 kernel: block drbd1: Split-Brain detected but unresolved, dropping connection!
Apr 18 07:52:29 server-2 kernel: block drbd1: helper command: /sbin/drbdadm split-brain minor-1
Apr 18 07:52:29 server-2 notify-split-brain.sh[8606]: invoked for res0
Apr 18 07:52:29 server-2 kernel: block drbd1: helper command: /sbin/drbdadm split-brain minor-1 exit code 0 (0x0)
Apr 18 07:52:29 server-2 kernel: block drbd1: conn( WFReportParams -> Disconnecting )
Apr 18 07:52:29 server-2 kernel: block drbd1: error receiving ReportState, l: 4!
Apr 18 07:52:29 server-2 kernel: block drbd1: meta connection shut down by peer.
Apr 18 07:52:29 server-2 kernel: block drbd1: asender terminated
Apr 18 07:52:29 server-2 kernel: block drbd1: Terminating asender thread
Apr 18 07:52:29 server-2 kernel: block drbd1: Connection closed
Apr 18 07:52:29 server-2 kernel: block drbd1: conn( Disconnecting -> StandAlone )
Apr 18 07:52:29 server-2 kernel: block drbd1: receiver terminated
Apr 18 07:52:29 server-2 kernel: block drbd1: Terminating receiver thread

And after that, the following messages kept coming in really fast. Rebooting, switching primary nodes, and fsck all didn't work; only finding the actual owner of the inode and removing it made them stop:

Apr 18 07:53:07 server-2 kernel: (8163,0):ocfs2_read_virt_blocks:853 ERROR: Inode #5377026 contains a hole at offset 466944
Apr 18 07:53:07 server-2 kernel: (8163,0):ocfs2_read_dir_block:533 ERROR: status = -5
Apr 18 07:53:08 server-2 kernel: (8163,12):ocfs2_read_virt_blocks:853 ERROR: Inode #5377026 contains a hole at offset 466944
Apr 18 07:53:08 server-2 kernel: (8163,12):ocfs2_read_dir_block:533 ERROR: status = -5
Apr 18 07:53:08 server-2 kernel: (8508,0):ocfs2_read_virt_blocks:853 ERROR: Inode #5377026 contains a hole at offset 466944
Apr 18 07:53:08 server-2 kernel: (8508,0):ocfs2_read_dir_block:533 ERROR: status = -5
Apr 18 07:53:08 server-2 kernel: (8508,0):ocfs2_read_virt_blocks:853 ERROR: Inode #5377026 contains a hole at offset 466944
Apr 18 07:53:08 server-2 kernel: (8508,0):ocfs2_read_dir_block:533 ERROR: status = -5
Apr 18 07:53:08 server-2 kernel: (8163,16):ocfs2_read_virt_blocks:853 ERROR: Inode #5377026 contains a hole at offset 466944
Apr 18 07:53:08 server-2 kernel: (8163,16):ocfs2_read_dir_block:533 ERROR: status = -5

Thanks!

Herman
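P.S. In case anyone hits the same OCFS2 errors: one way to map the inode number from the log back to a path is find's -inum option (a rough sketch; /cluster stands in for wherever the OCFS2 volume is mounted):

  # Look up which file or directory owns inode 5377026 on the OCFS2 mount.
  # -xdev keeps find from descending into other filesystems.
  find /cluster -xdev -inum 5377026
  # Once the path is known (and anything needed is backed up), it can be removed.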