Hello all,

I'm experiencing weird crashes with drbd 8.0.13 when trying to resynchronize the secondary node. The secondary crashes (without any oopses or other information in /var/log/messages) after some random period of resynchronization (around 20-30%).

On the primary there is a 2.6.15.6 kernel, and on the secondary I tried upgrading to 2.6.26.8. With that kernel the resync went OK once, but when I tested it again, the node crashed again. This is a 64-bit kernel, and the machine has an Adaptec AIC7902 Ultra320 SCSI adapter with 4 disks in a software RAID1 configuration. Interestingly, this problem started to appear when we replaced one disk in the RAID array.

Another drbd-user thread which I found suggests that this could be related to Supermicro motherboards. Indeed, there is a SuperMicro X6DA8 G2 i7525 on the primary, but a TYAN Thunder i7525 on the secondary (i.e. the one which crashes). I've tried loading the default settings on the Tyan board, but to no avail.

Unfortunately, I don't have physical access to the servers, so I'm trying to come up with a software solution (if possible :)

Could an upgrade to drbd 8.3.x help in this case?

Thanks for replies/ideas
Peter

Linux vwsrv2 2.6.26.8 #1 SMP Tue Jan 27 20:57:52 GST 2009 x86_64 x86_64 x86_64 GNU/Linux

version: 8.0.13 (api:86/proto:86)
GIT-hash: ee3ad77563d2e87171a3da17cc002ddfd1677dbe

Logs from primary:
Feb 3 10:45:14 vwsrv1 kernel: drbd0: Began resync as SyncSource (will sync 4 KB [1 bits set]).
Feb 3 10:45:14 vwsrv1 kernel: drbd0: Writing meta data super block now.
Feb 3 10:45:14 vwsrv1 kernel: drbd1: conn( WFBitMapS -> PausedSyncS ) pdsk( UpToDate -> Inconsistent )
Feb 3 10:45:14 vwsrv1 kernel: drbd1: Began resync as PausedSyncS (will sync 2024832 KB [506208 bits set]).
Feb 3 10:45:14 vwsrv1 kernel: drbd1: Writing meta data super block now.
Feb 3 10:45:14 vwsrv1 kernel: drbd2: conn( WFBitMapS -> SyncSource )
Feb 3 10:45:14 vwsrv1 kernel: drbd2: Began resync as SyncSource (will sync 122896880 KB [30724220 bits set]).
Feb 3 10:45:14 vwsrv1 kernel: drbd2: Writing meta data super block now.
Feb 3 10:45:14 vwsrv1 kernel: drbd1: pdsk( Inconsistent -> UpToDate ) peer_isp( 0 -> 1 )
Feb 3 10:45:14 vwsrv1 kernel: drbd1: Writing meta data super block now.
Feb 3 10:45:14 vwsrv1 kernel: drbd1: pdsk( UpToDate -> Inconsistent ) peer_isp( 1 -> 0 )
Feb 3 10:45:14 vwsrv1 kernel: drbd1: Writing meta data super block now.
Feb 3 10:45:14 vwsrv1 kernel: drbd0: Resync done (total 1 sec; paused 0 sec; 4 K/sec)
Feb 3 10:45:14 vwsrv1 kernel: drbd0: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
Feb 3 10:45:14 vwsrv1 kernel: drbd1: conn( PausedSyncS -> SyncSource ) aftr_isp( 1 -> 0 )
Feb 3 10:45:15 vwsrv1 kernel: drbd1: Syncer continues.
Feb 3 10:45:15 vwsrv1 kernel: drbd0: Writing meta data super block now.
Feb 3 10:45:33 vwsrv1 kernel: drbd0: PingAck did not arrive in time.
Feb 3 10:45:33 vwsrv1 kernel: drbd0: peer( Secondary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Feb 3 10:45:33 vwsrv1 kernel: drbd0: asender terminated
Feb 3 10:45:33 vwsrv1 kernel: drbd0: Terminating asender thread
Feb 3 10:45:33 vwsrv1 kernel: drbd0: short read expecting header on sock: r=-512
Feb 3 10:45:33 vwsrv1 kernel: drbd0: Creating new current UUID
Feb 3 10:45:33 vwsrv1 kernel: drbd0: Writing meta data super block now.
Feb 3 10:45:33 vwsrv1 kernel: drbd0: tl_clear()
Feb 3 10:45:33 vwsrv1 kernel: drbd0: Connection closed
Feb 3 10:45:33 vwsrv1 kernel: drbd0: conn( NetworkFailure -> Unconnected )
Feb 3 10:45:33 vwsrv1 kernel: drbd0: receiver terminated
Feb 3 10:45:33 vwsrv1 kernel: drbd0: Restarting receiver thread
Feb 3 10:45:33 vwsrv1 kernel: drbd0: receiver (re)started
Feb 3 10:45:33 vwsrv1 kernel: drbd0: conn( Unconnected -> WFConnection )
Feb 3 10:45:41 vwsrv1 kernel: drbd2: PingAck did not arrive in time.
Feb 3 10:45:41 vwsrv1 kernel: drbd2: peer( Secondary -> Unknown ) conn( SyncSource -> NetworkFailure )
Feb 3 10:45:41 vwsrv1 kernel: drbd2: asender terminated
Feb 3 10:45:41 vwsrv1 kernel: drbd2: Terminating asender thread
Feb 3 10:45:41 vwsrv1 kernel: drbd1: PingAck did not arrive in time.
Feb 3 10:45:41 vwsrv1 kernel: drbd1: peer( Secondary -> Unknown ) conn( SyncSource -> NetworkFailure )
Feb 3 10:45:41 vwsrv1 kernel: drbd1: asender terminated
Feb 3 10:45:41 vwsrv1 kernel: drbd1: Terminating asender thread
Feb 3 10:45:41 vwsrv1 kernel: drbd2: drbd_pp_alloc interrupted!
Feb 3 10:45:41 vwsrv1 kernel: drbd2: alloc_ee: Allocation of a page failed
Feb 3 10:45:41 vwsrv1 kernel: drbd2: error receiving RSDataRequest, l: 24!
Feb 3 10:45:41 vwsrv1 kernel: drbd1: drbd_pp_alloc interrupted!
Feb 3 10:45:41 vwsrv1 kernel: drbd1: alloc_ee: Allocation of a page failed
Feb 3 10:45:41 vwsrv1 kernel: drbd1: error receiving RSDataRequest, l: 24!
Feb 3 10:45:43 vwsrv1 kernel: drbd1: drbd_send_block() failed
Feb 3 10:45:43 vwsrv1 kernel: drbd1: Writing meta data super block now.
Feb 3 10:45:43 vwsrv1 kernel: drbd2: drbd_send_block() failed
Feb 3 10:45:43 vwsrv1 kernel: drbd2: Writing meta data super block now.
Feb 3 10:45:43 vwsrv1 kernel: drbd1: tl_clear()
Feb 3 10:45:43 vwsrv1 kernel: drbd1: Connection closed

For completeness, logs from secondary:
Feb 3 10:45:28 vwsrv2 kernel: Total HugeTLB memory allocated, 0
Feb 3 10:45:28 vwsrv2 kernel: VFS: Disk quotas dquot_6.5.1
Feb 3 10:45:28 vwsrv2 kernel: Dquot-cache hash table entries: 512 (order 0, 4096 bytes)
Feb 3 10:45:29 vwsrv2 kernel: msgmni has been set to 15985
Feb 3 10:45:29 vwsrv2 kernel: io scheduler noop registered
Feb 3 10:50:07 vwsrv2 syslogd 1.4.1: restart.

--
Peter LUCIAK (Peter.Luciak at iblsoft.com)
IBL Software Engineering, http://www.iblsoft.com/
Mierová 103, 82105 Bratislava, Slovakia
Phone: +421-2-32662111, Fax: +421-2-32662110
Direct: +421-2-32662175