Dear Philipp,
The "Unstable Outdated" problem has not occurred so far today.
However, a problem similar to CASE-16 occurred on Windows-DRBD (patched with your
latest version [86e4439]).
1. Version
- 86e4439
2. Steps to reproduce
1) force the "Outdated" state on each node with the "drbdsetup outdate" command
2) promote one node to primary: "drbdadm primary --force r0"
3) start a file copy
4) result:
(1) the file copy hangs right from the start.
(2) the primary stays in the WFBitMapS state and never leaves it.
3. Questions
CONSIDER_RESYNC is set in two places.
We suspect that the one in drbd_set_role() may be the problem.
If that function sets CONSIDER_RESYNC, a bitmap exchange should
follow.
Once this bitmap exchange starts, the hang occurs
intermittently.
So we tried disabling the bitmap exchange like this:
receive_bitmap()
{
    ...
    } else if (peer_device->repl_state[NOW] != L_WF_BITMAP_S) {
        /* admin may have requested C_DISCONNECTING,
         * other threads may have noticed network errors */
        drbd_info(peer_device, "unexpected repl_state (%s) in receive_bitmap\n",
                  drbd_repl_str(peer_device->repl_state[NOW]));
#ifdef _WIN32_V9
        err = -EIO;
        goto out;
#endif
    }
    ...
}
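To make the intent of the workaround easier to discuss, here is a minimal
standalone sketch of the guard, outside the kernel. It is only a model under
stated assumptions: the enum, the function name check_bitmap_state() and its
return convention are hypothetical stand-ins, the branches elided with "..."
above are not modelled, and the real receive_bitmap() does much more.

```c
#include <errno.h>
#include <stdio.h>

/* Hypothetical stand-ins for the DRBD replication states; only the two
 * values needed for this sketch are modelled. */
enum repl_state { L_ESTABLISHED, L_WF_BITMAP_S };

/* Simplified model of the workaround: a bitmap that arrives in any state
 * other than L_WF_BITMAP_S is rejected with -EIO instead of being processed.
 * Returns 0 when the bitmap may be processed. */
static int check_bitmap_state(enum repl_state now)
{
    if (now != L_WF_BITMAP_S) {
        fprintf(stderr, "unexpected repl_state (%d) in receive_bitmap\n",
                (int)now);
        return -EIO;
    }
    return 0;
}
```

In this model, check_bitmap_state(L_WF_BITMAP_S) returns 0, while any other
state returns -EIO, which corresponds to the "err = -EIO; goto out;" path
in the snippet above.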
1) What do you think about our workaround?
2) When does the "outdated-outdated" state occur?
3) Could you explain the meaning and purpose of CONSIDER_RESYNC in more detail?
4. Logs
- Here are two logs, one from Windows DRBD and the other from Linux DRBD.
- Both show a very similar pattern.
- Both appear to wait for a bitmap response in the WFBitMapS state.
- Both finally end in an I/O hang.
1) [CASE-16] Windows DRBD log
- http://pastebin.com/Sm7pJyTa
2) [CASE-16] Linux DRBD Log
- http://pastebin.com/VhGyBAwT
Thanks.
2016-02-10 14:58 GMT+09:00 김재헌 <jhkim at mantech.co.kr>:
> Dear Philipp,
>
> 1. Test version
> - CentOS-7 Linux 3.10.0-229.7.2.el7.x86_64
> - Engine: V9.0.1-1
> -- GIT-hash: f57acfc22d29a95697e683fb6bbacd9a1ad4348e build by
> root at drbd9-02, 2016-02-09 09:46:20
>
>
> 2. Test scenario
>
> 1) status
>
> [root at drbd9-01 ~]# drbdadm status r0
> r0 role:Secondary
> disk:Outdated
> drbd9-02 role:Secondary
> peer-disk:Outdated
>
> 2) try mount
>
> ...........
>
> *Feb 9 14:19:31 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: pdsk( Outdated
> -> Consistent ) repl( Established -> WFBitMapS )*
> Feb 9 14:19:31 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: send bitmap
> stats [Bytes(packets)]: plain 0(0), RLE 21(1), total 21; compression: 100.0%
> Feb 9 14:19:31 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: unexpected
> Feb 9 14:19:31 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: In UUIDs from
> node 1 found equal UUID (3D7DF30CF04727A4) for nodes 2 3
> Feb 9 14:19:31 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: I have
> C223F7CE3C9D7358 for node_id=2
> Feb 9 14:19:31 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: I have
> C223F7CE3C9D7358 for node_id=3
>
> Feb 9 14:19:35 drbd9-01 kernel: EXT4-fs (drbd1): mounting ext3 file
> system using the ext4 subsystem
> Feb 9 14:20:01 drbd9-01 systemd: Starting Session 13 of user root.
> Feb 9 14:20:01 drbd9-01 systemd: Started Session 13 of user root.
>
> Feb 9 14:21:49 drbd9-01 kernel: INFO: task mount:4758 blocked for more
> than 120 seconds.
> Feb 9 14:21:49 drbd9-01 kernel: "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Feb 9 14:21:49 drbd9-01 kernel: mount D ffff88003d613680 0
> 4758 4685 0x00000080
>