[DRBD-user] [CASE-16] WFBitMapS status is not ended and mount command is hung

Wed Apr 6 10:00:48 CEST 2016

Dear Philipp,

So far today, the "Unstable Outdated" problem has not occurred.

However, a problem similar to CASE-16 occurred on Windows-DRBD (patched with
your latest version [86e4439]).

1. Version
 - 86e4439

2. Steps to reproduce

 1) Force the "Outdated" state on each node with the "drbdsetup outdate" command.
 2) Promote one node to primary: "drbdadm primary --force r0"
 3) Copy a file.
 4) Result:
      (1) The file copy stalls from the very beginning.
      (2) The primary stays in the WFBitMapS state and never leaves it.

3. Questions
CONSIDER_RESYNC is set in two places.
We think the drbd_set_role() part may have a problem.
If this function sets CONSIDER_RESYNC, a bitmap exchange should follow.
Once that bitmap exchange starts, the hang occurs intermittently: sometimes
it happens and sometimes it does not.
So we tried to disable the bitmap exchange like this:

receive_bitmap()
{
    ...
    } else if (peer_device->repl_state[NOW] != L_WF_BITMAP_S) {
        /* admin may have requested C_DISCONNECTING,
         * other threads may have noticed network errors */
        drbd_info(peer_device, "unexpected repl_state (%s) in receive_bitmap\n",
                  drbd_repl_str(peer_device->repl_state[NOW]));
#ifdef _WIN32_V9
        err = -EIO;
        goto out;
#endif
    }
    ...
}
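
For illustration only, the effect of this change can be modeled in a small
standalone sketch. It is not DRBD code: the enum, the function names (all
suffixed _sketch) and the fail_on_unexpected flag are hypothetical stand-ins,
with the flag playing the role of the _WIN32_V9 guard. The elided parts of
receive_bitmap() are not modeled; only the branch shown above is.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Hypothetical, reduced stand-in for the replication states. */
enum repl_state_sketch {
    SK_ESTABLISHED,
    SK_WF_BITMAP_S
};

/* Human-readable name of a state, used only for the log line. */
static const char *repl_str_sketch(enum repl_state_sketch s)
{
    switch (s) {
    case SK_ESTABLISHED: return "Established";
    case SK_WF_BITMAP_S: return "WFBitMapS";
    }
    return "?";
}

/*
 * Models only the "else if" branch from the excerpt.
 * Returns 0 when processing would continue, -EIO when the packet is
 * rejected. fail_on_unexpected != 0 corresponds to building with the
 * guarded "err = -EIO; goto out;" lines enabled.
 */
static int check_bitmap_state_sketch(enum repl_state_sketch state,
                                     int fail_on_unexpected)
{
    if (state != SK_WF_BITMAP_S) {
        fprintf(stderr, "unexpected repl_state (%s) in receive_bitmap\n",
                repl_str_sketch(state));
        if (fail_on_unexpected)
            return -EIO;    /* workaround: abort instead of continuing */
    }
    return 0;
}

/* Exercises both variants; call it from main() to try the sketch. */
static void self_check_sketch(void)
{
    /* Expected state: accepted with or without the workaround. */
    assert(check_bitmap_state_sketch(SK_WF_BITMAP_S, 0) == 0);
    assert(check_bitmap_state_sketch(SK_WF_BITMAP_S, 1) == 0);
    /* Unexpected state: only logged without the workaround ... */
    assert(check_bitmap_state_sketch(SK_ESTABLISHED, 0) == 0);
    /* ... and rejected with -EIO when the workaround is enabled. */
    assert(check_bitmap_state_sketch(SK_ESTABLISHED, 1) == -EIO);
}
```

In this model the only behavioral difference is the return value for an
unexpected state: without the flag the message is logged and processing
continues, with the flag the caller sees -EIO.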

  1) What do you think about our workaround?
  2) When does the "outdated-outdated" state occur?
  3) Could you explain the meaning or purpose of CONSIDER_RESYNC in more detail?

4. Logs
 - Here are two logs: one from Windows DRBD, the other from Linux DRBD.
 - Both show a very similar pattern.
 - Both appear to be waiting for a bitmap response in the WFBitMapS state.
 - Both eventually end in an I/O hang.

 1) [CASE-16] Windows DRBD log
     - http://pastebin.com/Sm7pJyTa

 2) [CASE-16] Linux DRBD Log
    - http://pastebin.com/VhGyBAwT

Thanks.

2016-02-10 14:58 GMT+09:00 김재헌 <jhkim at mantech.co.kr>:

> Dear Philipp,
>
> 1. Test version
>  - CentOS-7 Linux 3.10.0-229.7.2.el7.x86_64
>  - Engine: V9.0.1-1
>      --  GIT-hash: f57acfc22d29a95697e683fb6bbacd9a1ad4348e build by
> root at drbd9-02, 2016-02-09 09:46:20
>
>
> 2. Test scenario
>
> 1) status
>
> [root at drbd9-01 ~]# drbdadm status r0
> r0 role:Secondary
>   disk:Outdated
>   drbd9-02 role:Secondary
>     peer-disk:Outdated
>
> 2) try mount
>

> ...........
>

> *Feb  9 14:19:31 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: pdsk( Outdated
> -> Consistent ) repl( Established -> WFBitMapS )*
> Feb  9 14:19:31 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: send bitmap
> stats [Bytes(packets)]: plain 0(0), RLE 21(1), total 21; compression: 100.0%
> Feb  9 14:19:31 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: unexpected
> Feb  9 14:19:31 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: In UUIDs from
> node 1 found equal UUID (3D7DF30CF04727A4) for nodes 2 3
> Feb  9 14:19:31 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: I have
> C223F7CE3C9D7358 for node_id=2
> Feb  9 14:19:31 drbd9-01 kernel: drbd r0/0 drbd1 drbd9-02: I have
> C223F7CE3C9D7358 for node_id=3
>
> Feb  9 14:19:35 drbd9-01 kernel: EXT4-fs (drbd1): mounting ext3 file
> system using the ext4 subsystem
> Feb  9 14:20:01 drbd9-01 systemd: Starting Session 13 of user root.
> Feb  9 14:20:01 drbd9-01 systemd: Started Session 13 of user root.
>
> Feb  9 14:21:49 drbd9-01 kernel: INFO: task mount:4758 blocked for more
> than 120 seconds.
> Feb  9 14:21:49 drbd9-01 kernel: "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Feb  9 14:21:49 drbd9-01 kernel: mount           D ffff88003d613680     0
> 4758   4685 0x00000080
>