[DRBD-user] Randomly crash drbdsetup

Thu Jun 28 13:58:01 CEST 2018

Hello Roland,
First, thanks for looking into my question.

On 28/06/2018 at 08:41, Roland Kammerer wrote:
> On Wed, Jun 27, 2018 at 12:37:20PM +0200, Julien Escario wrote:
>> Hello,
>> We're experiencing a really strange situation.
>> We often play with:
>> drbdmanage peer-device-options --resource <resource> --c-max-rate <rate>
>>
>> especially when a node crashes and needs a (full) resync.
>>
>> When doing this, sometimes (after 10 or 20 such commands), we end up with
>> drbdmanage completely stuck and a drbdsetup that seems to block on an I/O
>> without returning.
>> For example:
>> drbdsetup disk-options 144 --set-defaults --read-balancing=prefer-local
>> --al-extents=6481 --al-updates=no --md-flushes=no
>>
>> drbdadm status displays resources up to this one, then hangs on the drbdsetup call.
>>
>> drbdtop is still usable.
>>
>> So far, we have not managed to find a solution other than rebooting the node (sadly).
>>
>> Have you experienced such a situation?
>> What can cause this?
> 
> What version of DRBD9 is that (cat /proc/drbd)? drbdsetup hangs for a
> reason, kernel related, not an actual bug in drbdsetup. "dmesg" at that
> time would be interesting. Yes, I saw that, but not recently, only with
> by now pretty old versions of DRBD9.

I posted a full dump of the kernel messages here:
https://framabin.org/p/?988ac2e36beabde6#a6UmS1uK/idqlPCCgoIar8oeEcNjRf8kCmdlECPu+V4=
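
For anyone trying to confirm the same symptom before rebooting: a process
blocked inside the kernel normally shows up in uninterruptible sleep (state D).
The following is only a minimal sketch, assuming a procps `ps` and `awk` are
available; the helper names `filter_blocked` and `list_blocked` are my own and
not part of drbd-utils.

```shell
# Keep only the lines whose second column (process state) starts with "D",
# i.e. tasks in uninterruptible sleep. The ps header line is dropped as well,
# because its second column is the literal word "STAT".
filter_blocked() {
    awk '$2 ~ /^D/'
}

# List PID, state, kernel wait channel and command line of blocked tasks.
# Assumption: procps-style ps that accepts the "wchan:32" column width.
list_blocked() {
    ps -eo pid,stat,wchan:32,args | filter_blocked
}
```

Running `list_blocked` as root on the stuck node should show the hanging
drbdsetup if it really is in state D. If the kernel exposes it,
`cat /proc/<pid>/stack` gives the kernel stack of that process, and
`echo w > /proc/sysrq-trigger` asks the kernel to log all blocked tasks, which
then appear in `dmesg`.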

Versions:
cat /proc/drbd
version: 9.0.12-1 (api:2/proto:86-112)
Transports (api:16): tcp (9.0.12-1)

drbd-utils                                          9.3.0-1
drbdmanage-proxmox                                  2.1-1

Is that too old?

Can my problem be caused by running a two-node-only setup? Are 3 nodes the
required minimum for correct operation? (I'm aware that 3 nodes is the
recommended setup.)

>> Is there a way to unblock this process without rebooting?
> 
> Depends, but as a rule of thumb: when that happens the kernel is already
> in a state where you want to/have to reboot.

We can also see memory errors automatically corrected by ECC:
Jun 28 04:02:37 vm8 kernel: [159134.557190] {24}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
Jun 28 04:02:37 vm8 kernel: [159134.557191] {24}[Hardware Error]: It has been corrected by h/w and requires no further action
Jun 28 04:02:37 vm8 kernel: [159134.557192] {24}[Hardware Error]: event severity: corrected
Jun 28 04:02:37 vm8 kernel: [159134.557193] {24}[Hardware Error]:  Error 0, type: corrected
Jun 28 04:02:37 vm8 kernel: [159134.557195] {24}[Hardware Error]:  fru_text: CorrectedErr
Jun 28 04:02:37 vm8 kernel: [159134.557197] {24}[Hardware Error]: section_type: memory error
Jun 28 04:02:37 vm8 kernel: [159134.557198] {24}[Hardware Error]:   node: 0 device: 1
Jun 28 04:02:37 vm8 kernel: [159134.557200] {24}[Hardware Error]:   error_type: 2, single-bit ECC

Perhaps related, I don't know. The drbdsetup process also hung on the 'sane' node.

But a night of memtest (3 complete passes) didn't detect any error.
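
Since memtest found nothing, one rough way to keep an eye on corrected errors
over time is to read the counters that the kernel's EDAC subsystem keeps. This
is only a sketch: it assumes an EDAC driver is loaded and exposes
`mc*/ce_count` files under `/sys/devices/system/edac/mc`, which I have not
verified on this machine, and `sum_ce_counts` is just a name I made up.

```shell
# Sum the corrected-error counters of all memory controllers.
# Assumption: an EDAC driver exposes <base>/mc*/ce_count; the base directory
# can be overridden with the first argument.
sum_ce_counts() {
    base=${1:-/sys/devices/system/edac/mc}
    total=0
    for f in "$base"/mc*/ce_count; do
        [ -r "$f" ] || continue          # glob matched nothing or unreadable
        total=$((total + $(cat "$f")))
    done
    echo "$total"
}
```

Calling `sum_ce_counts` now and again after the next crash would show whether
the count of corrected errors grows around the time drbdsetup hangs.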

Best regards,
Julien Escario