[DRBD-user] Soft lockup CPU#2 stuck problems with DRBD 9 after rebooting other node

Jan Janicki jj at lp.pl
Thu Oct 20 16:15:54 CEST 2016


We are observing similar problems,
running Proxmox 4.3, with pve-kernel 4.4.16-65 and DRBD 9.0.4-1.

Three-node cluster, more than 60 DRBD resources, all with triple
replication; hosts connected with multiple bonded gigabit Ethernet
links, network offloads disabled for the eth* interfaces (rx off tx off
sg off gro off rxvlan off txvlan off rxhash off) - we are not sure
whether the offload settings for the bond interface itself matter.
The DRBD resources sit on LVM (not thin-provisioned) on a battery-backed
RAID controller, some groups of resources share I/O limits (HDD
spindles), and drbdmanage is in use.
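For reference, the per-interface offload settings above can be applied with a small loop. This is only a sketch: the interface names (eth0, eth1) are placeholders for the actual bond slaves, and the offload list is the one quoted above.

```shell
#!/bin/sh
# Disable the offloads listed above on every bond slave.
# eth0/eth1 are placeholder names - substitute the real bond members.
OFFLOADS="rx off tx off sg off gro off rxvlan off txvlan off rxhash off"
for dev in eth0 eth1; do
    ethtool -K "$dev" $OFFLOADS
done
```

Note that these settings do not survive a reboot; they have to be re-applied at boot (e.g. from an interface pre-up hook).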

When one of the nodes goes down (even during a clean reboot), a DRBD
resource can go dead on one of the remaining (still working) nodes:
we get the same soft lockup error in dmesg, and the affected resource
stops responding to drbdadm commands.
In the exact setup described above we had a single dead resource after
just one reboot.
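A quick way to spot such a hung resource is to run drbdadm status under a timeout, since the command otherwise blocks forever; the resource name r0 and the 5-second limit below are just example values:

```shell
#!/bin/sh
# If the worker thread is stuck, "drbdadm status" can hang;
# coreutils "timeout" turns that into a detectable failure.
# "r0" is an example resource name - substitute a real one.
if ! timeout 5 drbdadm status r0; then
    echo "resource r0 is not responding - check dmesg for soft lockup messages"
fi
```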

We observed this problem on earlier versions of DRBD 9 too, but then it
was a lot worse: we would get multiple dead resources, and sometimes the
whole kernel froze.

I am unable to reproduce this problem on a similar cluster with fewer
resources (~10), so it looks to me like a race condition that depends on
many resources changing state at the same time - perhaps resync starting
on other resources and consuming network and disk I/O, or simply the
sheer time it takes to change states.

I can provide full kernel logs if needed.

   Jan Janicki

On 2016-10-20 00:54, Maarten Bremer wrote:
> Hi,
> I have done some more investigation but I am still having a lot of 
> problems with CentOS 7, DRBD 9 and Xen. Is there anyone using the same 
> combination without issues?
> I did find a similar issue and suggestions in 
> http://lists.linbit.com/pipermail/drbd-user/2015-April/021938.html, 
> and tried disabling NIC offloading by using this:
> ethtool -K eno1 tso off gso off
> But as soon as I do a reboot or a network restart on one of the nodes, 
> everything is broken again with the CPU stuck errors, and I have to 
> reboot all servers.
> Does anyone have suggestions on what to try next?
> Kind regards,
> Maarten Bremer
>> We have problems with our three node DRBD 9 setup with CentOS 7 and Xen
>> 4. When one of our nodes is rebooted, or becomes unavailable, the other
>> nodes freeze entirely without any information, or give the following
>> message:
>> kernel:NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s!
>> [drbd_w_db:1469]
>> They then require a reboot, sometimes crashing one of the other nodes
>> again in the process. Not a fun thing in a HA setup...
>> Does anyone have an idea what is going on, and what we can do to prevent
>> this from happening?
>> We are running:
>> DRBD 9.0.4-1 (api:2/proto:86-112)
>> CentOS 7, kernel 3.18.41-20.el7.x86_64
>> Xen 4.6.3-3.el7
>> I do not know if it is related, but we are using bonding (mode 1,
>> active-backup) with two network adapters.
