[DRBD-user] Out of memory error when invoking fence-handler

Digimer lists at alteeve.ca
Mon Nov 10 15:32:32 CET 2014


On 10/11/14 09:16 AM, Lars Ellenberg wrote:
> On Mon, Nov 10, 2014 at 09:00:45AM -0500, Digimer wrote:
>> On 10/11/14 04:11 AM, Lars Ellenberg wrote:
>>> On Sun, Nov 09, 2014 at 04:05:52PM -0500, Digimer wrote:
>>>> CentOS 6.6, DRBD 8.3.16.
>>>>
>>>> So this sucked:
>>>>
>>>> After rebooting and restoring, I retried and got the same result a
>>>> second time. After moving my VMs to the other node, I tested
>>>> crashing the other node and again saw the "out of mem, failed to
>>>> invoke fence-peer helper" message. After that, I rebooted both
>>>> nodes. I've not yet tested if that resolved the issue.
>>>>
>>>> Anyone seen this before?
>>>
>>>> *** Nov  9 15:18:40 fea-c01n01 kernel: block drbd0: out of mem,
>>>> failed to invoke fence-peer helper
>>>
>>> Sure.
>>>
>>> Your kernel is too new for this DRBD.
>>> Your DRBD is too old for this kernel.
>>>
>>>
>>> As you know, we sometimes start some "handlers".
>>> We spawn new kernel threads for this.
>>>
>>> One of the relevant functions is kthread_run
>>> (and everything it calls).
>>>
>>> That used to fail only for hard out of memory conditions.
>>> (Thus the "nonsense" error message)
>>>
>>> At some point, upstream kernel changed the internals
>>> of that code path to no longer do a wait_for_completion(),
>>> but to do a wait_for_completion_killable().
>>>
>>>> Nov  9 15:21:16 fea-c01n01 kernel:      Not tainted 2.6.32-504.el6.x86_64 #1
>>>
>>> And apparently RHEL 6.6 has backported that change.
>>>
>>> Which means that now this can also fail because of pending signals.
>>> DRBD may routinely have a signal pending in the calling thread there.
>>>
>>> Upstream fix:
>>> http://git.linbit.com/gitweb.cgi?p=drbd-8.4.git;a=commitdiff;h=e998365475194a8faf31a86081e88034d7bd1a41
>>
>> Another list user emailed me off list pointing to that fix as well.
>> Problem is, it doesn't match the 8.3.16 source I have...
>
> So?
> There is *exactly* one occurrence of kthread_run in the drbd source,
> and the patch consists of *exactly* one non-comment line,
> which is "+       flush_signals(current);"
>
>   ;-)
>
> Besides: time to finally get rid of 8.3, then.
>
>> ===
>> void drbd_try_outdate_peer_async(struct drbd_conf *mdev)
>> {
>>          struct task_struct *opa;
>>
>>          opa = kthread_run(_try_outdate_peer_async, mdev,
>>                            "drbd%d_a_helper", mdev_to_minor(mdev));
>>          if (IS_ERR(opa))
>>                  dev_err(DEV, "out of mem, failed to invoke fence-peer helper\n");
>> }
>> ===
>>
>> Can I simply add the two missing lines?:
>>
>> ===
>> void drbd_try_outdate_peer_async(struct drbd_conf *mdev)
>> {
>>          struct task_struct *opa;
>>
>>          kref_get(&connection->kref);
>
> Don't add a kref get; that kref does not exist in 8.3 code.
> (It is also only a _context_ line in the patch.)
>
>>          flush_signals(current);
>>          opa = kthread_run(_try_outdate_peer_async, mdev,
>>                            "drbd%d_a_helper", mdev_to_minor(mdev));
>>          if (IS_ERR(opa))
>>                  dev_err(DEV, "out of mem, failed to invoke fence-peer helper\n");
>> }

I knew it was context only, but it didn't match what was there so I 
wanted to clarify.

So to summarize, I only add:

====
          flush_signals(current);
====

I'll brush off my old RPM notes and see if I can sort out a patch. Thanks!
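
For anyone reading this in the archive, here is a rough userspace model of
the failure described above. It is not kernel code. Every model_* name is
made up for illustration, and since the thread does not say which error
value kthread_run() hands back in this case, the model simply returns -1.
It only tries to show why the spawn can fail when a signal is pending in
the caller, and why flushing signals first avoids that.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for "the calling thread has a signal pending". */
bool model_signal_pending = false;

/* Models flush_signals(current): discard whatever is pending. */
void model_flush_signals(void)
{
    model_signal_pending = false;
}

/*
 * Models a thread spawn that waits killably for the new thread to start.
 * With a signal pending the wait is cut short and the spawn reports an
 * error, even though no memory shortage is involved.
 * Returns 0 on success, -1 on failure (assumed value, see note above).
 */
int model_kthread_run(void)
{
    return model_signal_pending ? -1 : 0;
}

/* Models the unpatched caller: spawn directly. */
int model_spawn_helper_unpatched(void)
{
    return model_kthread_run();
}

/* Models the patched caller: flush signals, then spawn. */
int model_spawn_helper_patched(void)
{
    model_flush_signals();
    return model_kthread_run();
}

/* Exercises both paths with a signal pending in the caller. */
void model_self_check(void)
{
    model_signal_pending = true;
    assert(model_spawn_helper_unpatched() != 0); /* misreported as "out of mem" */
    assert(model_spawn_helper_patched() == 0);   /* helper gets started */
}
```

In the real driver the only change is the single flush_signals(current)
line before the existing kthread_run() call; the rest of the function
stays as it is in 8.3.16.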

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?


