[DRBD-user] kernel panic with DRBD: solved

Mon Sep 12 17:52:49 CEST 2011

I have attempted to debug the kernel panic that I reported on this list last week, which has been reported by several others as well.  The panic happens when DRBD is used in clusters based on corosync (either RHCS or Pacemaker), but only when those clusters are configured with multiple heartbeats (i.e., with "altname" specifications for the cluster nodes).  The panic appears to be caused by two defects, one in the distributed lock manager (DLM, used by corosync) and one in the SCTP network protocol (which is used in clusters with multiple heartbeats).  DRBD code triggers the panic but appears to be blameless for it.

Disclaimer:  I am not a Linux kernel expert; all of my kernel debugging expertise is on a different flavor of Unix.  My assumptions or conclusions may be incorrect; I do not guarantee 100% accuracy of this analysis.  Caveat lector.

Environment:  As will be clear from the analysis below, this defect can manifest in many ways.  I debugged a particular manifestation that occurred with DRBD 8.4.0 running on kernel 2.6.32-71.29.1.el6.x86_64 (i.e., RHEL/CentOS 6.0).  To reproduce it, I ran a two-node cluster, shut down node A, and started it back up.  Node B panics as soon as node A starts back up.  (See my previous mail for the defect signature.)

When the cluster starts up, it creates a DLM "lockspace".  This causes the DLM code to create a socket for communication with the other nodes.  Since we're configured for multiple heartbeats, it's an SCTP socket.  DLM also creates a bunch of new kernel threads, among which is the dlm_recv thread, which listens for traffic on that socket.  (Actually I see two of them, one per CPU.)  You can see this in a "ps" listing.
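
For anyone who wants to check for these threads without eyeballing "ps" output, here is a minimal sketch that scans procfs for tasks whose name starts with a given prefix.  The function name find_threads and its proc_root parameter are my own inventions for illustration; the only thing taken from the analysis above is the thread name dlm_recv.  It assumes a Linux-style /proc where the first line of each status file is the "Name:" field.

```python
import os


def find_threads(prefix, proc_root="/proc"):
    """Return sorted (pid, name) pairs for tasks whose name starts with prefix.

    Hypothetical helper: reads the "Name:" line of <proc_root>/<pid>/status.
    """
    matches = []
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return matches  # no procfs available (e.g. a non-Linux system)
    for entry in entries:
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        status_path = os.path.join(proc_root, entry, "status")
        try:
            with open(status_path, errors="replace") as f:
                first = f.readline()
        except OSError:
            continue  # the task exited while we were scanning
        # The first line looks like "Name:\tdlm_recv/0"
        if first.startswith("Name:"):
            name = first.split(":", 1)[1].strip()
            if name.startswith(prefix):
                matches.append((int(entry), name))
    return sorted(matches)


if __name__ == "__main__":
    for pid, name in find_threads("dlm_recv"):
        print(pid, name)
```

On a node with an active lockspace this should list the dlm_recv threads; an empty result just means no task with that name prefix was found.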

An important thing to note here is that all kernel threads are part of the same pseudo-process, and as such, they all share the same set of file descriptors.  However, kernel threads do not normally (ever?) use file descriptors; they tend to work with file structures directly.  The SCTP socket created above, for example, has the appropriate in-kernel socket structure, file structure, and inode structure, but it does not have a file descriptor.  That's as it should be.

When node A starts back up, the SCTP protocol notices this (as it's supposed to), and delivers an SCTP_ASSOC_CHANGE / SCTP_RESTART notification to the SCTP socket, telling the socket owner (the dlm_recv thread) that the other node has restarted.  DLM responds by telling SCTP to create a clone of the master socket, for use in communicating with the newly restarted node.  (This is an SCTP_SOCKOPT_PEELOFF request.)  And this is where things go wrong: the SCTP_SOCKOPT_PEELOFF request is designed to be called from user space, not from a kernel thread, and so it does allocate a file descriptor for the new socket.  Since DLM is calling it from a kernel thread, the kernel thread now has an open file descriptor (#0) to that socket.  And since kernel threads all share the same file descriptor table, every kernel thread on the system has this open descriptor.  So defect #1 is that DLM is calling an SCTP user-space interface from a kernel thread, which results in pollution of the kernel thread file descriptor table.
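
The effect of a shared descriptor table can be illustrated from user space.  This is only an analogy, not kernel code: ordinary threads in one process share a descriptor table in the same way, so a descriptor opened by one thread is immediately usable by every other thread.  The function name open_in_worker is made up for this sketch.

```python
import os
import threading


def open_in_worker():
    """Open a pipe inside a worker thread and return its read descriptor."""
    result = {}

    def worker():
        r, w = os.pipe()
        os.write(w, b"hello")
        os.close(w)
        result["fd"] = r  # only the descriptor *number* is handed back

    t = threading.Thread(target=worker)
    t.start()
    t.join()
    return result["fd"]


fd = open_in_worker()
# The main thread never opened this descriptor, yet it can read from it,
# because both threads index into the same descriptor table.
data = os.read(fd, 5)
os.close(fd)
print(data)
```

In the panic scenario the roles are the same: one kernel thread (dlm_recv) ends up with a descriptor it should never have had, and every other kernel thread sees it too.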

Meanwhile, DRBD has its own in-kernel code, running in a different kernel thread.  And it detects (I didn't bother to track down how) that its peer is back online.  DRBD allows the user to configure handlers for events like that: user space programs that should be called when such an event occurs.  So when DRBD notices that its peer is back, its kernel thread uses call_usermodehelper() to start a user-space instance of drbdadm to invoke any appropriate handlers.  This is the invocation of drbdadm that we see in the panic report.  (drbdadm gets invoked this way in response to a number of other possible events, as well, so this panic can manifest itself in other ways.)

The key thing about this instance of drbdadm is that it was invoked by a kernel thread.  Therefore it shouldn't have any open file descriptors - but in this case, it does: it inherits fd 0 pointing to the SCTP socket.  One of the first things that drbdadm does, when starting up, is call isatty() on its standard input (file descriptor 0) to find out how it should format its output.  Under the covers, isatty() is a terminal ioctl issued on that descriptor.  If drbdadm were called from user space, that would correctly check whether standard input was interactive.  If it were called correctly from a kernel thread, there would be no stdin and the call would correctly return an error.  But what actually happens is that it calls isatty() on the SCTP socket that is (incorrectly) in file descriptor 0, which means the ioctl lands on that socket.
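
To show what that check looks like when the inherited descriptor is a healthy socket, here is a small user-space sketch.  It starts a child process whose fd 0 is one end of a socket pair and has the child call isatty on fd 0.  The function name child_isatty_on_socket_stdin is hypothetical, a UNIX socket pair stands in for the SCTP socket, and a POSIX system is assumed.  On a sound socket the call simply reports "not a terminal"; the panic requires the broken socket described in this mail.

```python
import socket
import subprocess
import sys


def child_isatty_on_socket_stdin():
    """Run a child whose stdin (fd 0) is a socket; return what isatty(0) says."""
    a, b = socket.socketpair()
    try:
        out = subprocess.check_output(
            [sys.executable, "-c", "import os; print(os.isatty(0))"],
            stdin=a.fileno(),  # the child inherits this socket as fd 0
        )
    finally:
        a.close()
        b.close()
    return out.decode().strip()


if __name__ == "__main__":
    print(child_isatty_on_socket_stdin())
```

The child has no way of knowing that its standard input is a socket it was never meant to have; it just makes the same call any well-behaved program would make.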

When ioctl is called on a socket, the sock_ioctl() function dereferences the socket data structure pointer (sk).  Defect #2 is that the offending socket in this case has a null sk pointer.  (I did not track down why, but presumably it's a problem with the SCTP peel-off code.)  So when sock_ioctl() dereferences the pointer, the kernel panics.

So, to recap:  this panic occurs because (a) the drbdadm process is erroneously given an SCTP socket as its standard input, and (b) that socket's data pointer is null, so it panics when drbdadm (reasonably) makes an ioctl call on its standard input.

If you need a workaround for this panic, the best I can offer is to remove the "altname" specifications from the cluster configuration and to set <totem rrp_mode="none"> and <dlm protocol="tcp">, so that corosync uses TCP sockets instead of SCTP sockets.
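
As a rough aid for checking whether a configuration still has the pieces named in the workaround, here is a sketch that inspects a cluster configuration document.  The function name unapplied_workaround_steps and the sample XML are made up for illustration, and I am assuming that <totem> and <dlm> sit directly under the root element and that each altname element carries a name attribute; adjust to match your actual file.

```python
import xml.etree.ElementTree as ET


def unapplied_workaround_steps(cluster_conf_xml):
    """Return a list of workaround steps that the given config has not applied."""
    root = ET.fromstring(cluster_conf_xml)
    findings = []

    # Step 1: no "altname" specifications anywhere in the configuration.
    altnames = [n.get("name", "?") for n in root.iter("altname")]
    if altnames:
        findings.append("altname entries present: " + ", ".join(altnames))

    # Step 2: <totem rrp_mode="none"> (assumed to be a child of the root).
    totem = root.find("totem")
    if totem is None or totem.get("rrp_mode") != "none":
        findings.append('totem rrp_mode is not set to "none"')

    # Step 3: <dlm protocol="tcp"> (assumed to be a child of the root).
    dlm = root.find("dlm")
    if dlm is None or dlm.get("protocol") != "tcp":
        findings.append('dlm protocol is not set to "tcp"')

    return findings


# Hypothetical sample configuration, not taken from a real cluster.
SAMPLE = """
<cluster name="demo" config_version="1">
  <totem rrp_mode="passive"/>
  <clusternodes>
    <clusternode name="node-a" nodeid="1">
      <altname name="node-a-alt"/>
    </clusternode>
  </clusternodes>
</cluster>
"""

if __name__ == "__main__":
    for finding in unapplied_workaround_steps(SAMPLE):
        print(finding)
```

An empty list means all three parts of the workaround are in place in the document that was checked; it says nothing about whether the running cluster has picked up the change.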

Regards,
Steven Roth
Hewlett-Packard Company

P.S.  Some readers of this mailing list may be frustrated by the lack of useful response from DRBD engineers.  I'd like to point out that the use of multiple heartbeats is a critical part of this defect scenario that was not mentioned in any of the panic reports (including mine).  I don't know if they tried, but DRBD engineers were not given sufficient information to reproduce the problem.
