[DRBD-user] kernel panic with DRBD: solved

Florian Haas f.g.haas at gmx.net
Mon Sep 12 18:43:53 CEST 2011

On 2011-09-12 17:52, Roth, Steven (ESS Software) wrote:
> I have attempted to debug the kernel panic that I reported on this list
> last week, which has been reported by several others as well.  The panic
> happens when DRBD is used in clusters based on corosync (either RHCS or
> Pacemaker), but only when those clusters are configured with multiple
> heartbeats (i.e., with “altname” specifications for the cluster nodes). 
> The panic appears to be caused by two defects, one in the distributed
> lock manager (DLM, used by corosync) and one in the SCTP network
> protocol (which is used in clusters with multiple heartbeats).  DRBD
> code triggers the panic but appears to be blameless for it.
>
> Disclaimer:  I am not a Linux kernel expert; all of my kernel debugging
> expertise is on a different flavor of Unix.  My assumptions or
> conclusions may be incorrect; I do not guarantee 100% accuracy of this
> analysis.  Caveat lector.
>
> Environment:  As will be clear from the analysis below, this defect can
> manifest in many ways.  I debugged a particular manifestation that
> occurred with DRBD 8.4.0 running on kernel 2.6.32-71.29.1.el6.x86_64
> (i.e., RHEL/CentOS 6.0).  The manifestation I debugged involved a two-node
> cluster: shut down node A, then start it back up.  Node B
> panics as soon as Node A starts back up.  (See my previous mail for the
> defect signature.)
>
> When the cluster starts up, it creates a DLM “lockspace”.  This causes
> the DLM code to create a socket for communication with the other nodes. 
> Since we’re configured for multiple heartbeats, it’s an SCTP socket. 
> DLM also creates a bunch of new kernel threads, among which is the
> dlm_recv thread, which listens for traffic on that socket.  (Actually I
> see two of them, one per CPU.)  You can see this in a “ps” listing.
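
As an alternative to reading "ps" output by eye, the check can be scripted. This is a minimal Python sketch, assuming a Linux system that exposes /proc/<pid>/comm (not verified for the RHEL 6.0 kernel discussed here). The helper name find_threads_by_name and its proc_root parameter are illustrative and do not come from the original mail.

```python
from pathlib import Path


def find_threads_by_name(prefix, proc_root="/proc"):
    """Return sorted (pid, comm) pairs for tasks whose name starts with prefix.

    Assumption: each numeric directory under proc_root holds a 'comm' file
    containing the task name, as on reasonably recent Linux kernels.
    """
    matches = []
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue  # skip non-process entries such as 'self' or 'meminfo'
        try:
            comm = (entry / "comm").read_text().strip()
        except OSError:
            continue  # the task exited, or the file is missing or unreadable
        if comm.startswith(prefix):
            matches.append((int(entry.name), comm))
    return sorted(matches)


if __name__ == "__main__":
    for pid, comm in find_threads_by_name("dlm_recv"):
        print(pid, comm)
```

On a node without a DLM lockspace the script prints nothing.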
>
> An important thing to note here is that all kernel threads are part of
> the same pseudo-process, and as such, they all share the same set of
> file descriptors.  However, kernel threads do not normally (ever?) use
> file descriptors; they tend to work with file structures directly.  The
> SCTP socket created above, for example, has the appropriate in-kernel
> socket structure, file structure, and inode structure, but it does not
> have a file descriptor.  That’s as it should be.
>
> When node A starts back up, the SCTP protocol notices this (as it’s
> supposed to), and delivers an SCTP_ASSOC_CHANGE / SCTP_RESTART
> notification to the SCTP socket, telling the socket owner (the dlm_recv
> thread) that the other node has restarted.  DLM responds by telling SCTP
> to create a clone of the master socket, for use in communicating with
> the newly restarted node.  (This is an SCTP_SOCKOPT_PEELOFF request.) 
> And this is where things go wrong: the SCTP_SOCKOPT_PEELOFF request is
> designed to be called from user space, not from a kernel thread, and so
> it /does/ allocate a file descriptor for the new socket.  Since DLM is
> calling it from a kernel thread, the kernel thread now has an open file
> descriptor (#0) to that socket.  And since kernel threads share the same
> file descriptor table, every kernel thread on the system has this open
> descriptor.  So defect #1 is that DLM is calling an SCTP user-space
> interface from a kernel thread, which results in pollution of the kernel
> thread file descriptor table.
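
The descriptor number (#0) mentioned above is consistent with the usual rule that a new descriptor takes the lowest free slot in the table. This is a user-space Python sketch of that rule only; it does not touch DLM or SCTP. The child script and the helper name first_fd_after_freeing_slot_zero are illustrative assumptions. The child frees slot 0 and then creates an ordinary socket, which is expected to land on descriptor 0.

```python
import subprocess
import sys

# Runs in a separate interpreter so the caller's own stdin is left alone.
CHILD_SCRIPT = """
import os, socket
os.close(0)  # free slot 0, imitating a table with nothing open at the low end
sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
print(sock.fileno())
"""


def first_fd_after_freeing_slot_zero():
    """Return the descriptor number a new socket receives once fd 0 is free."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD_SCRIPT],
        stdin=subprocess.DEVNULL,  # guarantee fd 0 is open before the child closes it
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


if __name__ == "__main__":
    print("new socket got fd", first_fd_after_freeing_slot_zero())
```

Any program started afterwards with that table would see the socket as its standard input, which is the situation the analysis goes on to describe.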
>
> Meanwhile, DRBD has its own in-kernel code, running in a different
> kernel thread.  And it detects (I didn’t bother to track down how) that
> its peer is back online.  DRBD allows the user to configure handlers for
> events like that: user space programs that should be called when such an
> event occurs.  So when DRBD notices that its peer is back, its kernel
> thread uses call_usermodehelper() to start a user-space instance of drbdadm
> to invoke any appropriate handlers.  This is the invocation of drbdadm
> that we see in the panic report.  (drbdadm gets invoked this way in
> response to a number of other possible events, as well, so this panic
> can manifest itself in other ways.)
>
> The key thing about this instance of drbdadm is that it was invoked by a
> kernel thread.  Therefore it shouldn’t have any open file descriptors —
> but in this case, it does: it inherits fd 0 pointing to the SCTP
> socket.  One of the first things that drbdadm does, when starting up, is
> call isatty(stdin) to find out how it should format its output.  If it
> were called from user space, that would correctly check whether standard
> input was interactive.  If it were called correctly from a kernel
> thread, there would be no stdin and it would correctly return an error. 
> But what actually happens is that it calls isatty on the SCTP socket
> that is (incorrectly) in file descriptor 0.
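
The inheritance described above can be imitated from user space. This Python sketch starts a helper process whose standard input is a socket, then has the helper run the same kind of terminal check on fd 0. It assumes Linux and uses a healthy AF_UNIX socket pair rather than a peeled-off SCTP socket, so the check merely fails instead of crashing anything. The names CHILD_SCRIPT and run_helper_with_socket_stdin are illustrative, and the TCGETS ioctl stands in for whatever terminal ioctl the C library's isatty issues.

```python
import socket
import subprocess
import sys

# Helper program: checks fd 0 the way a tool deciding on output format might.
CHILD_SCRIPT = """
import errno, fcntl, os, termios
is_tty = os.isatty(0)
try:
    fcntl.ioctl(0, termios.TCGETS, bytes(64))  # terminal ioctl issued on a socket
    err = "none"
except OSError as exc:
    err = errno.errorcode.get(exc.errno, str(exc.errno))
print(int(is_tty), err)
"""


def run_helper_with_socket_stdin():
    """Spawn the helper with a socket as fd 0; return (is_tty, ioctl_error_name)."""
    ours, theirs = socket.socketpair()
    with ours, theirs:
        result = subprocess.run(
            [sys.executable, "-c", CHILD_SCRIPT],
            stdin=theirs,  # the helper inherits this socket as its standard input
            capture_output=True,
            text=True,
            check=True,
        )
    tty_flag, err = result.stdout.split()
    return tty_flag == "1", err


if __name__ == "__main__":
    print(run_helper_with_socket_stdin())
```

With a well-formed socket the ioctl is simply rejected (typically with ENOTTY) and the helper carries on. The panic in this report needs the additional defect described next.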
>
> When ioctl is called on a socket, the sock_ioctl() function dereferences
> the socket data structure pointer (sk).  Defect #2 is that the offending
> socket in this case has a null sk pointer.  (I did not track down why,
> but presumably it’s a problem with the SCTP peel-off code.)  So when
> sock_ioctl() dereferences the pointer, the kernel panics.
>
> So, to recap:  this panic occurs because (a) the drbdadm process is
> erroneously given an SCTP socket as its standard input, and (b) that
> socket’s data pointer is null, so it panics when drbdadm (reasonably)
> makes an ioctl call on its standard input.
>
> If you need a workaround for this panic, the best I can offer is to
> remove the “altname” specifications from the cluster configuration, set
> <totem rrp_mode="none"> and <dlm protocol="tcp">, so that corosync uses
> TCP sockets instead of SCTP sockets.

Wow. Pretty awesome analysis. This is something that the openais
(Corosync) mailing list should know about, but it currently seems
affected by the LF breach/outage. Thus, I'm CC'ing two key people here
directly. Fabio and Steve, could you take a look into this please?

Cheers,
Florian
