On 9/12/2011 6:43 PM, Florian Haas wrote:
> On 2011-09-12 17:52, Roth, Steven (ESS Software) wrote:
>> I have attempted to debug the kernel panic that I reported on this
>> list last week, which has been reported by several others as well.
>> The panic happens when DRBD is used in clusters based on corosync
>> (either RHCS or Pacemaker), but only when those clusters are
>> configured with multiple heartbeats (i.e., with “altname”
>> specifications for the cluster nodes). The panic appears to be
>> caused by two defects, one in the distributed lock manager (DLM,
>> used by corosync) and one in the SCTP network protocol (which is
>> used in clusters with multiple heartbeats). DRBD code triggers the
>> panic but appears to be blameless for it.
>>
>> Disclaimer: I am not a Linux kernel expert; all of my kernel
>> debugging expertise is on a different flavor of Unix. My assumptions
>> or conclusions may be incorrect; I do not guarantee 100% accuracy of
>> this analysis. Caveat lector.
>>
>> Environment: As will be clear from the analysis below, this defect
>> can manifest in many ways. I debugged a particular manifestation
>> that occurred with DRBD 8.4.0 running on kernel
>> 2.6.32-71.29.1.el6.x86_64 (i.e., RHEL/CentOS 6.0). The manifestation
>> I debugged was running a two-node cluster, shutting down node A and
>> starting it back up. Node B panics as soon as node A starts back up.
>> (See my previous mail for the defect signature.)
>>
>> When the cluster starts up, it creates a DLM “lockspace”. This
>> causes the DLM code to create a socket for communication with the
>> other nodes. Since we’re configured for multiple heartbeats, it’s an
>> SCTP socket. DLM also creates a bunch of new kernel threads, among
>> which is the dlm_recv thread, which listens for traffic on that
>> socket. (Actually I see two of them, one per CPU.) You can see this
>> in a “ps” listing.
>>
>> An important thing to note here is that all kernel threads are part
>> of the same pseudo-process, and as such, they all share the same set
>> of file descriptors. However, kernel threads do not normally (ever?)
>> use file descriptors; they tend to work with file structures
>> directly. The SCTP socket created above, for example, has the
>> appropriate in-kernel socket structure, file structure, and inode
>> structure, but it does not have a file descriptor. That’s as it
>> should be.
>>
>> When node A starts back up, the SCTP protocol notices this (as it’s
>> supposed to), and delivers an SCTP_ASSOC_CHANGE / SCTP_RESTART
>> notification to the SCTP socket, telling the socket owner (the
>> dlm_recv thread) that the other node has restarted. DLM responds by
>> telling SCTP to create a clone of the master socket, for use in
>> communicating with the newly restarted node. (This is an
>> SCTP_SOCKOPT_PEELOFF request.) And this is where things go wrong:
>> the SCTP_SOCKOPT_PEELOFF request is designed to be called from user
>> space, not from a kernel thread, and so it /does/ allocate a file
>> descriptor for the new socket. Since DLM is calling it from a kernel
>> thread, the kernel thread now has an open file descriptor (#0) to
>> that socket. And since kernel threads share the same file descriptor
>> table, every kernel thread on the system has this open descriptor.
>> So defect #1 is that DLM is calling an SCTP user-space interface
>> from a kernel thread, which results in pollution of the kernel
>> thread file descriptor table.
>>
>> Meanwhile, DRBD has its own in-kernel code, running in a different
>> kernel thread. And it detects (I didn’t bother to track down how)
>> that its peer is back online. DRBD allows the user to configure
>> handlers for events like that: user-space programs that should be
>> called when such an event occurs.
>> So when DRBD notices that its peer is back, its kernel thread uses
>> call_usermodehelper() to start a user-space instance of drbdadm to
>> invoke any appropriate handlers. This is the invocation of drbdadm
>> that we see in the panic report. (drbdadm gets invoked this way in
>> response to a number of other possible events as well, so this panic
>> can manifest itself in other ways.)
>>
>> The key thing about this instance of drbdadm is that it was invoked
>> by a kernel thread. Therefore it shouldn’t have any open file
>> descriptors — but in this case, it does: it inherits fd 0 pointing
>> to the SCTP socket. One of the first things that drbdadm does, when
>> starting up, is call isatty(stdin) to find out how it should format
>> its output. If it were called from user space, that would correctly
>> check whether standard input was interactive. If it were called
>> correctly from a kernel thread, there would be no stdin and it would
>> correctly return an error. But what actually happens is that it
>> calls isatty on the SCTP socket that is (incorrectly) in file
>> descriptor 0.
>>
>> When ioctl is called on a socket, the sock_ioctl() function
>> dereferences the socket data structure pointer (sk). Defect #2 is
>> that the offending socket in this case has a null sk pointer. (I did
>> not track down why, but presumably it’s a problem with the SCTP
>> peel-off code.) So when sock_ioctl() dereferences the pointer, the
>> kernel panics.
>>
>> So, to recap: this panic occurs because (a) the drbdadm process is
>> erroneously given an SCTP socket as its standard input, and (b) that
>> socket’s data pointer is null, so it panics when drbdadm
>> (reasonably) makes an ioctl call on its standard input.
>>
>> If you need a workaround for this panic, the best I can offer is to
>> remove the “altname” specifications from the cluster configuration,
>> set <totem rrp_mode="none"> and <dlm protocol="tcp">, so that
>> corosync uses TCP sockets instead of SCTP sockets.
>
> Wow. Pretty awesome analysis. This is something that the openais
> (Corosync) mailing list should know about, but it currently seems
> affected by the LF breach/outage. Thus, I'm CC'ing two key people
> here directly. Fabio and Steve, could you take a look into this
> please?

I am CC'ing David and cluster-devel. David maintains DLM in the kernel
and in userland.

A few quick notes about using RRP/altname in a more general fashion:
RRP/altname is expected to be in Technology Preview state starting from
RHEL 6.2 (the technology will be there for users to test/try, but not
yet officially supported for production). We have not done a lot of
intensive testing on the overall RHCS stack yet (except corosync,
which, by the way, does not use DLM), so there might be (== there are)
bugs that we will have to address.

Packages in RHEL 6.2/CentOS 6.2 will have reasonable defaults and are
expected to work better (though far from bug-free) than those in
RHEL 6.0/CentOS 6.0. It will take some time before the whole stack is
fully tested/supported in such an environment, but it is a work in
progress.

This report is extremely useful and will surely speed things up a lot.

Thanks
Fabio
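[Editor's note: for readers looking for the workaround in context, it would land in /etc/cluster/cluster.conf roughly as sketched below. This is an illustrative fragment, not a tested configuration; the cluster name, node names, and node IDs are placeholders to be adapted to your setup.]

```xml
<?xml version="1.0"?>
<cluster name="mycluster" config_version="2">
  <!-- Single-ring mode: no redundant ring protocol -->
  <totem rrp_mode="none"/>
  <!-- DLM inter-node communication over TCP instead of SCTP -->
  <dlm protocol="tcp"/>
  <clusternodes>
    <!-- No altname elements: a single heartbeat network only -->
    <clusternode name="node-a" nodeid="1"/>
    <clusternode name="node-b" nodeid="2"/>
  </clusternodes>
</cluster>
```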