[Csync2] actions are incorrectly run on local host, possible problem with signal(SIGCHLD, SIG_IGN)

Thu Jul 10 18:05:52 CEST 2008

Summary
-------

csync2 is running on hostA and hostB in standalone server mode (-ii),
syncing a single small text file.

csync2 -ii (standalone server mode) calls signal(SIGCHLD, SIG_IGN),
which ultimately prevents the removal of entries from the "action"
table.  Thus, when hostA runs "csync2 -x", the actions are executed on
hostB but not cleared from hostB's table.  As a result, subsequent runs
of "csync2 -cr /" on hostB unexpectedly run those actions as well.

Question for csync2 developers, in csync2.c line 239:

  signal(SIGCHLD, SIG_IGN);

Is that SIG_IGN correct and necessary, especially in SERVER_MODE
(-ii)?  My problem goes away if I comment out that line.

Details
-------

csync2-1.34 running on two hosts, hostA and hostB.

/etc/csync2.cfg:
group failovergroup
{
  host hostA hostB;
  key /etc/csync2_shared_key;
  include /etc/sysconfig/service.conf;
  action
  {
    pattern /etc/sysconfig/service.conf;
    exec "/etc/rc.d/init.d/service start";
    logfile "/var/log/csync2_action.log";
  }
}

On both hosts, I have csync2 running under daemontools "supervise".

http://cr.yp.to/daemontools/supervise.html

This is important because it is probably where my setup differs from
that of most users, which would explain why this problem rarely, if
ever, comes up.

I start the service via a SysV style init script that starts
supervise, then runs:

/var/service/csync2/run:
#!/bin/sh
. /etc/sysconfig/network
logger -p daemon.err -t "csync2[$$]" "supervise starting csync2"
exec /usr/sbin/csync2 -N $HOSTNAME -ii

Daemontools users will wonder why I'm using a SysV style init script
rather than "svscan", but that is out of scope.  Just know that the
process tree looks like this (e.g. pstree -p output):

init(1)-+
        |-supervise(21919)---csync2(21921)

Now run "csync2 -cr /" on both hostA and hostB to prepare their databases.

Now edit the watched file /etc/sysconfig/service.conf on hostA.

[root@hostA ~]# csync2 -cr /
[root@hostA ~]# csync2 -M
chary   hostA       hostB       /etc/sysconfig/service.conf
[root@hostA ~]#

On hostB see that there are not yet any currently scheduled actions.

[root@hostB ~]# sqlite /var/lib/csync2/hostB.db
SQLite version 2.8.17
Enter ".help" for instructions
sqlite> select * from action;
sqlite>

Now sync from hostA via "csync2 -x".

Observe that the action was executed on hostB.

[root@hostB ~]# tail /var/log/csync2_action.log
/etc/rc.d/init.d/service was run

But now observe on hostB that the action table still has that action present:

[root@hostB ~]# sqlite /var/lib/csync2/hostB.db
SQLite version 2.8.17
Enter ".help" for instructions
sqlite> select * from action;
/etc/sysconfig/service.conf|/etc/rc.d/init.d/service%20start|/var/log/csync2_action.log
sqlite>

Why is the action still present in the action table?

Observe the relevant portions of an strace of csync2 running on hostB:

...
21921 waitpid(21922,  <unfinished ...>
...
21922 execve("/bin/sh", ["sh", "-c", "/etc/rc.d/init.d/pipefilter star"...], [/* 25 vars */]) = 0
...
21921 <... waitpid resumed> NULL, 0)    = -1 ECHILD (No child processes)
21921 write(2, "<21921> ", 8)           = 8
21921 write(2, "ERROR: Waitpid returned error No"..., 50) = 50
...

That last line is csync2 writing the error:
<21921> ERROR: Waitpid returned error No child processes.

That's our clue. This comes from action.c line 114:

  if ( waitpid(pid, 0, 0) < 0 )
    csync_fatal("ERROR: Waitpid returned error %s.\n", strerror(errno));

  for (t = tl; t != 0; t = t->next)
    SQL("Remove action entry",
        "DELETE FROM action WHERE command = '%s' "
        "and logfile = '%s' and filename = '%s'",
        command, logfile, t->value);

There, we see that csync_fatal() is called when waitpid fails, so we
bail out before the loop that removes the action entries ever runs.

So why is waitpid returning -1?

In csync2.c line 239:

  signal(SIGCHLD, SIG_IGN);

See man waitpid:

"If  the calling process has SA_NOCLDWAIT set or has SIGCHLD set to
SIG_IGN, and the process has no unwaited-for children  that  were
transformed  into zombie processes, the calling thread shall block
until all of the children of the process containing the calling
thread  terminate,  and  wait()  and waitpid() shall fail and set
errno to [ECHILD]."

So, because the calling process set SIGCHLD to SIG_IGN, waitpid blocks
until the child has terminated and then fails with errno set to ECHILD.

An empirical test is to comment out line 239:

  //signal(SIGCHLD, SIG_IGN);

And indeed, csync2 -N $HOSTNAME -ii on hostB thereafter reports the
correct child status, and thus clears the action table:

...
22371 waitpid(22372,  <unfinished ...>
...
22372 execve("/bin/sh", ["sh", "-c", "/etc/rc.d/init.d/pipefilter star"...], [/* 25 vars */]) = 0
...
22371 <... waitpid resumed> NULL, 0)    = 22372
22371 --- SIGCHLD (Child exited) @ 0 (0) ---
...
22371 fstat64(5, {st_mode=S_IFREG|0644, st_size=13312, ...}) = 0
22371 _llseek(5, 0, [0], SEEK_SET)      = 0
22371 read(5, "** This file contains an SQLite "..., 1024) = 1024
...

So, finally, is it safe to remove the SIG_IGN?  Should it be wrapped
in a conditional testing for SERVER_MODE?

I note on csync2.c line 461:

  /* Stand-alone server mode. This is a hack..
   */
  if ( mode == MODE_SERVER || mode == MODE_SINGLE ) {
    if (csync_server_loop(mode == MODE_SINGLE)) return 1;
    mode = MODE_INETD;
  }

Perhaps "This is a hack" is another clue?  Perhaps we need
csync_server_loop to be aware of a subtle difference between
MODE_SINGLE and MODE_SERVER?