[Pacemaker] crmd exits and restarts after failback

Fri Dec 10 08:54:23 UTC 2010

On Wed, Dec 8, 2010 at 11:58 AM, Simon Jansen
<simon.jansen1 at googlemail.com> wrote:
> Hi,
>
> I have set up a Pacemaker cluster on Ubuntu 10.04 LTS Server.
> I also wrote a multistate OCF RA for the Rsyslog service. This RA passes
> all tests run by the ocf-tester tool.
>
> Now the problem:
> When I first start the msSyslog resource, it is promoted on node1 and is
> fully functional. I then put node1 into standby, and the other node (node2)
> takes over the master role, just as expected. Next I bring node1 back
> online to test whether failback works. This is where the error occurs: the
> crmd exits and starts again. This repeats in an endless loop, and the only
> way I can get back to a functional state is to reboot both nodes several
> times.
> I have attached a summary of the log file so that you can see exactly what
> is happening, along with the Rsyslog RA and the cluster config.
>
> Maybe someone has a clue why the crmd keeps restarting after the failback.
> I suspect an error in the Rsyslog RA, because the cluster works fine when I
> stop the Rsyslog resource manually.

Here's the reason:

Dec  8 11:15:14 node1 crmd: [31284]: ERROR: send_ipc_message: IPC Channel to 31285 is not connected
Dec  8 11:15:14 node1 crmd: [31284]: ERROR: do_pe_invoke_callback: Could not contact the pengine
Dec  8 11:15:14 node1 crmd: [31284]: info: do_pe_invoke_callback: Invoking the PE: query=32, ref=pe_calc-dc-1291803314-10, seq=736, quorate=1
Dec  8 11:15:14 node1 crmd: [31284]: info: pe_msg_dispatch: Received HUP from pengine:[31285]
Dec  8 11:15:14 node1 crmd: [31284]: CRIT: pe_connection_destroy: Connection to the Policy Engine failed (pid=31285, uuid=2525f074-89f6-468e-8900-14d278808c31)
...
Dec  8 11:15:15 node1 corosync[898]:   [pcmk  ] ERROR: pcmk_wait_dispatch: Child process pengine terminated with signal 11 (pid=31285, core=false)

The policy engine (pengine) appears to be crashing with signal 11 (a
segmentation fault), and this is causing the crmd to restart as part of
the recovery.
Perhaps file a bug with Ubuntu asking them to pull in a more recent
version of Pacemaker.
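
If you want to check other logs for the same pattern, something along
these lines might help. It is only a rough sketch: the function name
find_crashed_children is made up for illustration, and the regular
expression is based solely on the single pcmk_wait_dispatch message
quoted above, so other Pacemaker versions may word it differently.

```python
import re
import signal

# Pattern modelled on the one "Child process ... terminated with signal"
# line in the log excerpt; this wording is an assumption for other versions.
CHILD_DIED = re.compile(
    r"Child process (?P<name>\S+) terminated with signal (?P<signum>\d+)"
    r"\s+\(pid=(?P<pid>\d+), core=(?P<core>true|false)\)"
)


def find_crashed_children(log_text):
    """Return one dict per child process reported as killed by a signal."""
    # Long log lines are often wrapped (for example when pasted into mail),
    # so collapse all whitespace before matching.
    flat = " ".join(log_text.split())
    crashes = []
    for match in CHILD_DIED.finditer(flat):
        signum = int(match.group("signum"))
        try:
            signame = signal.Signals(signum).name
        except ValueError:
            signame = "unknown"
        crashes.append({
            "name": match.group("name"),
            "pid": int(match.group("pid")),
            "signum": signum,
            "signal": signame,
            "core": match.group("core") == "true",
        })
    return crashes


if __name__ == "__main__":
    sample = (
        "Dec  8 11:15:15 node1 corosync[898]:   [pcmk  ] ERROR:\n"
        "pcmk_wait_dispatch: Child process pengine terminated with signal 11\n"
        "(pid=31285, core=false)\n"
    )
    for crash in find_crashed_children(sample):
        print(crash)
```

Any entry with "core" set to False means no core file was written for
that crash, which is the situation here.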

If it still occurs with 1.0.10, add "ulimit -c unlimited" to the
openais init script so that a core file is produced (the log above shows
core=false) and we can figure out where and why it crashes.
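
After restarting the cluster stack you may want to confirm that the new
limit actually reached the running daemons. A minimal sketch, assuming a
Linux kernel that exposes /proc/<pid>/limits; the helper names
core_soft_limit and read_core_soft_limit are hypothetical, and the pid
has to be looked up on your own system.

```python
def core_soft_limit(limits_text):
    """Return the soft 'Max core file size' value from /proc/<pid>/limits text.

    The value is returned as a string, e.g. "0" or "unlimited", or None
    if the line is not present.
    """
    label = "Max core file size"
    for line in limits_text.splitlines():
        if line.startswith(label):
            # Remaining columns are: soft limit, hard limit, units.
            fields = line[len(label):].split()
            if fields:
                return fields[0]
    return None


def read_core_soft_limit(pid):
    """Read the soft core file size limit of a running process (Linux only)."""
    with open("/proc/%d/limits" % pid) as handle:
        return core_soft_limit(handle.read())
```

Calling read_core_soft_limit() with the pid of the corosync process
should return "unlimited" once the ulimit line has taken effect; "0"
means core files are still disabled for it and, through inheritance,
for the children it spawns.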
