[Pacemaker] Reason for cluster resource migration

Tue Feb 12 04:01:56 UTC 2013

On Tue, Feb 12, 2013 at 1:40 PM, Andrew Martin <amartin at xes-inc.com> wrote:
> Hello,
>
> Unfortunately this same failure occurred again tonight,

It might be the same effect, but there was no indication that the PE
died last time.

> taking down a production cluster. Here is the part of the log where pengine died:
> Feb 11 17:05:15 storage0 pacemakerd[1572]:   notice: pcmk_child_exit: Child process pengine terminated with signal 6 (pid=19357, core=128)
> Feb 11 17:05:16 storage0 pacemakerd[1572]:   notice: pcmk_child_exit: Respawning failed child process: pengine
> Feb 11 17:05:16 storage0 pengine[12660]:   notice: crm_add_logfile: Additional logging available in /var/log/corosync.log
> Feb 11 17:05:16 storage0 crmd[19358]:    error: crm_ipc_read: Connection to pengine failed
> Feb 11 17:05:16 storage0 crmd[19358]:    error: mainloop_gio_callback: Connection to pengine[0x891680] closed (I/O condition=25)
> Feb 11 17:05:16 storage0 crmd[19358]:     crit: pe_ipc_destroy: Connection to the Policy Engine failed (pid=-1, uuid=c9aef461-386c-4e4f-b509-0c9c8d80409b)
> Feb 11 17:05:16 storage0 crmd[19358]:   notice: save_cib_contents: Saved CIB contents after PE crash to /var/lib/pacemaker/pengine/pe-core-c9aef461-386c-4e4f-b509-0c9c8d80409b.bz2
> Feb 11 17:05:16 storage0 crmd[19358]:    error: do_log: FSA: Input I_ERROR from save_cib_contents() received in state S_POLICY_ENGINE
> Feb 11 17:05:16 storage0 crmd[19358]:  warning: do_state_transition: State transition S_POLICY_ENGINE -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=save_cib_contents ]
> Feb 11 17:05:16 storage0 crmd[19358]:    error: do_recover: Action A_RECOVER (0000000001000000) not supported
> Feb 11 17:05:16 storage0 crmd[19358]:  warning: do_election_vote: Not voting in election, we're in state S_RECOVERY
> Feb 11 17:05:16 storage0 crmd[19358]:    error: do_log: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
> Feb 11 17:05:16 storage0 crmd[19358]:   notice: terminate_cs_connection: Disconnecting from Corosync
> Feb 11 17:05:16 storage0 crmd[19358]:    error: do_exit: Could not recover from internal error
>
> The rest of the log:
> http://sources.xes-inc.com/downloads/pengine.log
> Looking through the full log, it seems that pengine recovers,

Right, pacemakerd watches for this and restarts it.

> but perhaps not quickly enough to prevent the STONITH and resource migration?

Highly likely.
However, the PE crashing is quite serious.  I'd like to get to the
bottom of that ASAP.

>
> Here is the pe-core dump file mentioned in the log:
> http://sources.xes-inc.com/downloads/pe-core.bz2

Unfortunately, core files are specific to the machine that generated them.
If you create a crm_report covering that time period, it will open the
core file and record a backtrace for us to look at.

Also very important are the contents of:
   /var/lib/pacemaker/pengine/pe-core-c9aef461-386c-4e4f-b509-0c9c8d80409b.bz2
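
As a rough illustration of how to pick the time window for that report,
here is a minimal Python sketch. It scans syslog lines for the
pcmk_child_exit message shown in the log above and builds a crm_report
command around the crash time. The helper names (find_child_crashes,
report_command), the 30-minute margin, and the destination name
"pengine-crash" are hypothetical choices made for this example. The year
has to be supplied by hand because these syslog timestamps do not carry
one.

```python
import re
from datetime import datetime, timedelta

# Matches pacemakerd's report of a child that died from a signal,
# in the format seen in the log excerpt above.
CHILD_EXIT = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+"
    r"pacemakerd\[\d+\]:\s+\w+:\s+pcmk_child_exit:\s+"
    r"Child process (?P<child>\S+) terminated with signal "
    r"(?P<signal>\d+) \(pid=(?P<pid>\d+)"
)


def find_child_crashes(lines, year):
    """Return one dict per 'terminated with signal' line (hypothetical helper).

    `year` is an assumption supplied by the caller, since the syslog
    timestamp format does not include it.
    """
    crashes = []
    for line in lines:
        m = CHILD_EXIT.match(line)
        if m is None:
            continue
        when = datetime.strptime(f"{year} {m['stamp']}", "%Y %b %d %H:%M:%S")
        crashes.append({
            "when": when,
            "host": m["host"],
            "child": m["child"],
            "signal": int(m["signal"]),  # 6 is SIGABRT on Linux
            "pid": int(m["pid"]),
        })
    return crashes


def report_command(when, margin_minutes=30, dest="pengine-crash"):
    """Build a crm_report command line covering `when` +/- the margin.

    Only the -f (from) and -t (to) options are used; the margin and the
    destination name are arbitrary values chosen for this sketch.
    """
    fmt = "%Y-%m-%d %H:%M:%S"
    start = when - timedelta(minutes=margin_minutes)
    end = when + timedelta(minutes=margin_minutes)
    return f'crm_report -f "{start.strftime(fmt)}" -t "{end.strftime(fmt)}" {dest}'


if __name__ == "__main__":
    with open("/var/log/syslog") as log:  # log path is an assumption
        for crash in find_child_crashes(log, year=2013):
            print(crash["child"], "on", crash["host"], "->",
                  report_command(crash["when"]))
```

The script only prints the command; run it by hand on the node that
produced the core file so the backtrace can be resolved there.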

>
> Thanks,
>
> Andrew
>
>
>
>
> ----- Original Message -----
>> From: "Andrew Martin" <amartin at xes-inc.com>
>> To: "The Pacemaker cluster resource manager" <pacemaker at oss.clusterlabs.org>
>> Sent: Friday, February 1, 2013 4:32:26 PM
>> Subject: Re: [Pacemaker] Reason for cluster resource migration
>>
>> ----- Original Message -----
>> > From: "Andrew Beekhof" <andrew at beekhof.net>
>> > To: "The Pacemaker cluster resource manager"
>> > <pacemaker at oss.clusterlabs.org>
>> > Sent: Thursday, December 6, 2012 8:36:27 PM
>> > Subject: Re: [Pacemaker] Reason for cluster resource migration
>> >
>> > On Wed, Dec 5, 2012 at 8:29 AM, Andrew Martin <amartin at xes-inc.com>
>> > wrote:
>> > > Hello,
>> > >
>> > > I am running a 3-node Pacemaker cluster (2 "real" nodes and 1
>> > > quorum node in
>> > > standby) on Ubuntu 12.04 server (amd64) with Pacemaker 1.1.8 and
>> > > Corosync
>> > > 2.1.0. My cluster configuration is:
>> > > http://pastebin.com/6TPkWtbt
>> > >
>> > > Recently, pengine died on storage0 (where the resources were
>> > > running) which
>> > > also happened to be the DC at the time. Consequently, Pacemaker
>> > > went into
>> > > recovery mode and released its role as DC, at which point
>> > > storage1
>> > > took over
>> > > the DC role and migrated the resources away from storage0 and
>> > > onto
>> > > storage1.
>> > > Looking through the logs, it seems like storage0 came back into
>> > > the
>> > > cluster
>> > > before the migration of the resources began:
>> > > Dec 03 08:31:20 [3165] storage1       crmd:     info:
>> > > peer_update_callback:
>> > > Client storage0/peer now has status [online] (DC=true)
>> > > ...
>> > > Dec 03 08:31:20 [3164] storage1    pengine:   notice: LogActions:
>> > > Start   rscXXX    (storage1)
>> > >
>> > > Thus, why did the migration occur, rather than aborting and
>> > > having
>> > > the
>> > > resources simply remain running on storage0? Here are the logs
>> > > from
>> > > each of
>> > > the nodes:
>> > > storage0: http://pastebin.com/ZqqnH9uf
>> > > storage1: http://pastebin.com/rvSLVcZs
>> >
>> > Hmm, that's an interesting one.
>> > Can you provide this file?  It will hold the answer:
>> >
>> > Dec 03 08:31:31 [3164] storage1    pengine:   notice:
>> > process_pe_message:         Calculated Transition 1:
>> > /var/lib/pacemaker/pengine/pe-input-28.bz2
>> >
>> >
>> > >
>> > > Thanks,
>> > >
>> > > Andrew
>> > >
>> > > _______________________________________________
>> > > Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>> > > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>> > >
>> > > Project Home: http://www.clusterlabs.org
>> > > Getting started:
>> > > http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> > > Bugs: http://bugs.clusterlabs.org
>> > >
>> >
>>
>> Andrew,
>>
>> Sorry for the delayed response. Here is the file you requested:
>> http://sources.xes-inc.com/downloads/pe-input-28.bz2
>>
>> This same condition just occurred again on storage1 today (pengine
>> died, and then storage1 was STONITHed).
>>
>> Thanks,
>>
>> Andrew
>>