[Pacemaker] Reason for cluster resource migration

Tue Feb 12 15:04:46 UTC 2013

----- Original Message -----
> From: "Andrew Beekhof" <andrew at beekhof.net>
> To: "The Pacemaker cluster resource manager" <pacemaker at oss.clusterlabs.org>
> Sent: Monday, February 11, 2013 10:11:53 PM
> Subject: Re: [Pacemaker] Reason for cluster resource migration
> 
> On Tue, Feb 12, 2013 at 3:07 PM, Andrew Beekhof <andrew at beekhof.net>
> wrote:
> > On Tue, Feb 12, 2013 at 3:01 PM, Andrew Beekhof
> > <andrew at beekhof.net> wrote:
> >> On Tue, Feb 12, 2013 at 1:40 PM, Andrew Martin
> >> <amartin at xes-inc.com> wrote:
> >>> Hello,
> >>>
> >>> Unfortunately this same failure occurred again tonight,
> >>
> >> It might be the same effect, but there was no indication that the
> >> PE died last time.
> >>
> >>> taking down a production cluster. Here is the part of the log
> >>> where pengine died:
> >>> Feb 11 17:05:15 storage0 pacemakerd[1572]:   notice:
> >>> pcmk_child_exit: Child process pengine terminated with signal 6
> >>> (pid=19357, core=128)
> >>> Feb 11 17:05:16 storage0 pacemakerd[1572]:   notice:
> >>> pcmk_child_exit: Respawning failed child process: pengine
> >>> Feb 11 17:05:16 storage0 pengine[12660]:   notice:
> >>> crm_add_logfile: Additional logging available in
> >>> /var/log/corosync.log
> >>> Feb 11 17:05:16 storage0 crmd[19358]:    error: crm_ipc_read:
> >>> Connection to pengine failed
> >>> Feb 11 17:05:16 storage0 crmd[19358]:    error:
> >>> mainloop_gio_callback: Connection to pengine[0x891680] closed
> >>> (I/O condition=25)
> >>> Feb 11 17:05:16 storage0 crmd[19358]:     crit: pe_ipc_destroy:
> >>> Connection to the Policy Engine failed (pid=-1,
> >>> uuid=c9aef461-386c-4e4f-b509-0c9c8d80409b)
> >>> Feb 11 17:05:16 storage0 crmd[19358]:   notice:
> >>> save_cib_contents: Saved CIB contents after PE crash to
> >>> /var/lib/pacemaker/pengine/pe-core-c9aef461-386c-4e4f-b509-0c9c8d80409b.bz2
> >>> Feb 11 17:05:16 storage0 crmd[19358]:    error: do_log: FSA:
> >>> Input I_ERROR from save_cib_contents() received in state
> >>> S_POLICY_ENGINE
> >>> Feb 11 17:05:16 storage0 crmd[19358]:  warning:
> >>> do_state_transition: State transition S_POLICY_ENGINE ->
> >>> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL
> >>> origin=save_cib_contents ]
> >>> Feb 11 17:05:16 storage0 crmd[19358]:    error: do_recover:
> >>> Action A_RECOVER (0000000001000000) not supported
> >>> Feb 11 17:05:16 storage0 crmd[19358]:  warning: do_election_vote:
> >>> Not voting in election, we're in state S_RECOVERY
> >>> Feb 11 17:05:16 storage0 crmd[19358]:    error: do_log: FSA:
> >>> Input I_TERMINATE from do_recover() received in state S_RECOVERY
> >>> Feb 11 17:05:16 storage0 crmd[19358]:   notice:
> >>> terminate_cs_connection: Disconnecting from Corosync
> >>> Feb 11 17:05:16 storage0 crmd[19358]:    error: do_exit: Could
> >>> not recover from internal error
> >>>
> >>> The rest of the log:
> >>> http://sources.xes-inc.com/downloads/pengine.log
> >>> Looking through the full log, it seems that pengine recovers,
> >>
> >> Right, pacemakerd watches for this and restarts it.
> >>
> >>> but perhaps not quickly enough to prevent the STONITH and
> >>> resource migration?
> >>
> >> Highly likely.
> >> However the PE crashing is quite serious.  I'd like to get to the
> >> bottom of that ASAP.
> >>
> >>>
> >>> Here is the pe-core dump file mentioned in the log:
> >>> http://sources.xes-inc.com/downloads/pe-core.bz2
> >>
> >> Unfortunately core files are specific to the machine that
> >> generated them.
> >> If you create a crm_report covering that time period, it will open
> >> the core file and record a backtrace for us to look at.
> >>
> >> Also very important is the contents of:
> >>    /var/lib/pacemaker/pengine/pe-core-c9aef461-386c-4e4f-b509-0c9c8d80409b.bz2
> >
> > Ohhh, that's what the pe-core link was.
> > I've run it through crm_simulate but couldn't reproduce the crash.
> >
> > So we'll still need the crm_report, it will have more detail on the
> > "Child process pengine terminated with signal 6 (pid=19357,
> > core=128)"
> > part.
> 
> Signal 6 is SIGABRT, which usually means an assertion failure, but
> strangely there is no mention of one in syslog.
> Can you grep /var/log/corosync.log for lines containing 19357 please?
> 
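As a side note on the "terminated with signal 6" message quoted above, here is a minimal Python sketch that pulls the process name, PID, and signal number out of such a line and maps the number to a name. The regular expression is an assumption based only on the single pcmk_child_exit message shown in this thread, and the helper name describe_child_exit is made up for illustration; the number-to-name mapping comes from the platform running the script (on Linux, 6 is SIGABRT).

```python
import re
import signal

# Pattern modelled on the one pacemakerd message quoted in this thread;
# other Pacemaker versions may word it differently (assumption).
CHILD_EXIT = re.compile(
    r"Child process (?P<name>\S+) terminated with signal (?P<sig>\d+) "
    r"\(pid=(?P<pid>\d+), core=(?P<core>\d+)\)"
)


def describe_child_exit(message):
    """Return (process name, pid, signal name), or None if there is no match."""
    match = CHILD_EXIT.search(message)
    if match is None:
        return None
    signum = int(match.group("sig"))
    try:
        # Signal names are platform specific; this uses the local mapping.
        signame = signal.Signals(signum).name
    except ValueError:
        signame = f"signal {signum}"
    return match.group("name"), int(match.group("pid")), signame


msg = (
    "Feb 11 17:05:15 storage0 pacemakerd[1572]:   notice: pcmk_child_exit: "
    "Child process pengine terminated with signal 6 (pid=19357, core=128)"
)
print(describe_child_exit(msg))
```
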
Andrew,

Thanks for the help. Here are the lines containing 19357:
http://sources.xes-inc.com/downloads/19357.log
cl_sysadmin_notify is a clone of an ocf:heartbeat:MailTo resource. Postfix
is installed and running, so I am not sure why these failures are occurring.
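
For anyone repeating the "lines containing 19357" extraction on their own logs, a rough Python equivalent of the grep is sketched below. The function names are hypothetical, and the only deliberate difference from a plain grep is that the PID must not be embedded in a longer number (so 193570 would not match). The two sample lines are taken from the log excerpt earlier in this thread.

```python
import re


def lines_mentioning(lines, pid):
    """Return the lines that mention pid as a standalone number."""
    # Reject matches where the pid is part of a longer run of digits.
    pattern = re.compile(rf"(?<!\d){int(pid)}(?!\d)")
    return [line for line in lines if pattern.search(line)]


def grep_pid(path, pid):
    """Apply lines_mentioning() to a log file, tolerating odd bytes."""
    with open(path, errors="replace") as handle:
        return lines_mentioning(handle, pid)


sample = [
    "Feb 11 17:05:15 storage0 pacemakerd[1572]:   notice: pcmk_child_exit: "
    "Child process pengine terminated with signal 6 (pid=19357, core=128)",
    "Feb 11 17:05:16 storage0 crmd[19358]:    error: crm_ipc_read: "
    "Connection to pengine failed",
]
for line in lines_mentioning(sample, 19357):
    print(line)
```

On the server itself this would be called as grep_pid("/var/log/corosync.log", 19357).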

> > The core file will likely be somewhere under
> > /var/lib/pacemaker/cores
That directory doesn't exist on this server, and the core file doesn't appear to be in /var/crash either:
# ls /var/crash/ -ltr
total 67548
-rw-r----- 1 hacluster whoopsie  1293711 Feb  6 10:01 _usr_libexec_pacemaker_pengine.110.crash
---------- 1 root      whoopsie 67874816 Feb 11 17:07 _usr_libexec_pacemaker_lrmd.0.crash
In case they would be helpful, here are those two files:
http://sources.xes-inc.com/downloads/_usr_libexec_pacemaker_pengine.110.crash
http://sources.xes-inc.com/downloads/_usr_libexec_pacemaker_lrmd.0.crash
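
A small Python sketch of the same search, in case it helps others look for crash files: it walks a list of candidate directories, skips any that do not exist, and returns regular files oldest first, roughly like the ls -ltr listing above. The two directories are the ones mentioned in this thread; the function name crash_files is hypothetical, and it will likely need to run as root to read these locations.

```python
from pathlib import Path


def crash_files(directories):
    """Return (mtime, size, path) tuples for regular files, oldest first."""
    found = []
    for directory in directories:
        base = Path(directory)
        if not base.is_dir():
            # e.g. /var/lib/pacemaker/cores is absent on the server in this thread
            continue
        for entry in base.iterdir():
            if entry.is_file():
                info = entry.stat()
                found.append((info.st_mtime, info.st_size, entry))
    # Sort by modification time only, mirroring `ls -ltr`.
    return sorted(found, key=lambda item: item[0])


if __name__ == "__main__":
    for mtime, size, path in crash_files(["/var/lib/pacemaker/cores", "/var/crash"]):
        print(f"{size:>10} {path}")
```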

Here is the crm_report from storage0 from this time period:
http://sources.xes-inc.com/downloads/pengine-report.tar.bz2

Thanks,

Andrew

> > but crm_report should be able to find it.
> >
> >>
> >>>
> >>> Thanks,
> >>>
> >>> Andrew
> >>>
> >>>
> >>>
> >>>
> >>> ----- Original Message -----
> >>>> From: "Andrew Martin" <amartin at xes-inc.com>
> >>>> To: "The Pacemaker cluster resource manager"
> >>>> <pacemaker at oss.clusterlabs.org>
> >>>> Sent: Friday, February 1, 2013 4:32:26 PM
> >>>> Subject: Re: [Pacemaker] Reason for cluster resource migration
> >>>>
> >>>> ----- Original Message -----
> >>>> > From: "Andrew Beekhof" <andrew at beekhof.net>
> >>>> > To: "The Pacemaker cluster resource manager"
> >>>> > <pacemaker at oss.clusterlabs.org>
> >>>> > Sent: Thursday, December 6, 2012 8:36:27 PM
> >>>> > Subject: Re: [Pacemaker] Reason for cluster resource migration
> >>>> >
> >>>> > On Wed, Dec 5, 2012 at 8:29 AM, Andrew Martin
> >>>> > <amartin at xes-inc.com>
> >>>> > wrote:
> >>>> > > Hello,
> >>>> > >
> >>>> > > I am running a 3-node Pacemaker cluster (2 "real" nodes and 1
> >>>> > > quorum node in standby) on Ubuntu 12.04 server (amd64) with
> >>>> > > Pacemaker 1.1.8 and Corosync 2.1.0. My cluster configuration is:
> >>>> > > http://pastebin.com/6TPkWtbt
> >>>> > >
> >>>> > > Recently, pengine died on storage0 (where the resources were
> >>>> > > running), which also happened to be the DC at the time.
> >>>> > > Consequently, Pacemaker went into recovery mode and released its
> >>>> > > role as DC, at which point storage1 took over the DC role and
> >>>> > > migrated the resources away from storage0 and onto storage1.
> >>>> > > Looking through the logs, it seems like storage0 came back into
> >>>> > > the cluster before the migration of the resources began:
> >>>> > > Dec 03 08:31:20 [3165] storage1       crmd:     info:
> >>>> > > peer_update_callback:
> >>>> > > Client storage0/peer now has status [online] (DC=true)
> >>>> > > ...
> >>>> > > Dec 03 08:31:20 [3164] storage1    pengine:   notice:
> >>>> > > LogActions:
> >>>> > > Start   rscXXX    (storage1)
> >>>> > >
> >>>> > > Thus, why did the migration occur, rather than aborting and
> >>>> > > having the resources simply remain running on storage0? Here are
> >>>> > > the logs from each of the nodes:
> >>>> > > storage0: http://pastebin.com/ZqqnH9uf
> >>>> > > storage1: http://pastebin.com/rvSLVcZs
> >>>> >
> >>>> > Hmm, that's an interesting one.
> >>>> > Can you provide this file?  It will hold the answer:
> >>>> >
> >>>> > Dec 03 08:31:31 [3164] storage1    pengine:   notice:
> >>>> > process_pe_message:         Calculated Transition 1:
> >>>> > /var/lib/pacemaker/pengine/pe-input-28.bz2
> >>>> >
> >>>> >
> >>>> > >
> >>>> > > Thanks,
> >>>> > >
> >>>> > > Andrew
> >>>> > >
> >>>> > > _______________________________________________
> >>>> > > Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> >>>> > > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> >>>> > >
> >>>> > > Project Home: http://www.clusterlabs.org
> >>>> > > Getting started:
> >>>> > > http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> >>>> > > Bugs: http://bugs.clusterlabs.org
> >>>> > >
> >>>> >
> >>>>
> >>>> Andrew,
> >>>>
> >>>> Sorry for the delayed response. Here is the file you requested:
> >>>> http://sources.xes-inc.com/downloads/pe-input-28.bz2
> >>>>
> >>>> This same condition just occurred again on storage1 today (pengine
> >>>> died, and then storage1 was STONITHed).
> >>>>
> >>>> Thanks,
> >>>>
> >>>> Andrew
> >>>>