[Pacemaker] Reason for cluster resource migration

Wed Feb 13 17:28:18 UTC 2013

----- Original Message -----
> From: "Andrew Beekhof" <andrew at beekhof.net>
> To: "The Pacemaker cluster resource manager" <pacemaker at oss.clusterlabs.org>
> Sent: Tuesday, February 12, 2013 10:52:23 PM
> Subject: Re: [Pacemaker] Reason for cluster resource migration
> 
> On Wed, Feb 13, 2013 at 2:04 AM, Andrew Martin <amartin at xes-inc.com>
> wrote:
> > ----- Original Message -----
> >> From: "Andrew Beekhof" <andrew at beekhof.net>
> >> To: "The Pacemaker cluster resource manager"
> >> <pacemaker at oss.clusterlabs.org>
> >> Sent: Monday, February 11, 2013 10:11:53 PM
> >> Subject: Re: [Pacemaker] Reason for cluster resource migration
> >>
> >> On Tue, Feb 12, 2013 at 3:07 PM, Andrew Beekhof
> >> <andrew at beekhof.net>
> >> wrote:
> >> > On Tue, Feb 12, 2013 at 3:01 PM, Andrew Beekhof
> >> > <andrew at beekhof.net> wrote:
> >> >> On Tue, Feb 12, 2013 at 1:40 PM, Andrew Martin
> >> >> <amartin at xes-inc.com> wrote:
> >> >>> Hello,
> >> >>>
> >> >>> Unfortunately this same failure occurred again tonight,
> >> >>
> >> >> It might be the same effect, but there was no indication that
> >> >> the PE died last time.
> >> >>
> >> >>> taking down a production cluster. Here is the part of the log
> >> >>> where pengine died:
> >> >>> Feb 11 17:05:15 storage0 pacemakerd[1572]:   notice:
> >> >>> pcmk_child_exit: Child process pengine terminated with signal
> >> >>> 6
> >> >>> (pid=19357, core=128)
> >> >>> Feb 11 17:05:16 storage0 pacemakerd[1572]:   notice:
> >> >>> pcmk_child_exit: Respawning failed child process: pengine
> >> >>> Feb 11 17:05:16 storage0 pengine[12660]:   notice:
> >> >>> crm_add_logfile: Additional logging available in
> >> >>> /var/log/corosync.log
> >> >>> Feb 11 17:05:16 storage0 crmd[19358]:    error: crm_ipc_read:
> >> >>> Connection to pengine failed
> >> >>> Feb 11 17:05:16 storage0 crmd[19358]:    error:
> >> >>> mainloop_gio_callback: Connection to pengine[0x891680] closed
> >> >>> (I/O condition=25)
> >> >>> Feb 11 17:05:16 storage0 crmd[19358]:     crit:
> >> >>> pe_ipc_destroy:
> >> >>> Connection to the Policy Engine failed (pid=-1,
> >> >>> uuid=c9aef461-386c-4e4f-b509-0c9c8d80409b)
> >> >>> Feb 11 17:05:16 storage0 crmd[19358]:   notice:
> >> >>> save_cib_contents: Saved CIB contents after PE crash to
> >> >>> /var/lib/pacemaker/pengine/pe-core-c9aef461-386c-4e4f-b509-0c9c8d80409b.bz2
> >> >>> Feb 11 17:05:16 storage0 crmd[19358]:    error: do_log: FSA:
> >> >>> Input I_ERROR from save_cib_contents() received in state
> >> >>> S_POLICY_ENGINE
> >> >>> Feb 11 17:05:16 storage0 crmd[19358]:  warning:
> >> >>> do_state_transition: State transition S_POLICY_ENGINE ->
> >> >>> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL
> >> >>> origin=save_cib_contents ]
> >> >>> Feb 11 17:05:16 storage0 crmd[19358]:    error: do_recover:
> >> >>> Action A_RECOVER (0000000001000000) not supported
> >> >>> Feb 11 17:05:16 storage0 crmd[19358]:  warning:
> >> >>> do_election_vote:
> >> >>> Not voting in election, we're in state S_RECOVERY
> >> >>> Feb 11 17:05:16 storage0 crmd[19358]:    error: do_log: FSA:
> >> >>> Input I_TERMINATE from do_recover() received in state
> >> >>> S_RECOVERY
> >> >>> Feb 11 17:05:16 storage0 crmd[19358]:   notice:
> >> >>> terminate_cs_connection: Disconnecting from Corosync
> >> >>> Feb 11 17:05:16 storage0 crmd[19358]:    error: do_exit: Could
> >> >>> not recover from internal error
> >> >>>
> >> >>> The rest of the log:
> >> >>> http://sources.xes-inc.com/downloads/pengine.log
> >> >>> Looking through the full log, it seems that pengine recovers,
> >> >>
> >> >> Right, pacemakerd watches for this and restarts it.
> >> >>
> >> >>> but perhaps not quickly enough to prevent the STONITH and
> >> >>> resource migration?
> >> >>
> >> >> Highly likely.
> >> >> However the PE crashing is quite serious.  I'd like to get to
> >> >> the bottom of that ASAP.
> >> >>
> >> >>>
> >> >>> Here is the pe-core dump file mentioned in the log:
> >> >>> http://sources.xes-inc.com/downloads/pe-core.bz2
> >> >>
> >> >> Unfortunately core files are specific to the machine that
> >> >> generated them.
> >> >> If you create a crm_report for about that time, it will open it
> >> >> and record a backtrace for us to look at.
> >> >>
> >> >> Also very important is the contents of:
> >> >>    /var/lib/pacemaker/pengine/pe-core-c9aef461-386c-4e4f-b509-0c9c8d80409b.bz2
> >> >
> >> > Ohhh, that's what the pe-core link was.
> >> > I've run it through crm_simulate but couldn't reproduce the
> >> > crash.
> >> >
> >> > So we'll still need the crm_report, it will have more detail on
> >> > the "Child process pengine terminated with signal 6 (pid=19357,
> >> > core=128)" part.
> >>
> >> Signal 6 is an assertion failure, but strangely there is no
> >> mention of one in syslog.
> >> Can you grep /var/log/corosync.log for lines containing 19357
> >> please?
> >>
> > Andrew,
> >
> > Thanks for the help. Here are the lines containing 19357:
> > http://sources.xes-inc.com/downloads/19357.log
> > cl_sysadmin_notify is a clone of an ocf:heartbeat:MailTo resource.
> > Postfix is installed and running, so I am not sure why these
> > failures are occurring.
> >
> >> > The core file will likely be somewhere under
> >> > /var/lib/pacemaker/cores
> > That directory doesn't exist on this server, and it doesn't appear
> > to be in /var/crash either:
> 
> It looks like /var/lib/heartbeat/cores/ on your system.
> 
> > # ls /var/crash/ -ltr
> > total 67548
> > -rw-r----- 1 hacluster whoopsie  1293711 Feb  6 10:01
> > _usr_libexec_pacemaker_pengine.110.crash
> > ---------- 1 root      whoopsie 67874816 Feb 11 17:07
> > _usr_libexec_pacemaker_lrmd.0.crash
> > In case they would be helpful, here are those two files:
> > http://sources.xes-inc.com/downloads/_usr_libexec_pacemaker_pengine.110.crash
> > http://sources.xes-inc.com/downloads/_usr_libexec_pacemaker_lrmd.0.crash
> >
> > Here is the crm_report from storage0 from this time period:
> > http://sources.xes-inc.com/downloads/pengine-report.tar.bz2
> 
> Are you sure?
> The pengine crashed on "Feb 11 17:05:15" but the report appears to be
> from "Tue Feb 12 09:59:50 EST 2013" to "Tue Feb 12 10:30:10 EST 2013"
> 
> There was one crash in there, but it was of the lrmd.
> Unfortunately it looks like the binaries and libraries have been
> stripped.
> 
> Where did you get them from?  Do you know how to install the -debug
> packages?

Andrew,

I ran crm_report again as follows:
# crm_report -f "2013-02-11 17:00:00" -t "2013-02-11 17:30:00" \
-n "storage0 storage1 storagequorum" -C /tmp/report
...
storage0:   Collecting data from  storage0 storage1 storagequorum (02/11/2013 05:00:00 PM to 02/11/2013 05:30:00 PM)
...
storage1:   Found core file: -rw-r----- 1 root root 18485248 Feb 11 17:10 /var/lib/heartbeat/cores/root/core.7678

Here is the report it generated:
http://sources.xes-inc.com/downloads/storage-report.bz2
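
Since the earlier report covered the wrong time period, here is a minimal
sketch of a helper that derives the -f/-t arguments from the crash
timestamp in the log. The function name report_window is hypothetical
(it is not part of Pacemaker), and it assumes GNU date and that the
timestamp is in the node's local time zone.

```shell
#!/bin/sh
# Print crm_report -f/-t arguments for a window around a crash time.
# Usage: report_window "YYYY-MM-DD HH:MM:SS" minutes_before minutes_after
# Assumption: GNU date (needs -d and @epoch support); local time zone.
report_window() {
    # Convert the crash time to seconds since the epoch.
    crash_epoch=$(date -d "$1" +%s) || return 1
    from_epoch=$((crash_epoch - $2 * 60))
    to_epoch=$((crash_epoch + $3 * 60))
    fmt='+%Y-%m-%d %H:%M:%S'
    # Emit the two options in the quoting style crm_report was given above.
    printf '%s "%s" %s "%s"\n' \
        -f "$(date -d "@$from_epoch" "$fmt")" \
        -t "$(date -d "@$to_epoch" "$fmt")"
}
```

For the crash at "Feb 11 17:05:15", calling
report_window "2013-02-11 17:05:15" 5 25 gives a window from 17:00:15 to
17:30:15, which can be pasted into the crm_report command line along
with the -n and -C options.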

I created these packages with checkinstall (using the normal Pacemaker
build process, but substituting checkinstall for "make install"). By
default it strips debugging information when generating the package,
which I thought was desirable for a production environment. I also
have a debug version of the package, which I will install now. I am
also working to build Ubuntu packages more officially using
dpkg-buildpackage. Is there a better way to create these packages? I
would prefer not to have to install build tools and compile the source
directly on production servers.
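
As a rough sketch of how to confirm whether the installed binaries
carry symbols, the following classifies the description printed by
"file -b". The function names symbol_state and check_binaries are
hypothetical, and the binary paths in the example are only inferred
from the crash file names in /var/crash, so they may differ on other
installs.

```shell
#!/bin/sh
# Classify the one-line description printed by `file -b` for a binary.
# Prints "not stripped", "stripped", or "unknown".
symbol_state() {
    case "$1" in
        # Check "not stripped" first, since it also contains "stripped".
        *"not stripped"*) echo "not stripped" ;;
        *stripped*)       echo "stripped" ;;
        *)                echo "unknown" ;;
    esac
}

# Report the symbol state of each binary given as an argument.
check_binaries() {
    for bin in "$@"; do
        printf '%s: %s\n' "$bin" "$(symbol_state "$(file -b "$bin")")"
    done
}

# Example (paths are an assumption, inferred from the crash file names):
#   check_binaries /usr/libexec/pacemaker/pengine /usr/libexec/pacemaker/lrmd
#
# For a dpkg-buildpackage build, stripping can usually be suppressed with
# the standard Debian build option, provided the packaging honours it:
#   DEB_BUILD_OPTIONS=nostrip dpkg-buildpackage -us -uc -b
```

Running check_binaries on both the regular and the debug package would
show whether the debug build really kept its symbols before it goes
onto the production nodes.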

Thanks,

Andrew
