[Pacemaker] Cluster crash

Hugo Deprez hugo.deprez at gmail.com
Thu Feb 23 10:17:19 UTC 2012


I don't think so has, I do have over similar cluster on the same network
and didn't have any issues.
The only thing I can detect was that the virtual machine was like
unresponsive.
But I think the VM crash was not like a power shutdown more like very slow
then totaly crash.

Even if the drbd-nagios resource timeout, it should failover on the other
node no ?

Regards,


On 20 February 2012 12:35, Andrew Beekhof <andrew at beekhof.net> wrote:

> On Mon, Feb 13, 2012 at 9:57 PM, Hugo Deprez <hugo.deprez at gmail.com>
> wrote:
> > Hello,
> >
> > does anyone have an idea ?
>
> Well I see:
>
> Feb  8 12:59:05 server01 crmd: [19470]: ERROR: process_lrm_event: LRM
> operation drbd-nagios:1_monitor_15000 (90) Timed Out (timeout=20000ms)
> Feb  8 13:00:05 server01 crmd: [19470]: WARN: cib_rsc_callback:
> Resource update 415 failed: (rc=-41) Remote node did not respond
> Feb  8 13:06:36 server01 crmd: [19470]: notice: ais_dispatch:
> Membership 128: quorum lost
>
> which looks suspicious.  Network problem?
>
> >
> > it seems that at 13:06:38 resources et started on slave member.
> > But then there is something wrong on server01 :
> >
> > Feb  8 13:06:39 server01 pengine: [19469]: info: determine_online_status:
> > Node server01 is online
> > Feb  8 13:06:39 server01 pengine: [19469]: notice: unpack_rsc_op:
> Operation
> > apache2_monitor_0 found resource apache2 active on server01
> > Feb  8 13:06:39 server01 pengine: [19469]: notice: group_print:  Resource
> > Group: supervision-grp
> > Feb  8 13:06:39 server01 pengine: [19469]: notice: native_print:
> > fs-data    (ocf::heartbeat:Filesystem):    Stopped
> > Feb  8 13:06:39 server01 pengine: [19469]: notice: native_print:
> > nagios-ip    (ocf::heartbeat:IPaddr2):    Stopped
> > Feb  8 13:06:39 server01 pengine: [19469]: notice: native_print:
> > apache2    (ocf::heartbeat:apache):    Started server01
> > Feb  8 13:06:39 server01 pengine: [19469]: notice: native_print:
> > nagios    (lsb:nagios3):    Stopped
> >
> >
> > But I don't understand what fails if this is DRBD or apache2 causes the
> > issue.
> >
> > Any idea ?
> >
> >
> >
> > On 10 February 2012 09:39, Hugo Deprez <hugo.deprez at gmail.com> wrote:
> >>
> >> Hello,
> >>
> >> please found attach to this mail the corosync logs.
> >> If you have any tips :)
> >>
> >>
> >>
> >> Regards,
> >>
> >> Hugo
> >>
> >>
> >> On 8 February 2012 15:39, Florian Haas <florian at hastexo.com> wrote:
> >>>
> >>> On Wed, Feb 8, 2012 at 2:29 PM, Hugo Deprez <hugo.deprez at gmail.com>
> >>> wrote:
> >>> > Dear community,
> >>> >
> >>> > I am currently running different corosync / drbd cluster using VM
> >>> > running on
> >>> > vmware esxi host.
> >>> > Guest Os are Debian Squeeze.
> >>> >
> >>> > the active member of the cluster just freeze the VM was unreachable.
> >>> > But the resources didn't achieved to move to the other node.
> >>> >
> >>> > My cluster has the following ressources :
> >>> >
> >>> > Resource Group: grp
> >>> >      fs-data    (ocf::heartbeat:Filesystem):
> >>> >      nagios-ip  (ocf::heartbeat:IPaddr2):
> >>> >      apache2    (ocf::heartbeat:apache):
> >>> >      nagios     (lsb:nagios3):
> >>> >      pnp        (lsb:npcd):
> >>> >
> >>> >
> >>> > I am currently troubleshooting this issue. I don't really know where
> to
> >>> > look. Of course I had a look at the logs, but it is pretty hard for
> me
> >>> > to
> >>> > understand what happen.
> >>>
> >>> It's pretty hard for anyone else to understand _without_ logs. :)
> >>>
> >>> > I noticed that the VM crash at 12:09 and that the cluster only try to
> >>> > move
> >>> > the ressources at  12:58, this does not make sens for me. Or maybe
> the
> >>> > host
> >>> > wasn't totaly down ?
> >>> >
> >>> > Do you have any idea how I can troubleshoot ?
> >>>
> >>> Log analysis is where I would start.
> >>>
> >>> > Last thing, I notice that If I start apache2 on the slave server,
> >>> > corosync
> >>> > didn't detect that the resource is started, could that be an issue ?
> >>>
> >>> Sure it could, but Pacemaker should happily recover from that.
> >>>
> >>> Cheers,
> >>> Florian
> >>>
> >>> --
> >>> Need help with High Availability?
> >>> http://www.hastexo.com/now
> >>>
> >>> _______________________________________________
> >>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> >>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> >>>
> >>> Project Home: http://www.clusterlabs.org
> >>> Getting started:
> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> >>> Bugs: http://bugs.clusterlabs.org
> >>
> >>
> >
> >
> > _______________________________________________
> > Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> >
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
> >
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.clusterlabs.org/pipermail/pacemaker/attachments/20120223/6ad6f5b4/attachment.htm>


More information about the Pacemaker mailing list