[Pacemaker] Help understanding why a failover occurred.

Mon Oct 18 07:32:30 UTC 2010

On 18 October 2010 04:03, Tim Serong <tserong at novell.com> wrote:

> On 10/16/2010 at 09:45 AM, Jai <awayatm at gmail.com> wrote:
> > I have set up a DRBD->Xen failover cluster. Last night at around 02:50 it
> > failed the resources over from server "bravo" to "alpha". I'm trying to
> > find out what caused the failover of resources. I don't see anything in
> > the logs that indicates the cause, but I don't really know what to look
> > for. If someone could help me understand these logs and what I'm looking
> > for, that would be great. I'm not even sure how far back I need to go.
>
> I reckon it's this:
>
> Oct 16 02:46:04 bravo attrd: [25098]: info: attrd_perform_update: Sent update 161: pingval=0
>
> Which suggests bravo lost connectivity to 12.12.12.1 around that time,
> causing the failover.
>
> For reference, if you're looking at pengine logs...  A few lines above where
> it says "info: process_pe_message: Transition NNN: PEngine Input stored in:
> /var/lib/pengine/pe-input-MMM.bz2", you'll see what it's about to do to your
> resources.  If this is just: "Leave resource FOO (Started/Master/Slave etc.)"
> that transition is probably boring.  If it says "Start FOO (...)" or
> "Promote/Demote/Stop FOO (...)", it means something has changed.  Scroll up
> a bit, to above where pengine is saying "unpack_config",
> "determine_node_status" etc. and you should see a message suggesting the
> cause for the change (failed op, timeout, ping attribute modified, etc.)  It
> might be a bit inscrutable sometimes, but it'll be there somewhere...
>
> HTH
>
>
These are very useful tips on understanding the logs.
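
For anyone who wants to automate the scan Tim describes, here is a rough
Python sketch. It assumes the log lines look like the messages quoted above;
the function names and regular expressions are my own (hypothetical), the
pengine lines in SAMPLE are made up for illustration, and only the attrd line
is taken from this thread. Real logs may need looser patterns.

```python
import re

# Patterns are assumptions based on the message formats quoted in this thread.
TRANSITION_RE = re.compile(
    r"process_pe_message: Transition (\d+): PEngine Input stored in: (\S+)"
)
ACTION_RE = re.compile(r"\b(Start|Stop|Promote|Demote)\s+(\S+)\s+\(")
PINGVAL_RE = re.compile(r"attrd_perform_update: Sent update \d+: pingval=(\d+)")


def changed_transitions(lines):
    """Return (transition, input_file, actions) for transitions that change a resource."""
    results = []
    pending = []  # actions seen since the previous "Transition NNN" line
    for line in lines:
        stored = TRANSITION_RE.search(line)
        if stored:
            if pending:  # skip transitions that only "Leave" resources alone
                results.append((int(stored.group(1)), stored.group(2), pending))
            pending = []
            continue
        if "Leave resource" in line:
            continue  # the "boring" case: nothing changes
        action = ACTION_RE.search(line)
        if action:
            pending.append((action.group(1), action.group(2)))
    return results


def lost_ping_updates(lines):
    """Return the attrd lines that report pingval=0 (no ping targets reachable)."""
    hits = []
    for line in lines:
        match = PINGVAL_RE.search(line)
        if match and match.group(1) == "0":
            hits.append(line)
    return hits


# Illustrative input: only the first line comes from the thread, the rest are invented.
SAMPLE = [
    "Oct 16 02:46:04 bravo attrd: [25098]: info: attrd_perform_update: Sent update 161: pingval=0",
    "Oct 16 02:40:00 bravo pengine: [1]: notice: Leave resource FOO (Started bravo)",
    "Oct 16 02:40:00 bravo pengine: [1]: info: process_pe_message: Transition 1: PEngine Input stored in: /var/lib/pengine/pe-input-1.bz2",
    "Oct 16 02:46:05 bravo pengine: [1]: notice: Stop FOO (bravo)",
    "Oct 16 02:46:05 bravo pengine: [1]: notice: Start FOO (alpha)",
    "Oct 16 02:46:05 bravo pengine: [1]: info: process_pe_message: Transition 2: PEngine Input stored in: /var/lib/pengine/pe-input-2.bz2",
]

if __name__ == "__main__":
    for number, input_file, actions in changed_transitions(SAMPLE):
        print(number, input_file, actions)
    for line in lost_ping_updates(SAMPLE):
        print(line)
```

To run it against a real log, pass the open file object instead of SAMPLE,
then read upwards from each reported transition to find the cause.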
Pavlos