[Pacemaker] Help understanding why a failover occurred.

Sun Oct 17 22:03:06 EDT 2010

On 10/16/2010 at 09:45 AM, Jai <awayatm at gmail.com> wrote: 
> I have set up a DRBD->Xen failover cluster. Last night at around 02:50 it
> failed the resources over from server "bravo" to "alpha". I'm trying to find
> out what caused the failover of resources. I don't see anything in the logs
> that indicates the cause, but I don't really know what to look for. If
> someone could help me understand these logs and what I'm looking for, that
> would be great. I'm not even sure how far back I need to go.

I reckon it's this:

Oct 16 02:46:04 bravo attrd: [25098]: info: attrd_perform_update: Sent update 161: pingval=0

That is, the ping attribute on bravo dropped to 0, which suggests bravo lost
connectivity to 12.12.12.1 around that time, causing the failover.
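
If you want to pull every such update out of a big log rather than eyeball
it, something like the following rough Python sketch might help.  The regex
is modelled only on the single attrd line quoted above, so treat the pattern
as an assumption (other versions may log this differently); the function
name find_attr_updates is made up for illustration.

```python
import re

# Pattern modelled on the one attrd line quoted in this thread; this is an
# assumption about the log format, not a guaranteed Pacemaker format.
ATTR_UPDATE_RE = re.compile(
    r"^(?P<when>\w{3}\s+\d+\s+[\d:]{8})\s+(?P<node>\S+)\s+attrd:.*"
    r"attrd_perform_update: Sent update \d+: (?P<attr>\w+)=(?P<value>\S+)"
)


def find_attr_updates(lines, attr="pingval"):
    """Return (timestamp, node, value) for each logged update of `attr`."""
    hits = []
    for line in lines:
        m = ATTR_UPDATE_RE.search(line)
        if m and m.group("attr") == attr:
            hits.append((m.group("when"), m.group("node"), m.group("value")))
    return hits
```

Feed it the lines of your syslog file (e.g. from open(path)) and look for
entries where the value is "0" shortly before the failover time.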

For reference, if you're looking at pengine logs: a few lines above where
it says "info: process_pe_message: Transition NNN: PEngine Input stored in:
/var/lib/pengine/pe-input-MMM.bz2", you'll see what it's about to do to your
resources.  If this is just "Leave resource FOO (Started/Master/Slave etc.)",
that transition is probably boring.  If it says "Start FOO (...)" or
"Promote/Demote/Stop FOO (...)", something has changed.  Scroll up a bit,
to above where pengine is saying "unpack_config", "determine_node_status"
etc., and you should see a message suggesting the cause of the change (failed
op, timeout, ping attribute modified, etc.).  It might be a bit inscrutable
sometimes, but it'll be there somewhere...
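
The same hunt can be roughly automated.  The Python sketch below collects
the non-"Leave" actions logged before each "PEngine Input stored in" line
and reports only the transitions that have any.  It assumes that pengine
lines contain the string "pengine" and that action lines look like
"Start FOO (...)" as described above; the function name and the exact
patterns are illustrative guesses, so adjust them to what your logs
actually contain.

```python
import re

# Assumed shape of an action line: verb, resource name, then "(".
ACTION_RE = re.compile(r"\b(Start|Stop|Promote|Demote)\s+(\S+)\s+\(")
# Taken from the process_pe_message text quoted above.
TRANSITION_RE = re.compile(
    r"process_pe_message: Transition (\d+): PEngine Input stored in: (\S+)"
)


def interesting_transitions(lines):
    """Return (transition number, input file, actions) for each transition
    that was preceded by at least one Start/Stop/Promote/Demote line."""
    results = []
    pending = []  # actions seen since the previous transition line
    for line in lines:
        if "pengine" not in line:  # assumption: daemon name is on the line
            continue
        t = TRANSITION_RE.search(line)
        if t:
            if pending:
                results.append((int(t.group(1)), t.group(2), pending))
            pending = []
            continue
        a = ACTION_RE.search(line)
        if a and "Leave" not in line:
            pending.append((a.group(1), a.group(2)))
    return results
```

Once it points you at a transition, you still need to scroll up from there
in the log to find the message that explains the cause.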

HTH

Tim

-- 
Tim Serong <tserong at novell.com>
Senior Clustering Engineer, OPS Engineering, Novell Inc.