[Pacemaker] Recovery after simple master-master failover

David Gubler dg at doodle.com
Thu Feb 23 17:08:29 CET 2012


Hi Jake,

Thanks for your answer. I had another go today.

On 22.02.2012 00:09, Jake Smith wrote:
> Still probably not the nicest/cleanest solution but you could do a cronjob that runs 'crm resource reprobe node_name'.  That will check for resources the cluster didn't start and prevent the cleanup actions.

Unfortunately, that doesn't work if the last error was a monitor 
timeout. Oddly enough, I have to run "crm resource cleanup apacheClone" 
(not "apache") to fix the state of the apache resource, even though the 
monitor is part of the apache primitive, not the clone. If I try either 
variant with reprobe, nothing happens.

By the way, if I stop apache (/etc/init.d/apache2 stop), wait until 
Pacemaker notices, and start it again, then Pacemaker also notices that 
apache is back and moves the IPs accordingly!

Why does it matter to Pacemaker whether the service was shut down 
normally or failed with a monitor timeout?

> what about an 'on-fail' in the op monitor section - probably with an =ignore?
> More on that one here:
> http://www.clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/s-resource-operations.html

That doesn't help - Pacemaker sometimes (it's not deterministic and 
often only happens on one of the two nodes) still stops and starts apache.

Even after reading the documentation several times, I still barely get 
what on-fail=something is supposed to do. When I set e.g. 
"on-fail=ignore" on the apache primitive, it has no apparent effect 
(ditto for restart) - Pacemaker acts exactly as if that option were not 
set. Which kind of makes sense:

"The default for the stop operation is fence when STONITH is enabled and 
block otherwise. All other operations default to stop."

Thus, "ignore" equals "stop", and "stop" equals "block" (since I don't 
have STONITH). So what good is "ignore", if it's just another way of 
saying "block"?
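
A minimal sketch of the setup being discussed, in crm configure syntax. 
The names "apache" and "apacheClone" are the ones used above; the 
resource agent, interval and timeout are assumptions for illustration 
only, not the actual cluster configuration:

```
# Assumed agent and timings; only the on-fail attribute is the point here.
primitive apache ocf:heartbeat:apache \
    op monitor interval="30s" timeout="20s" on-fail="ignore"
# The clone wraps the primitive so that apache runs on both nodes.
clone apacheClone apache
```

Note that on-fail is set per operation, so in this sketch it applies 
only to the monitor op and not to start or stop.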

So I *suppose* what I'm seeing is that my failed apache resource gets 
into the blocked state, and since "blocked" means "don't do anything 
with that resource", it's no surprise that it doesn't recover 
automatically. But I still have no clue as to how I should do this 
instead...

Thanks,

David

-- 
David Gubler
Senior Software & Operations Engineer
MeetMe: http://doodle.com/david
E-Mail: dg at doodle.com


