[Pacemaker] Recovery after simple master-master failover
David Gubler
dg at doodle.com
Thu Feb 23 17:08:29 CET 2012
Hi Jake,
Thanks for your answer. I had another go today.
On 22.02.2012 00:09, Jake Smith wrote:
> Still probably not the nicest/cleanest solution but you could do a cronjob that runs 'crm resource reprobe node_name'. That will check for resources the cluster didn't start and prevent the cleanup actions.
Unfortunately that doesn't work if the last error was a monitor
timeout. Oddly enough, I have to run "crm resource cleanup apacheClone" -
not "apache" - to fix the state of the apache resource, even though the
monitor is part of the apache resource, not the clone. If I try either
name with reprobe, nothing happens.
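
For reference, the two commands discussed above can be wrapped in a small
script such as the one below. This is only a sketch: the run helper and the
DRY_RUN switch are hypothetical additions that print each command instead of
executing it, and node_name is a placeholder, not a real node.

```shell
#!/bin/sh
# Sketch only. DRY_RUN defaults to 1, so commands are printed, not executed.
# Set DRY_RUN=0 to actually call crm (assumes the crm shell is installed).
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Clear the failed-operation state of a resource. In the setup described
# above this only had an effect when given the clone name.
cleanup_resource() {
    run crm resource cleanup "$1"
}

# Ask the cluster to re-check resource state on one node.
reprobe_node() {
    run crm resource reprobe "$1"
}

cleanup_resource apacheClone
reprobe_node node_name
```

With DRY_RUN left at its default, this only prints the two crm command
lines, which makes it safe to try before putting anything into a cronjob.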
By the way, if I stop apache (/etc/init.d/apache2 stop), wait until
Pacemaker notices, and start it again, then Pacemaker also notices that
apache is back and moves the IPs accordingly!
Why does it matter to Pacemaker whether the service was shut down
normally or failed with a monitor timeout?
> what about an 'on-fail' in the op monitor section - probably with an =ignore?
> More on that one here:
> http://www.clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/s-resource-operations.html
That doesn't help - Pacemaker sometimes (it's not deterministic and
often only happens on one of the two nodes) still stops and starts apache.
Even after reading the documentation several times, I still barely get
what on-fail=something is supposed to do. When I set e.g.
"on-fail=ignore" on the apache primitive, it has no apparent effect
(ditto for restart) - Pacemaker acts exactly as if that option were not
set. Which kind of makes sense:
"The default for the stop operation is fence when STONITH is enabled and
block otherwise. All other operations default to stop."
Thus, "ignore" equals "stop", and "stop" equals "block" (since I don't
have STONITH). So what good is "ignore", if it's just another way of
saying "block"?
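
To make the setting concrete: on-fail is a property of an individual
operation, so it belongs on the "op monitor" line rather than on the
primitive as a whole. The sketch below prints a crm configure fragment with
a chosen on-fail value. The function name, the ocf:heartbeat:apache agent,
and the interval and timeout values are assumptions for illustration, not
the actual configuration.

```shell
#!/bin/sh
# Sketch only: prints a crm configure fragment; it does not touch the cluster.
# Agent, interval and timeout are assumed values.
print_apache_config() {
    on_fail="$1"
    cat <<EOF
primitive apache ocf:heartbeat:apache \\
    op monitor interval="30s" timeout="20s" on-fail="${on_fail}"
clone apacheClone apache
EOF
}

print_apache_config ignore
```

Printing the fragment with different values (ignore, restart, block) makes it
easy to compare what is actually being set on the monitor operation.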
So I *suppose* what I'm seeing is that my failed apache resource gets
into the blocked state, and since "blocked" means "don't do anything
with that resource", it's no surprise that it doesn't recover
automatically. But I still have no clue how I should do this instead...
Thanks,
David
--
David Gubler
Senior Software & Operations Engineer
MeetMe: http://doodle.com/david
E-Mail: dg at doodle.com