[Pacemaker] Pacemaker very often STONITHs other node
Nikita Staroverov
nsforth at gmail.com
Mon Dec 9 10:34:10 UTC 2013
> Hello,
>
> Still did not receive any hints from you. And you are definitely my
> only hope before I switch to Proxmox or (even worse) some commercial
> stuff.
>
> At least, can you tell me if mode 4 could cause trouble with Corosync?
>
> Thanks!
>
According to the logs you posted earlier, the reason was:
Nov 23 15:18:50 rivendell-B lrmd: [9526]: WARN: XEN-acsystemy01:stop
process (PID 20760) timed out (try 1). Killing with signal SIGTERM (15).
Nov 23 15:18:50 rivendell-B lrmd: [9526]: WARN: operation stop[115] on
XEN-acsystemy01 for client 9529: pid 20760 timed out
Nov 23 15:18:50 rivendell-B crmd: [9529]: ERROR: process_lrm_event: LRM
operation XEN-acsystemy01_stop_0 (115) Timed Out (timeout=240000ms)
Then rivendell-A did its job:
Nov 23 15:18:45 rivendell-A crmd: [8840]: WARN: status_from_rc: Action
117 (XEN-acsystemy01_stop_0) on rivendell-B failed (target: 0 vs. rc:
-2): Error
Nov 23 15:18:45 rivendell-A crmd: [8840]: WARN: update_failcount:
Updating failcount for XEN-acsystemy01 on rivendell-B after failed stop:
rc=-2 (update=INFINITY, time=1385216325)
Nov 23 15:18:45 rivendell-A crmd: [8840]: info: abort_transition_graph:
match_graph_event:277 - Triggered transition abort (complete=0,
tag=lrm_rsc_op, id=XEN-acsystemy01_last_failure_0,
magic=2:-2;117:5105:0:e3a546ba-30f9-4d69-803a-d27b
0ef626c4, cib=0.3259.139) : Event failed
Nov 23 15:18:45 rivendell-A crmd: [8840]: notice: run_graph: ====
Transition 5105 (Complete=11, Pending=0, Fired=0, Skipped=28,
Incomplete=2, Source=/var/lib/pengine/pe-input-69.bz2): Stopped
Nov 23 15:18:45 rivendell-A crmd: [8840]: notice: do_state_transition:
State transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [
input=I_PE_CALC cause=C_FSA_INTERNAL origin=notify_crmd ]
Nov 23 15:18:45 rivendell-A pengine: [8839]: notice: unpack_config: On
loss of CCM Quorum: Ignore
Nov 23 15:18:45 rivendell-A pengine: [8839]: WARN: unpack_rsc_op:
Processing failed op primitive-LVM:1_last_failure_0 on rivendell-B: not
running (7)
Nov 23 15:18:45 rivendell-A pengine: [8839]: WARN: unpack_rsc_op:
Processing failed op XEN-acsystemy01_last_failure_0 on rivendell-B:
unknown exec error (-2)
Nov 23 15:18:45 rivendell-A pengine: [8839]: WARN: pe_fence_node: Node
rivendell-B will be fenced to recover from resource failure(s)
Nov 23 15:18:45 rivendell-A pengine: [8839]: notice:
common_apply_stickiness: clone-LVM can fail 999999 more times on
rivendell-B before being forced off
Nov 23 15:18:45 rivendell-A pengine: [8839]: notice:
common_apply_stickiness: clone-LVM can fail 999999 more times on
rivendell-B before being forced off
Nov 23 15:18:45 rivendell-A pengine: [8839]: WARN: stage6: Scheduling
Node rivendell-B for STONITH
So, what happens? :)
Rivendell-B tried to stop XEN-acsystemy01, but couldn't, because the stop
operation timed out. A failed stop operation is fatal by default and
leads to STONITH.
Rivendell-A caught this and fenced rivendell-B.
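In CIB terms, when stonith-enabled=true the implicit default for a stop
operation is on-fail="fence", so the behaviour you saw is expected. Made
explicit, the stop operation on your Xen primitive effectively reads like
this (the timeout is taken from your log, the rest is just illustration):

    op stop interval="0" timeout="240s" on-fail="fence"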
You also have some other problems, like clone-LVM not running (but that
isn't fatal).
I think your servers are overloaded because a single DRBD device backs
all of the VMs. You should increase the operation timeouts or rework the
cluster configuration.
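For example, something like this via 'crm configure edit' raises the stop
timeout so that a slow domain shutdown no longer ends in fencing. This is
only a sketch: the resource name comes from your logs, but the xmfile path
and the exact timeout values are assumptions you need to adapt:

    primitive XEN-acsystemy01 ocf:heartbeat:Xen \
            params xmfile="/etc/xen/acsystemy01.cfg" \
            op start interval="0" timeout="120s" \
            op stop interval="0" timeout="600s" \
            op monitor interval="30s" timeout="60s"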
As for me, I use a configuration with one DRBD resource per virtual
machine disk, moderate timeouts, and 802.3ad bonding, without any problems.
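If it helps, here is roughly what I mean (all device names, LV paths and
IP addresses below are made up; adjust them to your setup). One DRBD
resource per VM disk, e.g. in /etc/drbd.d/vm-acsystemy01.res:

    resource vm-acsystemy01 {
            # one small resource per VM disk instead of one big one
            on rivendell-A {
                    device    /dev/drbd10;
                    disk      /dev/vg_xen/acsystemy01;
                    address   10.0.0.1:7790;
                    meta-disk internal;
            }
            on rivendell-B {
                    device    /dev/drbd10;
                    disk      /dev/vg_xen/acsystemy01;
                    address   10.0.0.2:7790;
                    meta-disk internal;
            }
    }

As for mode 4: Linux bonding mode 4 is 802.3ad/LACP, and it works with
Corosync as long as the switch side is configured for LACP as well. The
bonding module options look something along these lines (again a sketch,
not my exact values):

    options bonding mode=802.3ad miimon=100 lacp_rate=1 xmit_hash_policy=layer3+4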