[Pacemaker] Pacemaker very often STONITHs other node
Nikita Staroverov
nsforth at gmail.com
Mon Dec 9 10:34:10 UTC 2013
> Hello,
>
> Still did not receive any hints from you. And you are definitely my
> only hope before I switch to Proxmox or (even worse) some commercial
> stuff.
>
> At least, can you tell me if mode 4 could cause trouble with Corosync?
>
> Thanks!
>
According to the logs you posted earlier, the reason was:
Nov 23 15:18:50 rivendell-B lrmd: [9526]: WARN: XEN-acsystemy01:stop
process (PID 20760) timed out (try 1). Killing with signal SIGTERM (15).
Nov 23 15:18:50 rivendell-B lrmd: [9526]: WARN: operation stop[115] on
XEN-acsystemy01 for client 9529: pid 20760 timed out
Nov 23 15:18:50 rivendell-B crmd: [9529]: ERROR: process_lrm_event: LRM
operation XEN-acsystemy01_stop_0 (115) Timed Out (timeout=240000ms)
Then rivendell-A did its job:
Nov 23 15:18:45 rivendell-A crmd: [8840]: WARN: status_from_rc: Action
117 (XEN-acsystemy01_stop_0) on rivendell-B failed (target: 0 vs. rc:
-2): Error
Nov 23 15:18:45 rivendell-A crmd: [8840]: WARN: update_failcount:
Updating failcount for XEN-acsystemy01 on rivendell-B after failed stop:
rc=-2 (update=INFINITY, time=1385216325)
Nov 23 15:18:45 rivendell-A crmd: [8840]: info: abort_transition_graph:
match_graph_event:277 - Triggered transition abort (complete=0,
tag=lrm_rsc_op, id=XEN-acsystemy01_last_failure_0,
magic=2:-2;117:5105:0:e3a546ba-30f9-4d69-803a-d27b
0ef626c4, cib=0.3259.139) : Event failed
Nov 23 15:18:45 rivendell-A crmd: [8840]: notice: run_graph: ====
Transition 5105 (Complete=11, Pending=0, Fired=0, Skipped=28,
Incomplete=2, Source=/var/lib/pengine/pe-input-69.bz2): Stopped
Nov 23 15:18:45 rivendell-A crmd: [8840]: notice: do_state_transition:
State transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [
input=I_PE_CALC cause=C_FSA_INTERNAL origin=notify_crmd ]
Nov 23 15:18:45 rivendell-A pengine: [8839]: notice: unpack_config: On
loss of CCM Quorum: Ignore
Nov 23 15:18:45 rivendell-A pengine: [8839]: WARN: unpack_rsc_op:
Processing failed op primitive-LVM:1_last_failure_0 on rivendell-B: not
running (7)
Nov 23 15:18:45 rivendell-A pengine: [8839]: WARN: unpack_rsc_op:
Processing failed op XEN-acsystemy01_last_failure_0 on rivendell-B:
unknown exec error (-2)
Nov 23 15:18:45 rivendell-A pengine: [8839]: WARN: pe_fence_node: Node
rivendell-B will be fenced to recover from resource failure(s)
Nov 23 15:18:45 rivendell-A pengine: [8839]: notice:
common_apply_stickiness: clone-LVM can fail 999999 more times on
rivendell-B before being forced off
Nov 23 15:18:45 rivendell-A pengine: [8839]: notice:
common_apply_stickiness: clone-LVM can fail 999999 more times on
rivendell-B before being forced off
Nov 23 15:18:45 rivendell-A pengine: [8839]: WARN: stage6: Scheduling
Node rivendell-B for STONITH
So, what happens? :)
Rivendell-B tried to stop XEN-acsystemy01, but couldn't, because the stop
operation timed out. A failed stop operation is fatal by default and
leads to STONITH.
Rivendell-A caught this and fenced rivendell-B.
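In CIB terms, when stonith-enabled=true the implicit default for a stop
operation is on-fail="fence", so the behaviour you saw is expected. Made
explicit, the stop operation on your Xen primitive effectively reads like
this (the timeout is taken from your log, the rest is just illustration):

    op stop interval="0" timeout="240s" on-fail="fence"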
You also have some other problems, like clone-LVM not running (but that
isn't fatal).
I think your servers are overloaded because a single DRBD device backs
all of the VMs. You should increase the operation timeouts or rework the
cluster configuration.
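For example, something like this via 'crm configure edit' raises the stop
timeout so that a slow domain shutdown no longer ends in fencing. This is
only a sketch: the resource name comes from your logs, but the xmfile path
and the exact timeout values are assumptions you need to adapt:

    primitive XEN-acsystemy01 ocf:heartbeat:Xen \
            params xmfile="/etc/xen/acsystemy01.cfg" \
            op start interval="0" timeout="120s" \
            op stop interval="0" timeout="600s" \
            op monitor interval="30s" timeout="60s"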
As for me, I use a configuration with one DRBD resource per virtual
machine disk, moderate timeouts, and 802.3ad bonding, without any problems.
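If it helps, here is roughly what I mean (all device names, LV paths and
IP addresses below are made up; adjust them to your setup). One DRBD
resource per VM disk, e.g. in /etc/drbd.d/vm-acsystemy01.res:

    resource vm-acsystemy01 {
            # one small resource per VM disk instead of one big one
            on rivendell-A {
                    device    /dev/drbd10;
                    disk      /dev/vg_xen/acsystemy01;
                    address   10.0.0.1:7790;
                    meta-disk internal;
            }
            on rivendell-B {
                    device    /dev/drbd10;
                    disk      /dev/vg_xen/acsystemy01;
                    address   10.0.0.2:7790;
                    meta-disk internal;
            }
    }

As for mode 4: Linux bonding mode 4 is 802.3ad/LACP, and it works with
Corosync as long as the switch side is configured for LACP as well. The
bonding module options look something along these lines (again a sketch,
not my exact values):

    options bonding mode=802.3ad miimon=100 lacp_rate=1 xmit_hash_policy=layer3+4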