[ClusterLabs] cluster does not detect kill on pacemaker process?

Fri Apr 7 10:58:50 EDT 2017

On 04/05/2017 05:16 PM, neeraj ch wrote:
> Hello All, 
> 
> I noticed something on our pacemaker test cluster. The cluster is
> configured to manage an underlying database using a master/slave
> primitive.
> 
> I killed the pacemaker process on one node, and all the other nodes
> kept showing that node as online. I then killed the underlying database
> on the same node, which would have been detected had pacemaker on the
> node been running. The cluster did not detect that the database on the
> node had failed, and the failover never occurred.
> 
> I went on to kill corosync on the same node and the cluster now marked
> the node as stopped and proceeded to elect a new master. 
> 
> 
> In a separate test, I killed the pacemaker process on the cluster DC,
> and the cluster showed no change. I then changed the CIB on a different
> node. The CIB modify command timed out. After that, the node didn't
> fail over even when I turned off corosync on the cluster DC. The
> cluster didn't recover after this mishap.
> 
> Is this expected behavior? Is there a solution for when the OOM killer
> decides to kill the pacemaker process?
> 
> I run pacemaker 1.1.14, with corosync 1.4. I have stonith disabled and
> quorum enabled. 
> 
> Thank you,
> 
> nwarriorch

What exactly are you doing to kill pacemaker? There are multiple
pacemaker processes, and they have different recovery methods.
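
To see which pacemaker process a kill actually removed, something like
the following rough sketch may help. It assumes the daemon names used
by Pacemaker 1.1.x (pacemakerd, cib, stonithd, lrmd, attrd, pengine,
crmd); the helper names missing_daemons and running_process_names are
made up for illustration and are not part of pacemaker.

```python
import subprocess

# Daemon names as used by Pacemaker 1.1.x. Assumption: other releases
# may name these differently, so adjust the set for your version.
PACEMAKER_DAEMONS = {
    "pacemakerd", "cib", "stonithd", "lrmd", "attrd", "pengine", "crmd",
}


def missing_daemons(running_names):
    """Return the expected daemons absent from running_names, sorted."""
    return sorted(PACEMAKER_DAEMONS - set(running_names))


def running_process_names():
    """Return the command names of all processes, via `ps -e -o comm=`."""
    out = subprocess.run(
        ["ps", "-e", "-o", "comm="],
        capture_output=True, text=True, check=True,
    ).stdout
    return [line.strip() for line in out.splitlines() if line.strip()]
```

Running missing_daemons(running_process_names()) on the node before and
after the kill should show which daemon went away and whether anything
respawned it.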

Also, what OS/version are you running? If it has systemd, that can play
a role in recovery as well.

Having stonith disabled is a big part of what you're seeing. When a node
fails, stonith is the only way the rest of the cluster can be sure the
node is unable to cause trouble, so it can recover services elsewhere.
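
As a quick way to confirm what the cluster currently has configured,
here is a minimal sketch that reads the stonith-enabled property. It
assumes crm_attribute is on the PATH of a cluster node and that an
unset property means the default (enabled); the function names are
hypothetical.

```python
import subprocess

# Values treated as "true" here; an assumption about how the boolean
# property is commonly written in the CIB.
TRUE_VALUES = {"true", "yes", "on", "1", "y"}


def stonith_enabled(value):
    """Interpret a stonith-enabled value; unset is treated as enabled."""
    if value is None or value.strip() == "":
        return True
    return value.strip().lower() in TRUE_VALUES


def query_stonith_enabled():
    """Return the raw property value, or None if the query fails."""
    # Assumption: a non-zero exit status means the property is not set.
    result = subprocess.run(
        ["crm_attribute", "--type", "crm_config",
         "--name", "stonith-enabled", "--query", "--quiet"],
        capture_output=True, text=True,
    )
    return result.stdout if result.returncode == 0 else None
```

On the cluster described above, stonith_enabled(query_stonith_enabled())
would be expected to return False, since stonith was disabled there.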