[Pacemaker] pacemaker/corosync: a resource is started on 2 nodes
Sergey Arlashin
sergeyarl.maillist at gmail.com
Wed Jan 28 10:20:51 UTC 2015
Hi!
I have a small corosync/pacemaker-based cluster which consists of 4 nodes. 2 nodes are in standby mode, and the other 2 actually handle all the resources.
corosync ver. 1.4.7-1.
pacemaker ver 1.1.11.
os: ubuntu 12.04.
Inside our production environment, which has plenty of free RAM, CPU, etc., everything is working well. When I switch one node off, all the resources move to the other one without any problems. And vice versa. That's what I need :)
Our staging environment has rather weak hardware (that's ok - it's just staging :) ) and is rather busy. Sometimes it doesn't even have enough CPU or disk speed to be stable. When that happens, some of the cluster resources fail (which I consider to be normal), but I also see the following crm output:
Node db-node1: standby
Node db-node2: standby
Online: [ lb-node1 lb-node2 ]
Pgpool2 (ocf::heartbeat:pgpool): FAILED (unmanaged) [ lb-node2 lb-node1 ]
Resource Group: IPGroup
FailoverIP1 (ocf::heartbeat:IPaddr2): Started [ lb-node2 lb-node1 ]
As you can see, the resource ocf::heartbeat:IPaddr2 is started on both nodes (lb-node2 and lb-node1), but I can't figure out how that could happen.
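If it helps with diagnosing this, I suppose the state can be double-checked directly on each lb node, roughly like this (the address and label are the ones from my config below; crm_resource is part of pacemaker):

# is the failover address actually configured on this node?
ip addr show label '*:FAILOVER'
# and where does pacemaker itself think the resource is running?
crm_resource --resource FailoverIP1 --locate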
This is the output of my crm configure show:
node db-node1 \
        attributes standby=on
node db-node2 \
        attributes standby=on
node lb-node1
node lb-node2
primitive Cachier ocf:site:cachier \
        op monitor interval=10s timeout=30s depth=10 \
        meta target-role=Started
primitive FailoverIP1 IPaddr2 \
        params ip=111.22.33.44 cidr_netmask=32 iflabel=FAILOVER \
        op monitor interval=30s
primitive Mailer ocf:site:mailer \
        meta target-role=Started \
        op monitor interval=10s timeout=30s depth=10
primitive Memcached memcached \
        op monitor interval=10s timeout=30s depth=10 \
        meta target-role=Started
primitive Nginx nginx \
        params status10url="/nginx_status" testclient=curl port=8091 \
        op monitor interval=10s timeout=30s depth=10 \
        op start interval=0 timeout=40s \
        op stop interval=0 timeout=60s \
        meta target-role=Started
primitive Pgpool2 pgpool \
        params checkmethod=pid \
        op monitor interval=30s \
        op start interval=0 timeout=40s \
        op stop interval=0 timeout=60s
group IPGroup FailoverIP1 \
        meta target-role=Started
colocation ip-with-cachier inf: Cachier IPGroup
colocation ip-with-mailer inf: Mailer IPGroup
colocation ip-with-memcached inf: Memcached IPGroup
colocation ip-with-nginx inf: Nginx IPGroup
colocation ip-with-pgpool inf: Pgpool2 IPGroup
order cachier-after-ip inf: IPGroup Cachier
order mailer-after-ip inf: IPGroup Mailer
order memcached-after-ip inf: IPGroup Memcached
order nginx-after-ip inf: IPGroup Nginx
order pgpool-after-ip inf: IPGroup Pgpool2
property cib-bootstrap-options: \
        expected-quorum-votes=4 \
        stonith-enabled=false \
        default-resource-stickiness=100 \
        maintenance-mode=false \
        dc-version=1.1.10-9d39a6b \
        cluster-infrastructure="classic openais (with plugin)" \
        last-lrm-refresh=1422438144
So the question is - does my config allow a resource like ocf::heartbeat:IPaddr2 to be started on multiple nodes simultaneously? Is it something that can normally happen? Or is it happening because of the shortage of computing power which I described earlier? :)
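As far as I understand, the only way a single resource is *supposed* to run on several nodes at once is if it is wrapped in a clone, something like the snippet below (purely hypothetical names, just to illustrate the syntax):

clone SomeIPClone SomeIP \
        meta clone-max=2 globally-unique=false

I have no clone statements in my config at all, so I would expect Pacemaker to keep FailoverIP1 on exactly one node.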
How can I prevent something like this from happening? Is this the kind of situation that is normally supposed to be solved by STONITH?
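If STONITH is the answer here, I guess on my side that would mean configuring a fencing device per node and re-enabling stonith, roughly along these lines (stonith:external/ipmi and its parameter values are just placeholders - we haven't picked actual fencing hardware yet):

# placeholder fencing device for lb-node1 (and similarly for the other nodes)
primitive fence-lb-node1 stonith:external/ipmi \
        params hostname=lb-node1 ipaddr=... userid=... passwd=... interface=lan \
        op monitor interval=60s
# keep the fencing device off the node it is meant to fence
location fence-lb-node1-placement fence-lb-node1 -inf: lb-node1
# and turn fencing back on
property stonith-enabled=true

Does that sound like the right direction?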
Thanks in advance.
--
Best regards,
Sergey Arlashin