[Pacemaker] crm_resource -L not trustable right after restart

Andrew Beekhof andrew at beekhof.net
Mon Feb 17 19:55:23 EST 2014


On 22 Jan 2014, at 10:54 am, Brian J. Murrell (brian) <brian at interlinx.bc.ca> wrote:

> On Thu, 2014-01-16 at 14:49 +1100, Andrew Beekhof wrote:
>> 
>> What crm_mon are you looking at?
>> I see stuff like:
>> 
>> virt-fencing	(stonith:fence_xvm):	Started rhos4-node3 
>> Resource Group: mysql-group
>>     mysql-vip	(ocf::heartbeat:IPaddr2):	Started rhos4-node3 
>>     mysql-fs	(ocf::heartbeat:Filesystem):	Started rhos4-node3 
>>     mysql-db	(ocf::heartbeat:mysql):	Started rhos4-node3 
> 
> Yes, you are right.  I couldn't see the forest for the trees.
> 
> I initially was optimistic about crm_mon being more truthful than
> crm_resource but it turns out it is not.

It can't be; they're both obtaining their data from the same place (the CIB).
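
Roughly speaking, both tools parse the status section of the CIB, so neither can be more current than the CIB itself. Something like the following shows the same data they consume (illustrative; exact output varies by version):

    # dump the status section that crm_mon and crm_resource read
    cibadmin -Q -o status

    # ask where the cluster currently records res1 as running
    crm_resource -r res1 -W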

> 
> Take for example these commands to set a constraint and start a resource
> (which has already been defined at this point):
> 
> [21/Jan/2014:13:46:40] cibadmin -o constraints -C -X '<rsc_location id="res1-primary" node="node5" rsc="res1" score="20"/>'
> [21/Jan/2014:13:46:41] cibadmin -o constraints -C -X '<rsc_location id="res1-secondary" node="node6" rsc="res1" score="10"/>'
> [21/Jan/2014:13:46:42] crm_resource -r 'res1' -p target-role -m -v 'Started'
> 
> and then these repeated calls to crm_mon -1 on node5:
> 
> [21/Jan/2014:13:46:42] crm_mon -1
> Last updated: Tue Jan 21 13:46:42 2014
> Last change: Tue Jan 21 13:46:42 2014 via crm_resource on node5
> Stack: openais
> Current DC: node5 - partition with quorum
> Version: 1.1.10-14.el6_5.1-368c726
> 2 Nodes configured
> 2 Resources configured
> 
> 
> Online: [ node5 node6 ]
> 
> st-fencing	(stonith:fence_product):	Started node5 
> res1	(ocf::product:Target):	Started node6 
> 
> [21/Jan/2014:13:46:42] crm_mon -1
> Last updated: Tue Jan 21 13:46:42 2014
> Last change: Tue Jan 21 13:46:42 2014 via crm_resource on node5
> Stack: openais
> Current DC: node5 - partition with quorum
> Version: 1.1.10-14.el6_5.1-368c726
> 2 Nodes configured
> 2 Resources configured
> 
> 
> Online: [ node5 node6 ]
> 
> st-fencing	(stonith:fence_product):	Started node5 
> res1	(ocf::product:Target):	Started node6 
> 
> [21/Jan/2014:13:46:49] crm_mon -1 -r
> Last updated: Tue Jan 21 13:46:49 2014
> Last change: Tue Jan 21 13:46:42 2014 via crm_resource on node5
> Stack: openais
> Current DC: node5 - partition with quorum
> Version: 1.1.10-14.el6_5.1-368c726
> 2 Nodes configured
> 2 Resources configured
> 
> 
> Online: [ node5 node6 ]
> 
> Full list of resources:
> 
> st-fencing	(stonith:fence_product):	Started node5 
> res1	(ocf::product:Target):	Started node5 
> 
> The first two are not correct, showing the resource started on node6
> when it was actually started on node5.

Was it running there to begin with?
Answering my own question... yes. It was:

> Jan 21 13:46:41 node5 crmd[8695]:  warning: status_from_rc: Action 6 (res1_monitor_0) on node6 failed (target: 7 vs. rc: 0): Error

and then we try to stop it:

> Jan 21 13:46:41 node5 crmd[8695]:   notice: te_rsc_command: Initiating action 7: stop res1_stop_0 on node6


So you are correct that something is wrong, but it isn't pacemaker.
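
In the quoted probe result above, "target: 7 vs. rc: 0" means the cluster expected OCF_NOT_RUNNING (7) from the initial res1_monitor_0 probe on node6, but the agent returned OCF_SUCCESS (0); in other words, the resource was already active on node6 outside the cluster's control, and the stop action that follows is the cluster cleaning that up. One way to see those probe results directly is the operation history (illustrative; output format varies by version):

    # -o/--operations includes the per-node operation history, so the
    # *_monitor_0 probe results and their return codes are visible
    crm_mon -1 -o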


>  Finally, 7 seconds later, it is
> reporting correctly.  The logs on node{5,6} bear this out.  The resource
> was actually only ever started on node5 and never on node6.

Wrong.
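
If there is any doubt about which node actually ran the resource, the cluster's own records settle it. A rough way to collect them for the window in question (the destination name and exact time format here are just examples):

    # gather logs, the CIB and policy-engine inputs from every node
    crm_report -f "2014-01-21 13:46:00" -t "2014-01-21 13:47:00" res1-report

    # or, on each node, grep the syslog (location varies by distro)
    # for the resource's operations
    grep 'res1_' /var/log/messages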
