[Pacemaker] Stopping/restarting pacemaker without stopping resources?

Andrew Beekhof andrew at beekhof.net
Mon Oct 27 05:39:34 EDT 2014


> On 27 Oct 2014, at 5:40 pm, Andrei Borzenkov <arvidjaar at gmail.com> wrote:
> 
> On Mon, Oct 27, 2014 at 6:34 AM, Andrew Beekhof <andrew at beekhof.net> wrote:
>> 
>>> On 27 Oct 2014, at 2:30 pm, Andrei Borzenkov <arvidjaar at gmail.com> wrote:
>>> 
>>> On Mon, 27 Oct 2014 11:09:08 +1100
>>> Andrew Beekhof <andrew at beekhof.net> wrote:
>>> 
>>>> 
>>>>> On 25 Oct 2014, at 12:38 am, Andrei Borzenkov <arvidjaar at gmail.com> wrote:
>>>>> 
>>>>> On Fri, Oct 24, 2014 at 9:17 AM, Andrew Beekhof <andrew at beekhof.net> wrote:
>>>>>> 
>>>>>>> On 16 Oct 2014, at 9:31 pm, Andrei Borzenkov <arvidjaar at gmail.com> wrote:
>>>>>>> 
>>>>>>> The primary goal is to transparently update software in the cluster. I
>>>>>>> just did an HA suite update using a plain RPM upgrade and observed that
>>>>>>> RPM attempts to restart the stack (rcopenais try-restart). So
>>>>>>> 
>>>>>>> a) if it had worked, it would mean resources had been migrated away
>>>>>>> from this node - an interruption
>>>>>>> 
>>>>>>> b) it did not work - apparently the new versions of the installed utils
>>>>>>> were incompatible with the running pacemaker, so the request to shut
>>>>>>> down crm failed and openais hung forever.
>>>>>>> 
>>>>>>> The usual workflow with one of the cluster products I worked with
>>>>>>> before was: stop the cluster processes without stopping resources;
>>>>>>> update; restart the cluster processes. They would detect that the
>>>>>>> resources were still running and return to the same state as before
>>>>>>> the stop. Is something like this possible with pacemaker?
>>>>>> 
>>>>>> absolutely.  this should be of some help:
>>>>>> 
>>>>>> http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/_disconnect_and_reattach.html
>>>>>> 
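(For what it's worth, on recent builds the same procedure is usually done
with the maintenance-mode cluster property instead of is-managed-default,
roughly:

  crm_attribute -t crm_config -n maintenance-mode -v true
  # stop the stack, upgrade the packages, start the stack on each node
  crm_attribute -t crm_config -n maintenance-mode -v false

but the detach/reattach behaviour is the same either way.)
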
>>>>> 
>>>>> It did not work. It ended up moving the master to another node and
>>>>> leaving the slave on the original node stopped.
>>>> 
>>>> When you stopped the cluster or when you started it after an upgrade?
>>> 
>>> When I started it
>>> 
>>> crm_attribute -t crm_config -n is-managed-default -v false
>>> rcopenais stop on both nodes
>>> rcopenais start on both nodes; wait for them to stabilize
>>> crm_attribute -t crm_config -n is-managed-default -v true
>>> 
>>> It stopped the running master/slave resource, moved the master and left
>>> the slave stopped.
>> 
>> What did crm_mon say before you set is-managed-default back to true?
>> Did the resource agent properly detect it as running in the master state?
> 
> You are right, it returned 0, not 8.
> 
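For the reattach case the probe itself has to report the role; the usual
shape is roughly this (a sketch only - my_daemon_status/my_daemon_is_master
stand in for however the agent can ask the service it manages):

  . ${OCF_FUNCTIONS_DIR:-${OCF_ROOT}/lib/heartbeat}/ocf-shellfuncs

  monitor() {
      my_daemon_status    || return $OCF_NOT_RUNNING     # 7: stopped
      my_daemon_is_master && return $OCF_RUNNING_MASTER  # 8: running as master
      return $OCF_SUCCESS                                # 0: running as slave
  }
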
>> Did the resource agent properly (re)set a preference for being promoted during the initial monitor operation?
>> 
> 
> It did, but it was too late - after it had already been demoted.
> 
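Which is why the crm_master call needs to happen from the monitor itself
(including the initial probe), so the preference is already in place when
the transition is computed.  Roughly, reusing the placeholder check from
above (the scores are arbitrary):

  if my_daemon_is_master; then
      crm_master -l reboot -v 100   # "prefer to promote me here"
      return $OCF_RUNNING_MASTER
  else
      crm_master -l reboot -v 10
      return $OCF_SUCCESS
  fi
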
>> Pacemaker can do it, but it is dependent on the resources behaving correctly.
>> 
> 
> I see.
> 
> Well, this would be a problem ... the RA keeps track of its current
> promoted/demoted status in the CIB as a transient attribute, which gets
> reset after reboot.

Not only after reboot.
I would not encourage this approach; the CIB could be erased/reset at any time.

The purpose of the monitor action is to discover the resource's state; reading it out of the CIB defeats the point.

> This would entail quite a bit of redesign ...

A state file in /var/run?
But ideally the RA would be able to talk to the interface/daemon/whatever and discover the true state.
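
Something like this inside the monitor, for instance (a sketch; my_daemon
and the file name are placeholders for whatever the agent actually manages):

  # preferred: ask the service itself which role it is in
  role=$(my_daemon --query-role 2>/dev/null)

  # fallback: a state file maintained by start/promote/demote, which at
  # least survives a pacemaker restart (though not a node reboot)
  [ -z "$role" ] && [ -f /var/run/my_ra.master ] && role=master

rather than reading attributes back out of the CIB.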

> 
> But what got me confused were these errors during initial probing, like
> 
> Oct 24 17:26:54 n1 crmd[32425]:  warning: status_from_rc: Action 9
> (rsc_ip_VIP_monitor_0) on n2 failed (target: 7 vs. rc: 0): Error
> 
> This looks like pacemaker expects the resource to be in the stopped state,
> and a "running" state would be interpreted as an error?

Yes. The computed graph assumed the resource was stopped in that location.
Since that is not true, the graph must be aborted and a new one calculated.
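
To decode that log line: "target" is what the probe was expected to return
and "rc" is what it actually returned.  The relevant OCF codes here are:

  OCF_SUCCESS=0          # running (as slave, for a master/slave resource)
  OCF_NOT_RUNNING=7      # stopped - the expected probe result here
  OCF_RUNNING_MASTER=8   # running as master
  OCF_FAILED_MASTER=9    # failed while master
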

> I mean, the normal response to such a monitor result would be to stop the
> resource to bring it into the target state, no?

Usually.


