[ClusterLabs] CRM managing ADSL connection; failure not handled
Ken Gaillot
kgaillot at redhat.com
Thu Aug 27 14:14:30 UTC 2015
On 08/27/2015 03:04 AM, Tom Yates wrote:
> On Mon, 24 Aug 2015, Andrei Borzenkov wrote:
>
>> 24.08.2015 13:32, Tom Yates wrote:
>>> if i understand you aright, my problem is that the stop script didn't
>>> return a 0 (OK) exit status, so CRM didn't know where to go. is the
>>> exit status of the stop script how CRM determines the status of the
>>> stop operation?
>>
>> correct
>>
>>> does CRM also use the output of "/etc/init.d/script status" to
>>> determine continuing successful operation?
>>
>> It definitely does not use *output* of script - only return code. If
>> the question is whether it probes resource additionally to checking
>> stop exit code - I do not think so (I know it does it in some cases
>> for systemd resources).
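To illustrate the point about return codes: a minimal sketch of a status action whose answer is carried only by its exit code. The LSB convention is 0 for running and 3 for not running; the function names, the ppp0 default and the sysfs check are assumptions for illustration, not taken from the actual hb-adsl-helper script.

```shell
#!/bin/sh
# Hypothetical status action. The cluster reads only the return code;
# the echoed text is for humans.
IFACE="${IFACE:-ppp0}"   # assumed interface name

iface_present() {
    # Assumption: on Linux, an existing interface has a sysfs entry.
    [ -d "/sys/class/net/$1" ]
}

status_adsl() {
    if iface_present "$IFACE"; then
        echo "$IFACE is up"
        return 0    # LSB status: service is running
    fi
    echo "$IFACE is down"
    return 3        # LSB status: service is not running
}
```

An init script would call status_adsl from its "status" branch and exit with the function's return code.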
>
> i just thought i'd come back and follow up. in testing this morning, i
> can confirm that the "pppoe-stop" command returns status 1 if pppd isn't
> running. that makes a standard init.d script, which passes on the
> return code of the stop command, unhelpful to CRM.
>
> i changed the script so that on stop, having run pppoe-stop, it checks
> for the existence of a working ppp0 interface, and returns 0 iff there
> is none.
Nice
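For anyone finding this in the archives, the stop logic described above might look roughly like this. It is a sketch under assumptions, not Tom's actual script: the function names, the overridable PPPOE_STOP and IFACE variables, and the sysfs test for "interface exists" are all illustrative.

```shell
#!/bin/sh
# Hypothetical stop action: success means "the link is down", regardless
# of what pppoe-stop itself returned.
IFACE="${IFACE:-ppp0}"                  # assumed interface name
PPPOE_STOP="${PPPOE_STOP:-pppoe-stop}"  # assumed to be on PATH

iface_is_up() {
    # Assumption: an existing interface has a sysfs entry on Linux.
    [ -d "/sys/class/net/$1" ]
}

stop_adsl() {
    # pppoe-stop returns 1 when pppd is not running, so its exit status
    # is deliberately ignored here.
    "$PPPOE_STOP" >/dev/null 2>&1 || true

    if iface_is_up "$IFACE"; then
        return 1    # interface still present: the stop really failed
    fi
    return 0        # nothing left to stop: report success to the cluster
}
```

The "stop" branch of the init script would call stop_adsl and exit with its return code, so stopping an already-stopped link counts as success.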
>> If resource was previously active and stop was attempted as cleanup
>> after resource failure - yes, it should attempt to start it again.
>
> that is now what happens. it seems to try three times to bring up pppd,
> then kicks the service over to the other node.
>
> in the case of extended outages (ie, the ISP goes away for more than
> about 10 minutes), where both nodes have time to fail, we end up back in
> the bad old state (service failed on both nodes):
>
> [root@positron ~]# crm status
> [...]
> Online: [ electron positron ]
>
> Resource Group: BothIPs
> InternalIP (ocf::heartbeat:IPaddr): Started electron
> ExternalIP (lsb:hb-adsl-helper): Stopped
>
> Failed actions:
> ExternalIP_monitor_60000 (node=positron, call=15, rc=7, status=complete): not running
> ExternalIP_start_0 (node=positron, call=17, rc=-2, status=Timed Out): unknown exec error
> ExternalIP_start_0 (node=electron, call=6, rc=-2, status=Timed Out): unknown exec error
>
> is there any way to configure CRM to keep kicking the service between
> the two nodes forever (ie, try three times on positron, kick service
> group to electron, try three times on electron, kick back to positron,
> lather rinse repeat...)?
>
> for a service like DSL, which can go away for extended periods through
> no local fault then suddenly and with no announcement come back, this
> would be most useful behaviour.
Yes, see migration-threshold and failure-timeout.
http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#s-resource-options
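A rough sketch of setting those two meta-attributes with the crm shell, assuming the resource name ExternalIP from the status output above. The values 3 and 120s are illustrative only, and the wrapper function and CRM variable are hypothetical conveniences (set CRM=echo to preview the commands without touching the cluster).

```shell
#!/bin/sh
# Sketch only: values are examples, not recommendations.
CRM="${CRM:-crm}"   # override with CRM=echo for a dry run
RSC="ExternalIP"    # resource name as shown in the crm status output

set_retry_policy() {
    # Move the resource away from a node after this many failures there.
    "$CRM" resource meta "$RSC" set migration-threshold 3
    # Expire old failures so a previously failed node becomes eligible again.
    "$CRM" resource meta "$RSC" set failure-timeout 120s
}
```

Two caveats worth checking against the documentation linked above: expired failures are only noticed when the cluster re-evaluates its state, so cluster-recheck-interval bounds how quickly failure-timeout takes effect; and with the default start-failure-is-fatal=true, a failed start bans the resource from that node immediately rather than counting against migration-threshold.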
> thanks to all for help with this. thanks also to those who have
> suggested i rewrite this as an OCF agent (especially to ken gaillot who
> was kind enough to point me to documentation); i will look at that if
> time permits.