[ClusterLabs] CRM managing ADSL connection; failure not handled
Tom Yates
madhatter at teaparty.net
Thu Aug 27 08:04:21 UTC 2015
On Mon, 24 Aug 2015, Andrei Borzenkov wrote:
> 24.08.2015 13:32, Tom Yates wrote:
>> if i understand you aright, my problem is that the stop script didn't
>> return a 0 (OK) exit status, so CRM didn't know where to go. is the
>> exit status of the stop script how CRM determines the status of the stop
>> operation?
>
> correct
>
>> does CRM also use the output of "/etc/init.d/script status" to determine
>> continuing successful operation?
>
> It definitely does not use *output* of script - only return code. If the
> question is whether it probes resource additionally to checking stop exit
> code - I do not think so (I know it does it in some cases for systemd
> resources).
i just thought i'd come back and follow up. in testing this morning, i
can confirm that the "pppoe-stop" command returns status 1 if pppd isn't
running. that makes a standard init.d script, which passes on the return
code of the stop command, unhelpful to CRM.
i changed the script so that on stop, having run pppoe-stop, it checks for
the existence of a working ppp0 interface, and returns 0 if and only if
there is none.
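for illustration, the stop logic amounts to something like the sketch below. this is not the actual script: the function names, the PPPOE_STOP and IFACE variables, and the use of /sys/class/net to test for the interface are all assumptions made for the example. the idea is to ignore the exit status of pppoe-stop and report success whenever the interface is gone, which also matches the LSB convention that "stop" on an already-stopped service counts as successful.

```shell
#!/bin/sh
# Hypothetical sketch of the "stop" logic; not the real init script.
# PPPOE_STOP and IFACE are illustrative names and may be overridden.
PPPOE_STOP="${PPPOE_STOP:-pppoe-stop}"
IFACE="${IFACE:-ppp0}"

# Succeeds (returns 0) if the named network interface exists.
# Assumes a Linux system with sysfs mounted on /sys.
iface_present() {
    [ -e "/sys/class/net/$1" ]
}

# Run the stop command, deliberately ignoring its exit status, then
# return 0 if and only if the interface is no longer present.
do_stop() {
    "$PPPOE_STOP" >/dev/null 2>&1 || true
    if iface_present "$IFACE"; then
        return 1    # link still there: the stop really failed
    fi
    return 0        # link is gone: tell CRM the stop succeeded
}
```

in an init script, the "stop" branch of the case statement would call do_stop and exit with its return code, so CRM sees 0 even when pppd had already died.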
> If resource was previously active and stop was attempted as cleanup after
> resource failure - yes, it should attempt to start it again.
that is now what happens. it seems to try three times to bring up pppd,
then kicks the service over to the other node.
in the case of extended outages (ie, the ISP goes away for more than about
10 minutes), where both nodes have time to fail, we end up back in the bad
old state (service failed on both nodes):
[root@positron ~]# crm status
[...]
Online: [ electron positron ]

 Resource Group: BothIPs
     InternalIP (ocf::heartbeat:IPaddr): Started electron
     ExternalIP (lsb:hb-adsl-helper): Stopped

Failed actions:
    ExternalIP_monitor_60000 (node=positron, call=15, rc=7, status=complete): not running
    ExternalIP_start_0 (node=positron, call=17, rc=-2, status=Timed Out): unknown exec error
    ExternalIP_start_0 (node=electron, call=6, rc=-2, status=Timed Out): unknown exec error
is there any way to configure CRM to keep kicking the service between the
two nodes forever (ie, try three times on positron, kick service group to
electron, try three times on electron, kick back to positron, lather rinse
repeat...)?
for a service like DSL, which can go away for extended periods through no
local fault then suddenly and with no announcement come back, this would
be most useful behaviour.
thanks to all for help with this. thanks also to those who have suggested
i rewrite this as an OCF agent (especially to ken gaillot who was kind
enough to point me to documentation); i will look at that if time permits.
--
Tom Yates - http://www.teaparty.net