[ClusterLabs] CRM managing ADSL connection; failure not handled
Ken Gaillot
kgaillot at redhat.com
Mon Aug 24 14:07:04 UTC 2015
On 08/24/2015 04:52 AM, Andrei Borzenkov wrote:
> 24.08.2015 12:35, Tom Yates wrote:
>> I've got a failover firewall pair where the external interface is ADSL;
>> that is, PPPoE. I've defined the service thus:
>>
>> primitive ExternalIP lsb:hb-adsl-helper \
>> op monitor interval="60s"
>>
>> and in addition written a noddy script /etc/init.d/hb-adsl-helper, thus:
>>
>> #!/bin/bash
>> RETVAL=0
>> start() {
>> /sbin/pppoe-start
>> }
>> stop() {
>> /sbin/pppoe-stop
>> }
>> case "$1" in
>> start)
>> start
>> ;;
>> stop)
>> stop
>> ;;
>> status)
>> /sbin/ifconfig ppp0 >& /dev/null && exit 0
>> exit 1
>> ;;
>> *)
>> echo $"Usage: $0 {start|stop|status}"
>> exit 3
>> esac
>> exit $?
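Pacemaker expects LSB agents to follow the LSB spec for return codes,
and can't behave properly if they don't. In particular, status must
exit 3 (not 1) when the service is not running, and stop must exit 0
even if the service is already stopped: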
http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#ap-lsb
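
As a rough illustration, the script could be reshaped along these lines.
This is only a sketch under a few assumptions: "running" is taken to mean
that ppp0 exists (the same test the original status branch uses), the exit
code of pppoe-stop is ignored in favour of checking whether the interface is
actually gone, and the IFACE, START_CMD and STOP_CMD variables and the
function names are made up for the example.

```shell
#!/bin/bash
# Sketch only: LSB-style wrapper around pppoe-start/pppoe-stop.
# IFACE, START_CMD and STOP_CMD are hypothetical knobs, not rp-pppoe settings.
IFACE=${IFACE:-ppp0}
START_CMD=${START_CMD:-/sbin/pppoe-start}
STOP_CMD=${STOP_CMD:-/sbin/pppoe-stop}

# Same test as the original script: does the interface exist?
link_up() {
    /sbin/ifconfig "$IFACE" >/dev/null 2>&1
}

do_start() {
    link_up && return 0          # LSB: starting a running service is success
    "$START_CMD" || return 1     # generic failure
    return 0
}

do_stop() {
    # Always run the stop command so a retrying pppoe-connect is killed too.
    # Its exit code is ignored (assumption: only the end state matters).
    "$STOP_CMD" >/dev/null 2>&1 || true
    for i in 1 2 3 4 5; do
        link_up || return 0      # LSB: stopped, or already stopped, is success
        sleep 1
    done
    return 1                     # interface still present: a real failure
}

do_status() {
    link_up && return 0          # LSB status: 0 = running
    return 3                     # LSB status: 3 = not running
}

dispatch() {
    case "$1" in
        start)  do_start ;;
        stop)   do_stop ;;
        status) do_status ;;
        *)      echo "Usage: $0 {start|stop|status}" >&2
                return 2 ;;      # LSB: invalid or excess argument
    esac
}

# Dispatch only when an action is given, so the functions can be tried
# out on their own as well.
if [ $# -gt 0 ]; then
    dispatch "$1"
    exit $?
fi
```

With something like this, a stop issued while the link is already down
exits 0, so Pacemaker can carry on with recovery instead of treating the
resource state as unknown.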
However, it's just as easy to write an OCF agent, which gives you more
flexibility (accepting parameters, etc.):
http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#ap-ocf
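
For comparison, a bare-bones OCF agent for the same job might look roughly
like the following. It is a sketch, not a tested agent: the agent name
pppoe-link, the "interface" parameter, the PPPOE_START/PPPOE_STOP variables
and the function names are all hypothetical, and the fallback exit codes are
only there so the sketch still runs on a machine without the resource-agents
helper library installed.

```shell
#!/bin/sh
# Sketch only: minimal OCF resource agent for a PPPoE link.
: "${OCF_ROOT:=/usr/lib/ocf}"
: "${OCF_FUNCTIONS_DIR:=${OCF_ROOT}/lib/heartbeat}"
# Load the standard helpers when present; otherwise fall back to the
# plain OCF exit codes.
[ -r "${OCF_FUNCTIONS_DIR}/ocf-shellfuncs" ] && . "${OCF_FUNCTIONS_DIR}/ocf-shellfuncs"
: "${OCF_SUCCESS:=0}" "${OCF_ERR_GENERIC:=1}"
: "${OCF_ERR_UNIMPLEMENTED:=3}" "${OCF_NOT_RUNNING:=7}"

# Hypothetical instance parameter, passed by the cluster as OCF_RESKEY_interface.
: "${OCF_RESKEY_interface:=ppp0}"
PPPOE_START=${PPPOE_START:-/sbin/pppoe-start}
PPPOE_STOP=${PPPOE_STOP:-/sbin/pppoe-stop}

meta_data() {
    cat <<EOF
<?xml version="1.0"?>
<!DOCTYPE resource-agent SYSTEM "ra-api-1.dtd">
<resource-agent name="pppoe-link" version="0.1">
  <version>1.0</version>
  <longdesc lang="en">Brings a PPPoE link up and down (sketch).</longdesc>
  <shortdesc lang="en">PPPoE link</shortdesc>
  <parameters>
    <parameter name="interface" unique="0" required="0">
      <longdesc lang="en">PPP interface to watch.</longdesc>
      <shortdesc lang="en">Interface</shortdesc>
      <content type="string" default="ppp0"/>
    </parameter>
  </parameters>
  <actions>
    <action name="start" timeout="60s"/>
    <action name="stop" timeout="60s"/>
    <action name="monitor" timeout="20s" interval="60s"/>
    <action name="meta-data" timeout="5s"/>
    <action name="validate-all" timeout="5s"/>
  </actions>
</resource-agent>
EOF
}

# Assumption: the link is "running" exactly when the interface exists.
link_up() {
    /sbin/ifconfig "$OCF_RESKEY_interface" >/dev/null 2>&1
}

pppoe_monitor() {
    link_up && return "$OCF_SUCCESS"
    return "$OCF_NOT_RUNNING"
}

pppoe_start() {
    link_up && return "$OCF_SUCCESS"
    "$PPPOE_START" || return "$OCF_ERR_GENERIC"
    return "$OCF_SUCCESS"
}

pppoe_stop() {
    # Exit code ignored (assumption); success is judged by the end state.
    "$PPPOE_STOP" >/dev/null 2>&1 || true
    for i in 1 2 3 4 5; do
        link_up || return "$OCF_SUCCESS"
        sleep 1
    done
    return "$OCF_ERR_GENERIC"
}

agent_main() {
    case "$1" in
        meta-data)    meta_data; return "$OCF_SUCCESS" ;;
        start)        pppoe_start ;;
        stop)         pppoe_stop ;;
        monitor)      pppoe_monitor ;;
        validate-all) return "$OCF_SUCCESS" ;;
        *)            return "$OCF_ERR_UNIMPLEMENTED" ;;
    esac
}

if [ $# -gt 0 ]; then
    agent_main "$1"
    exit $?
fi
```

Installed under a provider directory of your choosing (say
/usr/lib/ocf/resource.d/local/pppoe-link, again hypothetical), it would be
referenced as ocf:local:pppoe-link in place of lsb:hb-adsl-helper, with
something like params interface="ppp0" on the primitive.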
>> The problem is that sometimes the ADSL connection falls over, as they
>> do, eg:
>>
>> Aug 20 11:42:10 positron pppd[2469]: LCP terminated by peer
>> Aug 20 11:42:10 positron pppd[2469]: Connect time 8619.4 minutes.
>> Aug 20 11:42:10 positron pppd[2469]: Sent 1342528799 bytes, received
>> 164420300 bytes.
>> Aug 20 11:42:13 positron pppd[2469]: Connection terminated.
>> Aug 20 11:42:13 positron pppd[2469]: Modem hangup
>> Aug 20 11:42:13 positron pppoe[2470]: read (asyncReadFromPPP): Session
>> 1735: Input/output error
>> Aug 20 11:42:13 positron pppoe[2470]: Sent PADT
>> Aug 20 11:42:13 positron pppd[2469]: Exit.
>> Aug 20 11:42:13 positron pppoe-connect: PPPoE connection lost;
>> attempting re-connection.
>>
>> CRMd then logs a bunch of stuff, followed by
>>
>> Aug 20 11:42:18 positron lrmd: [1760]: info: rsc:ExternalIP:8: stop
>> Aug 20 11:42:18 positron lrmd: [28357]: WARN: For LSB init script, no
>> additional parameters are needed.
>> [...]
>> Aug 20 11:42:18 positron pppoe-stop: Killing pppd
>> Aug 20 11:42:18 positron pppoe-stop: Killing pppoe-connect
>> Aug 20 11:42:18 positron lrmd: [1760]: WARN: Managed ExternalIP:stop
>> process 28357 exited with return code 1.
>>
>>
>> At this point, the PPPoE connection is down, and stays down. CRMd
>> doesn't fail the group which contains both internal and external
>> interfaces over to the other node, but nor does it try to restart the
>> service. I'm fairly sure this is because I've done something
>> boneheaded, but I can't get my bone head around what it might be.
>>
>> Any light anyone can shed is much appreciated.
>>
>>
>
> If the stop operation fails, the resource state is undefined; Pacemaker
> won't do anything more with this resource. Either make sure the script
> returns success when appropriate, or the only option is to have it fence
> the node where the resource was active.