[Pacemaker] [Fwd: Re: IPaddr2 not failing-over]

Thu Sep 2 06:24:30 UTC 2010

On Wed, Sep 1, 2010 at 2:51 PM, Ron Kerry <rkerry at sgi.com> wrote:
> I have taken over working this issue from Vince. The ping clone resource and
> constraints were set up as described in the prior attached link. Things were
> still not working correctly and the resources were not failing over as
> expected when we ifconfig'd one of the monitored interfaces down. I
> discovered a bug in the pacemaker/ping script (from the SLE11 HAE
> distribution) where a "*" in an expr statement had not been quoted and was
> thus being interpreted by the shell.

Also fixed upstream.
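
For anyone patching an older copy of the agent by hand, the quoting problem
can be reproduced outside the cluster. The following is only a sketch: the
variable names and values are made up for illustration and are not taken
from the ping RA itself.

```shell
#!/bin/sh
# Hypothetical values; the real agent's variable names may differ.
active=2
multiplier=1000

# Work in a scratch directory that contains files, so that an
# unquoted * has something to expand to.
demo_dir=$(mktemp -d)
touch "$demo_dir/a" "$demo_dir/b"

# Buggy form: the shell expands * to "a b" before expr runs,
# so expr sees "2 a b 1000" and reports a syntax error.
buggy_rc=0
buggy=$(cd "$demo_dir" && expr $active * $multiplier 2>/dev/null) || buggy_rc=$?

# Fixed form: quoting the operator passes a literal * to expr.
fixed=$(cd "$demo_dir" && expr "$active" '*' "$multiplier")

rm -rf "$demo_dir"
echo "unquoted: exit status $buggy_rc, output '$buggy'"
echo "quoted:   $fixed"
```

With default shell settings the unquoted form only misbehaves when the
current directory contains files, which is probably why this kind of bug
can slip through testing.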

> I fixed this problem and I was able to
> get a single failover to occur, but after that failover the ping monitor was
> canceled on the node that had the downed interface. Even after configuring
> the interface back up, the monitor task never ran again to notice that fact.
> This essentially leaves that node with a lower score and improper interface
> monitoring. I can clear the problem by stopping and then starting the ping
> clone resource. Note that I have tried pulling up the full ping resource
> agent script from the SLE11 HAE SP1 distribution and that does not improve
> this particular problem (though it fixes a few others).
>
> I have attached the full hb_report output, but here is a log snip of what is
> occurring.
>
> Sep  1 06:43:50 hpcnas2 root: ifconfig eth3 down
> Sep  1 06:43:59 hpcnas2 ntpd[10303]: Deleting interface #13 eth3,
> 10.10.20.32#123, interface stats: received=0, sent=0, dropped=0,
> active_time=42600 secs
> Sep  1 06:43:59 hpcnas2 ntpd[10303]: Deleting interface #15 eth3,
> 10.10.20.33#123, interface stats: received=0, sent=0, dropped=0,
> active_time=41100 secs
> Sep  1 06:44:01 hpcnas2 ping[28882]: [28887]: INFO: ping monitor invoked
> Sep  1 06:44:05 hpcnas2 ping[28882]: [28895]: ERROR: Unexpected result for
> 'ping -n -q -W 5 -c 5 10.10.20.30' 2: connect: Network is unreachable
> Sep  1 06:44:14 hpcnas2 attrd: [13676]: info: attrd_trigger_update: Sending
> flush op to all hosts for: pingd (2000)
> Sep  1 06:44:14 hpcnas2 attrd: [13676]: info: attrd_perform_update: Sent
> update 56: pingd=2000
> Sep  1 06:44:14 hpcnas2 crmd: [13678]: info: do_lrm_rsc_op: Performing
> key=34:686:0:bbe666a5-2b9f-4419-9728-803197b6e643 op=NFS_stop_0 )
> Sep  1 06:44:14 hpcnas2 lrmd: [13675]: info: rsc:NFS:83: stop
> ...
> resources failover
> ...
> Sep  1 06:45:09 hpcnas2 ping[29241]: [29246]: INFO: ping monitor invoked
> Sep  1 06:45:13 hpcnas2 ping[29241]: [29254]: ERROR: Unexpected result for
> 'ping -n -q -W 5 -c 5 10.10.20.30' 2: connect: Network is unreachable
> Sep  1 06:45:17 hpcnas2 crmd: [13678]: info: process_lrm_event: LRM
> operation ping:1_monitor_60000 (call=82, status=1, cib-update=0,
> confirmed=true) Cancelled
> Sep  1 06:45:32 hpcnas2 kernel: bnx2: eth3: using MSIX
> Sep  1 06:45:35 hpcnas2 kernel: bnx2: eth3 NIC Copper Link is Up, 1000 Mbps
> full duplex
> Sep  1 06:45:38 hpcnas2 root: ifconfig eth3 up
> Sep  1 06:48:08 hpcnas2 root: ping monitor appears to be no longer running
>
>
> The concern is the "process_lrm_event: LRM operation ping:1_monitor_60000 ()
> Cancelled" event.

Was the resource stopped?  That's the only time I could imagine a
recurring operation being cancelled.

> NOTE: The "ping monitor invoked" messages are a debug statement I added to
> the RA script so I know when the ping_monitor() routine is called.
>
> Thanks for any assistance you can provide -- Ron
>
>
>
> Nate Pearlstein wrote:
>>
>> Subject:
>> Re: [Pacemaker] IPaddr2 not failing-over
>> From:
>> "Andrew Beekhof" <andrew at beekhof.net>
>> Date:
>> Thu, 26 Aug 2010 02:47:46 -0500
>> To:
>> "The Pacemaker cluster resource manager" <pacemaker at oss.clusterlabs.org>
>>
>>
>> On Wed, Aug 11, 2010 at 10:55 PM, Vince Gabriel <vinceg at sgi.com> wrote:
>>  > Hi everyone,
>>  >
>>  > I have a new cluster that works exceptionally well, with the exception of
>>  > failovers initiated by the IPaddr2 virtual interfaces. If the interface is
>>  > downed or the cable disconnected, a failover never happens. I’ve attempted to
>>  > incorporate pingd; however, that has not helped either. It’s my understanding
>>  > that a pingd clone should not be needed any longer?
>>
>> If you want to move services based on connectivity, then you need a
>> ping(d) clone and some rules that make use of the properties it sets.
>>
>>
>> http://www.clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/ch09s03s03.html
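
As a rough illustration of that advice, a ping clone plus a location rule
might look like the following in crm shell syntax. The resource and
constraint ids (p-ping, cl-ping, loc-HA3-ip-connected), the multiplier and
dampen values, and the single-entry host_list are assumptions made for this
sketch; only the HA3-ip resource name, the 10.10.20.30 ping target, the 60s
monitor interval and the pingd attribute name appear elsewhere in this
thread.

```
# Hypothetical ids and parameter values; adjust to the real cluster.
primitive p-ping ocf:pacemaker:ping \
        params host_list="10.10.20.30" multiplier="1000" dampen="5s" \
        op monitor interval="60s" timeout="60s"
# Run the ping resource on every node so each node sets its own pingd value.
clone cl-ping p-ping \
        meta globally-unique="false"
# Keep HA3-ip off any node that has no pingd value or cannot reach the target.
location loc-HA3-ip-connected HA3-ip \
        rule -inf: not_defined pingd or pingd lte 0
```

If HA3-ip is a member of a group, the location constraint would normally
name the group instead of the individual primitive.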
>>
>>  >
>>  > nas1:~ # rpm -qa | grep hear
>>  > heartbeat-resources-3.0.0-0.2.8
>>  > heartbeat-common-3.0.0-0.6.5
>>  > libheartbeat2-3.0.0-0.6.5
>>  >
>>  > cnas1:~ # rpm -qa | grep -i pace
>>  > pacemaker-pygui-1.99.2-0.2.6
>>  > libpacemaker3-1.0.5-0.5.6
>>  > pacemaker-1.0.5-0.5.6
>>  >
>>  > primitive HA3-ip ocf:heartbeat:IPaddr2 \
>>  >         operations $id="HA3-ip-operations" \
>>  >         op monitor interval="60s" start-delay="0" timeout="30s" on-fail="restart" \
>>  >         op start interval="0" timeout="90" on-fail="restart" requires="fencing" \
>>  >         op stop interval="0" timeout="100" on-fail="fence" \
>>  >         params ip="10.10.20.33" nic="eth3" cidr_netmask="24" \
>>  >         meta resource-stickiness="1" migration-threshold="1"
>>  >
>>  > It’s my understanding (please correct me if I’m wrong) that if the interface
>>  > fails, it will attempt to restart the interface once,
>>
>> No, only if the resource fails.
>> Your logic only holds if the RA reports failure when the interface fails.
>>
>>  > if it happens again the group
>>  > it’s associated with should failover to the standby node based on
>>  > “migration-threshold="1"”.
>>  >
>>  > Thanks in Advance,
>>  >
>>  > -Vince
>>  >
>>  > --
>>  >
>>  > Vince Gabriel
>>  >
>>  > Field Technical Analyst
>>  >
>>  > SGI
>>  >
>>  > office: 361.729.9151
>>  >
>>  > cell:  409.392.8083
>>  >
>>  >
>>  >
>>  >
>>  >
>>  > _______________________________________________
>>  > Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>>  > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>  >
>>  > Project Home: http://www.clusterlabs.org
>>  > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>  > Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
>>  >
>>  >
>>
>>
>
>
> --
>
> Ron Kerry         rkerry at sgi.com
> Field Technical Support - SGI Federal
> Home Office: 248 375-5671  Cell: 248 761-7204
>
>