[Pacemaker] [Fwd: Re: IPaddr2 not failing-over]
Ron Kerry
rkerry at sgi.com
Thu Sep 2 12:26:19 UTC 2010
Andrew Beekhof wrote:
> On Wed, Sep 1, 2010 at 2:51 PM, Ron Kerry <rkerry at sgi.com> wrote:
> > I have taken over working this issue from Vince. The ping clone resource
> > and constraints were set up as described in the prior attached link.
> > Things were still not working correctly and the resources were not
> > failing over as expected when we ifconfig'd one of the monitored
> > interfaces down. I discovered a bug in the pacemaker/ping script (from
> > the SLE11 HAE distribution) where a "*" in an expr statement had not
> > been quoted and was thus being interpreted by the shell.
>
> Also fixed upstream.
>
> > I fixed this problem and I was able to get a single failover to occur,
> > but after that failover the ping monitor was canceled on the node that
> > had the downed interface. Even after configuring the interface back up,
> > the monitor task never ran again to notice that fact. This essentially
> > leaves that node with a lower score and improper interface monitoring.
> > I can clear the problem by stopping and then starting the ping clone
> > resource. Note that I have tried pulling up the full ping resource agent
> > script from the SLE11 HAE SP1 distribution and that does not improve
> > this particular problem (though it fixes a few others).
> >
> > I have attached the full hb_report output, but here is a log snip of
> > what is occurring.
> >
> > Sep 1 06:43:50 hpcnas2 root: ifconfig eth3 down
> > Sep 1 06:43:59 hpcnas2 ntpd[10303]: Deleting interface #13 eth3, 10.10.20.32#123, interface stats: received=0, sent=0, dropped=0, active_time=42600 secs
> > Sep 1 06:43:59 hpcnas2 ntpd[10303]: Deleting interface #15 eth3, 10.10.20.33#123, interface stats: received=0, sent=0, dropped=0, active_time=41100 secs
> > Sep 1 06:44:01 hpcnas2 ping[28882]: [28887]: INFO: ping monitor invoked
> > Sep 1 06:44:05 hpcnas2 ping[28882]: [28895]: ERROR: Unexpected result for 'ping -n -q -W 5 -c 5 10.10.20.30' 2: connect: Network is unreachable
> > Sep 1 06:44:14 hpcnas2 attrd: [13676]: info: attrd_trigger_update: Sending flush op to all hosts for: pingd (2000)
> > Sep 1 06:44:14 hpcnas2 attrd: [13676]: info: attrd_perform_update: Sent update 56: pingd=2000
> > Sep 1 06:44:14 hpcnas2 crmd: [13678]: info: do_lrm_rsc_op: Performing key=34:686:0:bbe666a5-2b9f-4419-9728-803197b6e643 op=NFS_stop_0 )
> > Sep 1 06:44:14 hpcnas2 lrmd: [13675]: info: rsc:NFS:83: stop
> > ...
> > resources failover
> > ...
> > Sep 1 06:45:09 hpcnas2 ping[29241]: [29246]: INFO: ping monitor invoked
> > Sep 1 06:45:13 hpcnas2 ping[29241]: [29254]: ERROR: Unexpected result for 'ping -n -q -W 5 -c 5 10.10.20.30' 2: connect: Network is unreachable
> > Sep 1 06:45:17 hpcnas2 crmd: [13678]: info: process_lrm_event: LRM operation ping:1_monitor_60000 (call=82, status=1, cib-update=0, confirmed=true) Cancelled
> > Sep 1 06:45:32 hpcnas2 kernel: bnx2: eth3: using MSIX
> > Sep 1 06:45:35 hpcnas2 kernel: bnx2: eth3 NIC Copper Link is Up, 1000 Mbps full duplex
> > Sep 1 06:45:38 hpcnas2 root: ifconfig eth3 up
> > Sep 1 06:48:08 hpcnas2 root: ping monitor appears to be no longer running
> >
> >
> > The concern is the "process_lrm_event: LRM operation
> > ping:1_monitor_60000 () Cancelled" event.
>
> Was the resource stopped? That's the only time I could imagine a
> recurring operation being cancelled.
No, it was not stopped. In fact, the "crm_mon" output included with the hb_report output
shows the resource still running on both HA cluster nodes. How can I dig further
to figure out what is canceling the monitor operation, and why?
>
> > NOTE: The "ping monitor invoked" messages are a debug statement I added
> > to the RA script so I know when the ping_monitor() routine is called.
> >
> > Thanks for any assistance you can provide -- Ron
> >
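
For anyone hitting the same expr problem mentioned in the quoted text above, here is a minimal sketch of that class of bug and its fix. The variable names and values are hypothetical and chosen only for illustration; this is not the actual line from the ping resource agent.

```shell
#!/bin/sh
# Hypothetical values for illustration; not taken from the actual ping RA.
active=2
multiplier=1000

# Unquoted, the shell expands * to the file names in the current
# directory before expr ever runs:
#   score=$(expr $active * $multiplier)
# Escaping it passes a literal "*" through to expr.
score=$(expr "$active" \* "$multiplier")

# POSIX arithmetic expansion avoids expr and the quoting issue altogether.
same=$((active * multiplier))

echo "$score $same"
```

Either form multiplies the two values; the arithmetic expansion is usually the less error-prone choice in a POSIX shell script because no globbing takes place inside $(( )).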
--
Ron Kerry rkerry at sgi.com
Field Technical Support - SGI Federal
Home Office: 248 375-5671 Cell: 248 761-7204
--------------
NB: Information in this message is SGI confidential. It is intended solely for
the person(s) to whom it is addressed and may not be copied, used, disclosed or
distributed to others without SGI consent. If you are not the intended
recipient please notify me by email or telephone, delete the message from your
system immediately and destroy any printed copies.