[Pacemaker] IPaddr2 not failing-over

Thu Sep 2 17:44:47 EDT 2010

Reproduced this on an in-house system running the same SLE11-HAE software level and the 
pacemaker/ping script from SLES11-HAE_SP1. This time the ping monitor got canceled on one of the 
systems before I even attempted to configure down an interface. I had debug logging turned on which 
provides a little more information.

 From /var/log/messages:

Sep  2 16:22:29 frog ping[2049]: [2054]: INFO: ping monitor invoked
Sep  2 16:22:33 frog attrd_updater: [2060]: info: Invoked: attrd_updater -n pingd -v 1000 -d 5s
Sep  2 16:22:33 frog attrd_updater: [2060]: info: attrd_lazy_update: Connecting to cluster... 5 
retries remaining
Sep  2 16:22:33 frog attrd_updater: [2060]: debug: init_client_ipc_comms_nodispatch: Attempting to 
talk on: /var/run/crm/attrd
Sep  2 16:22:33 frog attrd: [30708]: debug: attrd_local_callback: update message from attrd_updater: 
pingd=1000
Sep  2 16:22:33 frog attrd_updater: [2060]: debug: attrd_update: Sent update: pingd=1000 for localhost
Sep  2 16:22:33 frog attrd: [30708]: debug: attrd_local_callback: Supplied: 1000, Current: 1000, 
Stored: 1000
Sep  2 16:22:33 frog ping[2049]: [2061]: DEBUG: Updated pingd = 1000
Sep  2 16:22:33 frog lrmd: [30707]: debug: on_ra_proc_finished: operation monitor[40] on 
ocf::ping::ping:1 for client 30710, its parameters:
  CRM_meta_interval=[60000] multiplier=[1000] CRM_meta_start_delay=[15000] dampen=[5s] 
CRM_meta_on_fail=[restart] CRM_meta_timeout=[75000]
  CRM_meta_clone_max=[2] crm_feature_set=[3.0.1] debug=[true] host_list=[10.250.100.20] 
CRM_meta_globally_unique=[false]
  CRM_meta_name=[monitor] CRM_meta_clone=[1]  cancelled.
Sep  2 16:22:33 frog crmd: [30710]: debug: process_lrm_event: Op ping:1_monitor_60000 (call=40): no 
delete event required
Sep  2 16:22:33 frog crmd: [30710]: info: process_lrm_event: LRM operation ping:1_monitor_60000 
(call=40, status=1, cib-update=0, confirmed=true) Cancelled

crm status stills shows the ping clone set (NICs_Connected) up.

frog:~ # crm status
============
Last updated: Thu Sep  2 16:31:57 2010
Stack: openais
Current DC: frog - partition with quorum
Version: 1.0.6-c48e3360eb18c53fd68bb7e7dbe39279ccbc0354
2 Nodes configured, 2 expected votes
3 Resources configured.
============

Online: [ toad frog ]

  Resource Group: nasGroup
      XVM        (ocf::sgi:lxvm):        Started frog
      data-fs    (ocf::heartbeat:Filesystem):    Started frog
      ha1-gige_ip        (ocf::heartbeat:IPaddr2):       Started frog
      ha-gige2_ip        (ocf::heartbeat:IPaddr2):       Started frog
      NFS        (ocf::heartbeat:nfsserver):     Started frog
      NMB        (lsb:nmb):      Started frog
      SMB        (lsb:smb):      Started frog
  Clone Set: stonith-sgi-ipmi-set
      Started: [ toad frog ]
  Clone Set: NICs_Connected
      Started: [ toad frog ]

Some lrm information from the CIB ...
(NOTE: that this has happened twice wit ha stop/start of NICs_Connected in between)

       <transient_attributes id="frog">
         <instance_attributes id="status-frog">
           <nvpair id="status-frog-probe_complete" name="probe_complete" value="true"/>
           <nvpair id="status-frog-fail-count-ping:1" name="fail-count-ping:1" value="2"/>
           <nvpair id="status-frog-last-failure-ping:1" name="last-failure-ping:1" value="1283462466"/>
           <nvpair id="status-frog-pingd" name="pingd" value="1000"/>
         </instance_attributes>
       </transient_attributes>

           <lrm_resource id="ping:1" type="ping" class="ocf" provider="pacemaker">
             <lrm_rsc_op id="ping:1_monitor_0" operation="monitor" 
crm-debug-origin="do_update_resource" crm_feature_set="3.0.1" 
transition-key="22:0:7:e1316a41-526c-4161-b9d2-1cff8e8675ee"
transition-magic="0:7;22:0:7:e1316a41-526c-4161-b9d2-1cff8e8675ee" call-id="10" rc-code="7" 
op-status="0" interval="0" last-run="1283461342"
last-rc-change="1283461342" exec-time="10" queue-time="2020" 
op-digest="56ca481ec984b961026201a8e565be64"/>
             <lrm_rsc_op id="ping:1_start_0" operation="start" crm-debug-origin="do_update_resource" 
crm_feature_set="3.0.1" transition-key="45:11:0:e1316a41-526c-4161-b9d2-1cff8e8675ee"
transition-magic="0:0;45:11:0:e1316a41-526c-4161-b9d2-1cff8e8675ee" call-id="39" rc-code="0" 
op-status="0" interval="0" last-run="1283462465"
last-rc-change="1283462465" exec-time="4020" queue-time="0" 
op-digest="79e8cf4c6a882e806598939b861da7ad"/>
             <lrm_rsc_op id="ping:1_monitor_60000" operation="monitor" 
crm-debug-origin="do_update_resource" crm_feature_set="3.0.1" 
transition-key="11:11:0:e1316a41-526c-4161-b9d2-1cff8e8675ee"
transition-magic="0:0;11:11:0:e1316a41-526c-4161-b9d2-1cff8e8675ee" call-id="40" rc-code="0" 
op-status="0" interval="60000" last-run="1283462484"
last-rc-change="1283462484" exec-time="4020" queue-time="0" 
op-digest="c7d8c8839ca98f5d0ca1ca7ca026e0d8"/>
             <lrm_rsc_op id="ping:1_stop_0" operation="stop" crm-debug-origin="do_update_resource" 
crm_feature_set="3.0.1" transition-key="12:11:0:e1316a41-526c-4161-b9d2-1cff8e8675ee"
transition-magic="0:0;12:11:0:e1316a41-526c-4161-b9d2-1cff8e8675ee" call-id="38" rc-code="0" 
op-status="0" interval="0" last-run="1283462465"
last-rc-change="1283462465" exec-time="10" queue-time="0" op-digest="79e8cf4c6a882e806598939b861da7ad"/>
           </lrm_resource>

So the lrm is obviously issuing the cancel ... but why?

Note that I have also tested this on a full SLES11SP1 cluster running SLE11-HAE-SP1 and the ping 
resources appears to run properly in that instance. Therefore, whatever bug is at play here was 
obviously fixed at some point. Right now I am looking for a little more problem definition and maybe 
even the community bug involved so I can go to Novell and get them to provide a patch for site that 
is hitting this problem.

Thanks for any help you can provide -- Ron

Ron Kerry wrote:
> Andrew Beekhof wrote:
>  > On Wed, Sep 1, 2010 at 2:51 PM, Ron Kerry <rkerry at sgi.com> wrote:
>  >  > I have taken over working this issue from Vince. The ping clone
>  > resource and
>  >  > constraints were setup as described in the prior attached link.
>  > Things were
>  >  > still not working correctly and the resources were not failing over as
>  >  > expected when we ifconfig'd one of the monitored interfaces down. I
>  >  > discovered a bug in the pacemaker/ping script (from the SLE11 HAE
>  >  > distribution) where a "*" in an expr statement had not been quoted
>  > and was
>  >  > thus being interpreted by the shell.
>  >
>  > Also fixed upstream.
>  >
>  >  > I fixed this problem and I was able to
>  >  > get a single failover to occur, but after that failover the ping
>  > monitor was
>  >  > canceled on the node that had the downed interface. Even after
>  > configuring
>  >  > the interface back up, the monitor task never run again to notice
>  > that fact.
>  >  > This essentially leaves that node with a lower score and improper
>  > interface
>  >  > monitoring. I can clear the problem by stopping and then starting the
>  > ping
>  >  > clone resource. Note that I have tried pulling up the full ping 
> resource
>  >  > agent script from the SLE11 HAE SP1 distribution and that does not
>  > improve
>  >  > this particular problem (though it fixes a few others).
>  >  >
>  >  > I have attached the full hb_report output, but here is a log snip of
>  > what is
>  >  > occurring.
>  >  >
>  >  > Sep  1 06:43:50 hpcnas2 root: ifconfig eth3 down
>  >  > Sep  1 06:43:59 hpcnas2 ntpd[10303]: Deleting interface #13 eth3,
>  >  > 10.10.20.32#123, interface stats: received=0, sent=0, dropped=0,
>  >  > active_time=42600 secs
>  >  > Sep  1 06:43:59 hpcnas2 ntpd[10303]: Deleting interface #15 eth3,
>  >  > 10.10.20.33#123, interface stats: received=0, sent=0, dropped=0,
>  >  > active_time=41100 secs
>  >  > Sep  1 06:44:01 hpcnas2 ping[28882]: [28887]: INFO: ping monitor 
> invoked
>  >  > Sep  1 06:44:05 hpcnas2 ping[28882]: [28895]: ERROR: Unexpected
>  > result for
>  >  > 'ping -n -q -W 5 -c 5 10.10.20.30' 2: connect: Network is unreachable
>  >  > Sep  1 06:44:14 hpcnas2 attrd: [13676]: info: attrd_trigger_update:
>  > Sending
>  >  > flush op to all hosts for: pingd (2000)
>  >  > Sep  1 06:44:14 hpcnas2 attrd: [13676]: info: 
> attrd_perform_update: Sent
>  >  > update 56: pingd=2000
>  >  > Sep  1 06:44:14 hpcnas2 crmd: [13678]: info: do_lrm_rsc_op: Performing
>  >  > key=34:686:0:bbe666a5-2b9f-4419-9728-803197b6e643 op=NFS_stop_0 )
>  >  > Sep  1 06:44:14 hpcnas2 lrmd: [13675]: info: rsc:NFS:83: stop
>  >  > ...
>  >  > resources failover
>  >  > ...
>  >  > Sep  1 06:45:09 hpcnas2 ping[29241]: [29246]: INFO: ping monitor 
> invoked
>  >  > Sep  1 06:45:13 hpcnas2 ping[29241]: [29254]: ERROR: Unexpected
>  > result for
>  >  > 'ping -n -q -W 5 -c 5 10.10.20.30' 2: connect: Network is unreachable
>  >  > Sep  1 06:45:17 hpcnas2 crmd: [13678]: info: process_lrm_event: LRM
>  >  > operation ping:1_monitor_60000 (call=82, status=1, cib-update=0,
>  >  > confirmed=true) Cancelled
>  >  > Sep  1 06:45:32 hpcnas2 kernel: bnx2: eth3: using MSIX
>  >  > Sep  1 06:45:35 hpcnas2 kernel: bnx2: eth3 NIC Copper Link is Up,
>  > 1000 Mbps
>  >  > full duplex
>  >  > Sep  1 06:45:38 hpcnas2 root: ifconfig eth3 up
>  >  > Sep  1 06:48:08 hpcnas2 root: ping monitor appears to be no longer
>  > running
>  >  >
>  >  >
>  >  > The concern is the "process_lrm_event: LRM operation
>  > ping:1_monitor_60000 ()
>  >  > Cancelled" event.
>  >
>  > Was the resource stopped?  Thats the only time I could imagine a
>  > recurring operation being cancelled.
> 
> No it was not stopped. In fact, from the "crm_mon" output that is 
> included with the hb_report output
> you can see that the resource still shows as running on both HA cluster 
> nodes. How can I dig further
> to figure out what and why the monitor operation is being canceled.
> 
> 
>  >
>  >  > NOTE: The "ping monitor invoked" messages are a debug statement I
>  > added to
>  >  > the RA script so I know when the ping_monitor() routine is called.
>  >  >
>  >  > Thanks for any assistance you can provide -- Ron
>  >  >
> 
> 
> 
> --
> 
> Ron Kerry         rkerry at sgi.com
> Field Technical Support - SGI Federal
> Home Office: 248 375-5671  Cell: 248 761-7204
> 
> --------------
> NB: Information in this message is SGI confidential. It is intended 
> solely for
> the person(s) to whom it is addressed and may not be copied, used, 
> disclosed or
> distributed to others without SGI consent. If you are not the intended
> recipient please notify me by email or telephone, delete the message 
> from your
> system immediately and destroy any printed copies.
> 

-- 

Ron Kerry         rkerry at sgi.com
Field Technical Support - SGI Federal
Home Office: 248 375-5671  Cell: 248 761-7204

--------------
NB: Information in this message is SGI confidential. It is intended solely for
the person(s) to whom it is addressed and may not be copied, used, disclosed or
distributed to others without SGI consent. If you are not the intended
recipient please notify me by email or telephone, delete the message from your
system immediately and destroy any printed copies.