[Pacemaker] monitor operation cancel question

Tue Sep 21 18:58:19 UTC 2010

Hi,

This is my first post to this list so if I'm doing this wrong, please be 
patient. I am using pacemaker-1.1.2-0.2.1 on sles11sp1. Thanks in 
advance for any help anyone can give me.

I have configured an HA group with a number of primitives. As part of adding
the primitives, we do a simple manual relocation of the group from one 
node to the other and back again after each prmitive has been added to 
the group to verfiy the config as we go. I had done this for 7 primitives.

After I added this last primitive, doing a relocation:

crm_resource -r dmfGroup -M -H node2

This results in the primitive that was added last (tmf), and which gets 
stopped first, failing becuase the periodic monitor operation for the 
resource doesn't get cancelled like the rest of the primitive's monitor 
ops do. I have added a snippet from the log. Notice that the VirtualIP 
and  dskmspfs primitives get their monitor ops cancelled, but tmf 
didn't. There is a process_pe_message with an error during what appears 
to be the tmf primitive processing, but I can find no errors using 
crm_verify no mattter how many V's is use to indicate a problem with 
tmf's config. the tmf resource works fine until it is time to  relocate.

Does anyone have any clues for me where to look next ?

Here is tmf's xml:

        <primitive class="ocf" id="tmf" provider="sgi" type="tmf">
          <operations id="tmf-operations">
            <op id="tmf-op-monitor-30s" interval="30s" name="monitor" 
on-fail="restart" start-delay="480s" timeout="30s"/>
            <op id="tmf-op-start-0" interval="0" name="start" 
on-fail="restart" requires="fencing" timeout="480s"/>
            <op id="tmf-op-stop-0" interval="0" name="stop" 
on-fail="fence" timeout="480s"/>
          </operations>
          <instance_attributes id="tmf-instance_attributes">
            <nvpair id="tmf-instance_attributes-devgrpnames" 
name="devgrpnames" value="drives"/>
            <nvpair id="tmf-instance_attributes-mindevsup" 
name="mindevsup" value="1"/>
            <nvpair id="tmf-instance_attributes-devtimeout" 
name="devtimeout" value="240"/>
            <nvpair id="tmf-instance_attributes-loader_names" 
name="loader_names" value="sl3000"/>
            <nvpair id="tmf-instance_attributes-loader_hosts" 
name="loader_hosts" value="acsls1"/>
            <nvpair id="tmf-instance_attributes-loader_users" 
name="loader_users" value="root,acssa"/>
            <nvpair id="tmf-instance_attributes-loader_passwords" 
name="loader_passwords" value="weasel!,dog3ear"/>
          </instance_attributes>
          <meta_attributes id="tmf-meta_attributes">
            <nvpair id="tmf-meta_attributes-resource-stickiness" 
name="resource-stickiness" value="1"/>
            <nvpair id="tmf-meta_attributes-migration-threshold" 
name="migration-threshold" value="1"/>
          </meta_attributes>
        </primitive>

Here's the log:

Sep 21 10:35:45 pry crm_resource: [14746]: info: Invoked: crm_resource 
-r dmfGroup -M -H punch
Sep 21 10:35:45 pry cib: [5597]: info: cib_process_request: Operation 
complete: op cib_delete for section constraints 
(origin=local/crm_resource/3, version=0.85.17): ok (rc=0)
Sep 21 10:35:45 pry cib: [5597]: info: log_data_element: cib:diff: - 
<cib admin_epoch="0" epoch="85" num_updates="17" />
Sep 21 10:35:45 pry cib: [5597]: info: log_data_element: cib:diff: + 
<cib admin_epoch="0" epoch="86" num_updates="1" >
Sep 21 10:35:45 pry cib: [5597]: info: log_data_element: cib:diff: +   
<configuration >
Sep 21 10:35:45 pry cib: [5597]: info: log_data_element: cib:diff: +     
<constraints >
Sep 21 10:35:45 pry cib: [5597]: info: log_data_element: cib:diff: 
+       <rsc_location id="cli-prefer-dmfGroup" rsc="dmfGroup" 
__crm_diff_marker__="added:top" >
Sep 21 10:35:45 pry cib: [5597]: info: log_data_element: cib:diff: 
+         <rule id="cli-prefer-rule-dmfGroup" score="INFINITY" 
boolean-op="and" >
Sep 21 10:35:45 pry cib: [5597]: info: log_data_element: cib:diff: 
+           <expression id="cli-prefer-expr-dmfGroup" attribute="#uname" 
operation="eq" value="punch" type="string" />
Sep 21 10:35:45 pry cib: [5597]: info: log_data_element: cib:diff: 
+         </rule>
Sep 21 10:35:45 pry cib: [5597]: info: log_data_element: cib:diff: 
+       </rsc_location>
Sep 21 10:35:45 pry cib: [5597]: info: log_data_element: cib:diff: +     
</constraints>
Sep 21 10:35:45 pry cib: [5597]: info: log_data_element: cib:diff: +   
</configuration>
Sep 21 10:35:45 pry cib: [5597]: info: log_data_element: cib:diff: + </cib>
Sep 21 10:35:45 pry cib: [5597]: info: cib_process_request: Operation 
complete: op cib_modify for section constraints 
(origin=local/crm_resource/4, version=0.86.1): ok (rc=0)
Sep 21 10:35:45 pry crmd: [5601]: info: abort_transition_graph: 
need_abort:59 - Triggered transition abort (complete=1) : Non-status change
Sep 21 10:35:46 pry crmd: [5601]: info: need_abort: Aborting on change 
to admin_epoch
Sep 21 10:35:46 pry crmd: [5601]: info: do_state_transition: State 
transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC 
cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Sep 21 10:35:46 pry crmd: [5601]: info: do_state_transition: All 2 
cluster nodes are eligible to run resources.
Sep 21 10:35:46 pry crmd: [5601]: info: do_pe_invoke: Query 258: 
Requesting the current CIB: S_POLICY_ENGINE
Sep 21 10:35:46 pry kernel: cib(5597): unaligned access to 
0x2000000001801b6d, ip=0x200000000042e830
Sep 21 10:35:46 pry kernel: cib(5597): unaligned access to 
0x2000000001801b7d, ip=0x2000000000430dd0
Sep 21 10:35:46 pry kernel: cib(5597): unaligned access to 
0x2000000001801b6d, ip=0x2000000000431021
Sep 21 10:35:46 pry kernel: cib(5597): unaligned access to 
0x2000000001801b6d, ip=0x2000000000431241
Sep 21 10:35:46 pry kernel: cib(5597): unaligned access to 
0x2000000001801b6d, ip=0x2000000000431401
Sep 21 10:35:46 pry crmd: [5601]: info: do_pe_invoke_callback: Invoking 
the PE: query=258, ref=pe_calc-dc-1285083346-211, seq=156, quorate=1
Sep 21 10:35:46 pry pengine: [5600]: info: unpack_config: Startup 
probes: enabled
Sep 21 10:35:46 pry pengine: [5600]: notice: unpack_config: On loss of 
CCM Quorum: Ignore
Sep 21 10:35:46 pry pengine: [5600]: info: unpack_config: Node scores: 
'red' = -INFINITY, 'yellow' = 0, 'green' = 0
Sep 21 10:35:46 pry pengine: [5600]: info: unpack_domains: Unpacking domains
Sep 21 10:35:46 pry pengine: [5600]: info: determine_online_status: Node 
pry is online
Sep 21 10:35:46 pry pengine: [5600]: info: determine_online_status: Node 
punch is online
Sep 21 10:35:46 pry pengine: [5600]: notice: group_print:  Resource 
Group: dmfGroup
Sep 21 10:35:46 pry pengine: [5600]: notice: native_print:      
local_xvm       (ocf::sgi:lxvm):        Started pry
Sep 21 10:35:46 pry pengine: [5600]: notice: native_print:      
dmfusr1fs       (ocf::heartbeat:Filesystem):    Started pry
Sep 21 10:35:46 pry pengine: [5600]: notice: native_print:      
dmfusr2fs       (ocf::heartbeat:Filesystem):    Started pry
Sep 21 10:35:46 pry pengine: [5600]: notice: native_print:      homefs  
(ocf::heartbeat:Filesystem):    Started pry
Sep 21 10:35:46 pry cib: [14747]: info: write_cib_contents: Archived 
previous version as /var/lib/heartbeat/crm/cib-86.raw
Sep 21 10:35:46 pry pengine: [5600]: notice: native_print:      
journalfs       (ocf::heartbeat:Filesystem):    Started pry
Sep 21 10:35:46 pry pengine: [5600]: notice: native_print:      spoolfs 
(ocf::heartbeat:Filesystem):    Started pry
Sep 21 10:35:46 pry pengine: [5600]: notice: native_print:      tmpfs   
(ocf::heartbeat:Filesystem):    Started pry
Sep 21 10:35:46 pry pengine: [5600]: notice: native_print:      movefs  
(ocf::heartbeat:Filesystem):    Started pry
Sep 21 10:35:46 pry pengine: [5600]: notice: native_print:      
dskmspfs        (ocf::heartbeat:Filesystem):    Started pry
Sep 21 10:35:46 pry pengine: [5600]: notice: native_print:      
VirtualIP       (ocf::heartbeat:IPaddr2):       Started pry
Sep 21 10:35:46 pry pengine: [5600]: notice: native_print:      tmf     
(ocf::sgi:tmf): Started pry
Sep 21 10:35:46 pry pengine: [5600]: ERROR: unpack_operation: Specifying 
on_fail=fence and stonith-enabled=false makes no sense
Sep 21 10:35:46 pry pengine: [5600]: notice: RecurringOp:  Start 
recurring monitor (30s) for local_xvm on punch
Sep 21 10:35:46 pry pengine: [5600]: ERROR: unpack_operation: Specifying 
on_fail=fence and stonith-enabled=false makes no sense
Sep 21 10:35:46 pry pengine: [5600]: notice: RecurringOp:  Start 
recurring monitor (20s) for dmfusr1fs on punch
Sep 21 10:35:46 pry pengine: [5600]: ERROR: unpack_operation: Specifying 
on_fail=fence and stonith-enabled=false makes no sense
Sep 21 10:35:46 pry pengine: [5600]: notice: RecurringOp:  Start 
recurring monitor (20s) for dmfusr2fs on punch
Sep 21 10:35:46 pry pengine: [5600]: ERROR: unpack_operation: Specifying 
on_fail=fence and stonith-enabled=false makes no sense
Sep 21 10:35:46 pry pengine: [5600]: notice: RecurringOp:  Start 
recurring monitor (20s) for homefs on punch
Sep 21 10:35:46 pry pengine: [5600]: ERROR: unpack_operation: Specifying 
on_fail=fence and stonith-enabled=false makes no sense
Sep 21 10:35:46 pry pengine: [5600]: notice: RecurringOp:  Start 
recurring monitor (20s) for journalfs on punch
Sep 21 10:35:46 pry pengine: [5600]: ERROR: unpack_operation: Specifying 
on_fail=fence and stonith-enabled=false makes no sense
Sep 21 10:35:46 pry pengine: [5600]: notice: RecurringOp:  Start 
recurring monitor (20s) for spoolfs on punch
Sep 21 10:35:46 pry pengine: [5600]: ERROR: unpack_operation: Specifying 
on_fail=fence and stonith-enabled=false makes no sense
Sep 21 10:35:46 pry pengine: [5600]: notice: RecurringOp:  Start 
recurring monitor (20s) for tmpfs on punch
Sep 21 10:35:46 pry pengine: [5600]: ERROR: unpack_operation: Specifying 
on_fail=fence and stonith-enabled=false makes no sense
Sep 21 10:35:46 pry pengine: [5600]: notice: RecurringOp:  Start 
recurring monitor (20s) for movefs on punch
Sep 21 10:35:46 pry pengine: [5600]: ERROR: unpack_operation: Specifying 
on_fail=fence and stonith-enabled=false makes no sense
Sep 21 10:35:46 pry pengine: [5600]: notice: RecurringOp:  Start 
recurring monitor (20s) for dskmspfs on punch
Sep 21 10:35:46 pry pengine: [5600]: ERROR: unpack_operation: Specifying 
on_fail=fence and stonith-enabled=false makes no sense
Sep 21 10:35:46 pry pengine: [5600]: notice: RecurringOp:  Start 
recurring monitor (10s) for VirtualIP on punch
Sep 21 10:35:46 pry pengine: [5600]: ERROR: unpack_operation: Specifying 
on_fail=fence and stonith-enabled=false makes no sense
Sep 21 10:35:46 pry pengine: [5600]: notice: RecurringOp:  Start 
recurring monitor (30s) for tmf on punch
Sep 21 10:35:46 pry cib: [14747]: info: write_cib_contents: Wrote 
version 0.86.0 of the CIB to disk (digest: 58aef84a6b3b4a14ca7ed17f167b6fcd)
Sep 21 10:35:46 pry pengine: [5600]: notice: LogActions: Move resource 
local_xvm        (Started pry -> punch)
Sep 21 10:35:46 pry pengine: [5600]: notice: LogActions: Move resource 
dmfusr1fs        (Started pry -> punch)
Sep 21 10:35:46 pry pengine: [5600]: notice: LogActions: Move resource 
dmfusr2fs        (Started pry -> punch)
Sep 21 10:35:46 pry pengine: [5600]: notice: LogActions: Move resource 
homefs   (Started pry -> punch)
Sep 21 10:35:46 pry pengine: [5600]: notice: LogActions: Move resource 
journalfs        (Started pry -> punch)
Sep 21 10:35:46 pry pengine: [5600]: notice: LogActions: Move resource 
spoolfs  (Started pry -> punch)
Sep 21 10:35:46 pry pengine: [5600]: notice: LogActions: Move resource 
tmpfs    (Started pry -> punch)
Sep 21 10:35:46 pry pengine: [5600]: notice: LogActions: Move resource 
movefs   (Started pry -> punch)
Sep 21 10:35:46 pry pengine: [5600]: notice: LogActions: Move resource 
dskmspfs (Started pry -> punch)
Sep 21 10:35:46 pry cib: [14747]: info: retrieveCib: Reading cluster 
configuration from: /var/lib/heartbeat/crm/cib.3aJ5F5 (digest: 
/var/lib/heartbeat/crm/cib.TlYNDx)
Sep 21 10:35:46 pry pengine: [5600]: notice: LogActions: Move resource 
VirtualIP        (Started pry -> punch)
Sep 21 10:35:46 pry pengine: [5600]: notice: LogActions: Move resource 
tmf      (Started pry -> punch)
Sep 21 10:35:46 pry crmd: [5601]: info: do_state_transition: State 
transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS 
cause=C_IPC_MESSAGE origin=handle_response ]
Sep 21 10:35:46 pry crmd: [5601]: info: unpack_graph: Unpacked 
transition 52: 38 actions in 38 synapses
Sep 21 10:35:46 pry crmd: [5601]: info: do_te_invoke: Processing graph 
52 (ref=pe_calc-dc-1285083346-211) derived from 
/var/lib/pengine/pe-input-110.bz2
Sep 21 10:35:46 pry crmd: [5601]: info: te_pseudo_action: Pseudo action 
51 fired and confirmed
Sep 21 10:35:46 pry crmd: [5601]: info: te_rsc_command: Initiating 
action 46: stop tmf_stop_0 on pry (local)
Sep 21 10:35:46 pry crmd: [5601]: info: do_lrm_rsc_op: Performing 
key=46:52:0:bb6ff40d-ddaf-4b6f-9d79-e06944f1dac5 op=tmf_stop_0 )
Sep 21 10:35:46 pry lrmd: [5598]: info: rsc:tmf:74: stop
Sep 21 10:35:46 pry pengine: [5600]: info: process_pe_message: 
Transition 52: PEngine Input stored in: /var/lib/pengine/pe-input-110.bz2
Sep 21 10:35:46 pry pengine: [5600]: info: process_pe_message: 
Configuration ERRORs found during PE processing.  Please run "crm_verify 
-L" to identify issues.
Sep 21 10:35:46 pry TapeReleased[8042]: |$(413) 
pci0002:00:01.1/fc/500104f000acc66d-500104f000acc66e/lun0, pid=14775, 
device released
Sep 21 10:35:46 pry tmf[14748]: INFO: stop_all_tmf_resources(): 
successfully stopped all devices
Sep 21 10:35:46 pry tmf[14748]: INFO: stop_tmf(): tmf successfully stopped
Sep 21 10:35:46 pry crmd: [5601]: info: process_lrm_event: LRM operation 
tmf_stop_0 (call=74, rc=0, cib-update=259, confirmed=true) ok
Sep 21 10:35:46 pry crmd: [5601]: info: match_graph_event: Action 
tmf_stop_0 (46) confirmed on pry (rc=0)
Sep 21 10:35:46 pry crmd: [5601]: info: te_rsc_command: Initiating 
action 43: stop VirtualIP_stop_0 on pry (local)
Sep 21 10:35:46 pry lrmd: [5598]: info: cancel_op: operation monitor[68] 
on ocf::IPaddr2::VirtualIP for client 5601, its parameters: 
CRM_meta_interval=[10000] cidr_netmask=[24] nic=[eth0] 
broadcast=[128.162.246.255] crm_feature_set=[3.0.2] ip=[128.162.246.8] 
CRM_meta_on_fail=[restart] CRM_meta_name=[monitor] 
CRM_meta_start_delay=[90000] CRM_meta_timeout=[110000] 
CRM_meta_requires=[fencing]  cancelled
Sep 21 10:35:46 pry crmd: [5601]: info: do_lrm_rsc_op: Performing 
key=43:52:0:bb6ff40d-ddaf-4b6f-9d79-e06944f1dac5 op=VirtualIP_stop_0 )
Sep 21 10:35:46 pry lrmd: [5598]: info: rsc:VirtualIP:75: stop
Sep 21 10:35:46 pry crmd: [5601]: info: process_lrm_event: LRM operation 
VirtualIP_monitor_10000 (call=68, status=1, cib-update=0, 
confirmed=true) Cancelled
Sep 21 10:35:47 pry IPaddr2[14796]: INFO: IP status = ok, IP_CIP=
Sep 21 10:35:47 pry IPaddr2[14796]: INFO: ip -f inet addr delete 
128.162.246.8/24 dev eth0
Sep 21 10:35:47 pry crmd: [5601]: info: process_lrm_event: LRM operation 
VirtualIP_stop_0 (call=75, rc=0, cib-update=260, confirmed=true) ok
Sep 21 10:35:47 pry crmd: [5601]: info: match_graph_event: Action 
VirtualIP_stop_0 (43) confirmed on pry (rc=0)
Sep 21 10:35:47 pry crmd: [5601]: info: te_rsc_command: Initiating 
action 40: stop dskmspfs_stop_0 on pry (local)
Sep 21 10:35:47 pry lrmd: [5598]: info: cancel_op: operation monitor[66] 
on ocf::Filesystem::dskmspfs for client 5601, its parameters: 
CRM_meta_requires=[fencing] fstype=[xfs] 
device=[/dev/lxvm/tp9400_11_12_dmfbases5] crm_feature_set=[3.0.2] 
options=[rw,dmapi,mtpt=/dmf/dskmsp_store] directory=[/dmf/dskmsp_store] 
CRM_meta_on_fail=[restart] CRM_meta_name=[monitor] 
CRM_meta_start_delay=[60000] CRM_meta_interval=[20000] 
CRM_meta_timeout=[100000]  cancelled
Sep 21 10:35:47 pry crmd: [5601]: info: do_lrm_rsc_op: Performing 
key=40:52:0:bb6ff40d-ddaf-4b6f-9d79-e06944f1dac5 op=dskmspfs_stop_0 )
Sep 21 10:35:47 pry lrmd: [5598]: info: rsc:dskmspfs:76: stop
Sep 21 10:35:47 pry crmd: [5601]: info: process_lrm_event: LRM operation 
dskmspfs_monitor_20000 (call=66, status=1, cib-update=0, confirmed=true) 
Cancelled

-- 
	Phil Armstrong       pma at sgi.com
	Phone: 651-683-5561  VNET 233-5561