[Pacemaker] ClusterMon
Ryan Steele
ryans@aweber.com
Mon Dec 6 01:26:40 UTC 2010
Hi folks,
I'd like to use crm_mon for monitoring and email notifications, but I've
hit a snag incorporating it into the crm configuration. When I run
crm_mon manually from the command line (with no corresponding resource
in the crm configuration), it all works great, but obviously running
crm_mon manually on every cluster member would result in a litany of
duplicate messages for each resource migration, which is why I'm looking
to let the cluster itself manage it. Unfortunately, the exact same
crm_mon configuration, when entered into the cib, sends no mail at all
and prints no errors. To get the crm_mon configuration into the cib, I
first tried using the scriptable crm utility, but it didn't seem to like
that very much:
# crm configure primitive ResourceMonitor ocf:pacemaker:ClusterMon
params pidfile="/var/run/crm_mon.pid" htmlfile="/var/tmp/crm_mon.html"
extra_options="-T ops at example.com -F 'Cluster Monitor
<ClusterMonitor at example.com>' -H smtp.example.com:25 -P '[LDAP Cluster]:
Resource Changes Detected'" op monitor interval="10s" timeout="20s"
element nvpair: Relax-NG validity error : Type ID doesn't allow value
'ResourceMonitor-instance_attributes-ops@example.com'
element nvpair: Relax-NG validity error : Element nvpair failed to
validate attributes
Relax-NG validity error : Extra element nvpair in interleave
element nvpair: Relax-NG validity error : Element instance_attributes
failed to validate content
Relax-NG validity error : Extra element instance_attributes in interleave
element cib: Relax-NG validity error : Element cib failed to validate
content
crm_verify[1762]: 2010/12/05_19:23:03 ERROR: main: CIB did not pass
DTD/schema validation
Errors found during check: config not valid
ERROR: ResourceMonitor: parameter -F does not exist
ERROR: ResourceMonitor: parameter [LDAP Cluster]: Resource Changes
Detected does not exist
ERROR: ResourceMonitor: parameter Cluster Monitor
<ClusterMonitor@example.com> does not exist
ERROR: ResourceMonitor: parameter -H does not exist
ERROR: ResourceMonitor: parameter smtp.example.com:25 does not exist
ERROR: ResourceMonitor: parameter ops@example.com does not exist
ERROR: ResourceMonitor: parameter -P does not exist
WARNING: ResourceMonitor: default timeout 20s for start is smaller than
the advised 90
WARNING: ResourceMonitor: default timeout 20s for stop is smaller than
the advised 100
I know those are valid options, since the identical invocation works
from the CLI; judging by the errors, crm split extra_options on
whitespace and treated each token as a separate resource parameter.
Hoping it was just an interpolation issue or something like that, I
tried going through the interactive crm shell instead. That approach
appeared to work (albeit with the same timeout warnings):
crm(live)configure# primitive ResourceMonitor ocf:pacemaker:ClusterMon
params pidfile="/var/run/crm_mon.pid" htmlfile="/var/tmp/crm_mon.html"
extra_options="-T ops at example.com -F 'Cluster Monitor
<ClusterMonitor at example.com>' -H smtp.example.com:25 -P '[LDAP Cluster]:
Resource Changes Detected'" op monitor interval="10s" timeout="20s"
WARNING: ResourceMonitor: default timeout 20s for start is smaller than
the advised 90
WARNING: ResourceMonitor: default timeout 20s for stop is smaller than
the advised 100
crm(live)configure# commit
WARNING: ResourceMonitor: default timeout 20s for start is smaller than
the advised 90
WARNING: ResourceMonitor: default timeout 20s for stop is smaller than
the advised 100
crm(live)configure# exit
bye
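As an aside, in case the one-liner mangling bites anyone else: crm also
appears to read sub-commands from stdin (or from a file via -f), which
removes one layer of shell quoting. A rough sketch of what I mean,
untested on my end, with the space-containing -F/-P values dropped to
keep the quoting trivial:

# crm configure <<'EOF'
primitive ResourceMonitor ocf:pacemaker:ClusterMon \
    params pidfile="/var/run/crm_mon.pid" htmlfile="/var/tmp/crm_mon.html" \
    extra_options="-T ops@example.com -H smtp.example.com:25" \
    op monitor interval="10s" timeout="20s"
commit
EOF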
After adding it via the crm shell, the crm_mon daemon is definitely
running (and migrates to another node if I shut down or restart corosync
on the node currently running it), but I'm not getting any email
messages. My mail server logs confirm that no message ever arrives while
the crm_mon configuration lives in the cluster, yet the very same
command works when run manually from the command line. There are no
errors or warnings in the logs either, so I'm not sure what to attribute
the problem to.
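One sanity check I can run is dumping the exact argument list the
cluster-spawned crm_mon received, on the node where ResourceMonitor is
active; if the quotes inside extra_options aren't surviving intact, it
should show up here:

# ps -ww -o pid,args -C crm_mon
# tr '\0' ' ' < /proc/$(cat /var/run/crm_mon.pid)/cmdline; echo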
Here are the cluster log messages resulting from a simple resource
migration on the host running the crm_mon daemon that was spawned by the
cluster:
Dec 5 20:05:00 ldap3 external/ipmi[7032]: [7041]: debug: ipmitool
output: Chassis Power is on
Dec 5 20:05:03 ldap3 cib: [6496]: info: cib_process_request: Operation
complete: op cib_delete for section constraints
(origin=ldap4/crm_resource/3, version=0.78.4): ok (rc=0)
Dec 5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: -
<cib admin_epoch="0" epoch="78" num_updates="4" >
Dec 5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: -
<configuration >
Dec 5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: -
<constraints >
Dec 5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: -
<rsc_location id="cli-prefer-ClusterIP" >
Dec 5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: -
<rule id="cli-prefer-rule-ClusterIP" >
Dec 5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: -
<expression value="ldap4" id="cli-prefer-expr-ClusterIP" />
Dec 5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: -
</rule>
Dec 5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: -
</rsc_location>
Dec 5 20:05:03 ldap3 crmd: [6500]: info: abort_transition_graph:
need_abort:59 - Triggered transition abort (complete=1) : Non-status change
Dec 5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: -
</constraints>
Dec 5 20:05:03 ldap3 crmd: [6500]: info: need_abort: Aborting on change
to admin_epoch
Dec 5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: -
</configuration>
Dec 5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: -
</cib>
Dec 5 20:05:03 ldap3 crmd: [6500]: info: do_state_transition: State
transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC
cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Dec 5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: +
<cib admin_epoch="0" epoch="79" num_updates="1" >
Dec 5 20:05:03 ldap3 crmd: [6500]: info: do_state_transition: All 2
cluster nodes are eligible to run resources.
Dec 5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: +
<configuration >
Dec 5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: +
<constraints >
Dec 5 20:05:03 ldap3 crmd: [6500]: info: do_pe_invoke: Query 63:
Requesting the current CIB: S_POLICY_ENGINE
Dec 5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: +
<rsc_location id="cli-prefer-ClusterIP" >
Dec 5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: +
<rule id="cli-prefer-rule-ClusterIP" >
Dec 5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: +
<expression value="ldap3" id="cli-prefer-expr-ClusterIP" />
Dec 5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: +
</rule>
Dec 5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: +
</rsc_location>
Dec 5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: +
</constraints>
Dec 5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: +
</configuration>
Dec 5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: +
</cib>
Dec 5 20:05:03 ldap3 cib: [6496]: info: cib_process_request: Operation
complete: op cib_modify for section constraints
(origin=ldap4/crm_resource/4, version=0.79.1): ok (rc=0)
Dec 5 20:05:03 ldap3 crmd: [6500]: info: do_pe_invoke_callback:
Invoking the PE: query=63, ref=pe_calc-dc-1291597503-34, seq=88, quorate=1
Dec 5 20:05:03 ldap3 pengine: [6499]: notice: unpack_config: On loss of
CCM Quorum: Ignore
Dec 5 20:05:03 ldap3 pengine: [6499]: info: unpack_config: Node scores:
'red' = -INFINITY, 'yellow' = 0, 'green' = 0
Dec 5 20:05:03 ldap3 pengine: [6499]: info: determine_online_status:
Node ldap3 is online
Dec 5 20:05:03 ldap3 pengine: [6499]: info: determine_online_status:
Node ldap4 is online
Dec 5 20:05:03 ldap3 pengine: [6499]: notice: native_print:
ClusterIP (ocf::heartbeat:IPaddr2): Started ldap4
Dec 5 20:05:03 ldap3 pengine: [6499]: notice: native_print:
ldap3-stonith (stonith:external/ipmi): Started ldap4
Dec 5 20:05:03 ldap3 pengine: [6499]: notice: native_print:
ldap4-stonith (stonith:external/ipmi): Started ldap3
Dec 5 20:05:03 ldap3 pengine: [6499]: notice: native_print:
ResourceMonitor (ocf::pacemaker:ClusterMon): Started ldap3
Dec 5 20:05:03 ldap3 pengine: [6499]: notice: RecurringOp: Start
recurring monitor (10s) for ClusterIP on ldap3
Dec 5 20:05:03 ldap3 pengine: [6499]: notice: LogActions: Move resource
ClusterIP (Started ldap4 -> ldap3)
Dec 5 20:05:03 ldap3 pengine: [6499]: notice: LogActions: Leave
resource ldap3-stonith (Started ldap4)
Dec 5 20:05:03 ldap3 pengine: [6499]: notice: LogActions: Leave
resource ldap4-stonith (Started ldap3)
Dec 5 20:05:03 ldap3 pengine: [6499]: notice: LogActions: Leave
resource ResourceMonitor (Started ldap3)
Dec 5 20:05:03 ldap3 crmd: [6500]: info: do_state_transition: State
transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS
cause=C_IPC_MESSAGE origin=handle_response ]
Dec 5 20:05:03 ldap3 crmd: [6500]: info: unpack_graph: Unpacked
transition 3: 4 actions in 4 synapses
Dec 5 20:05:03 ldap3 crmd: [6500]: info: do_te_invoke: Processing graph
3 (ref=pe_calc-dc-1291597503-34) derived from
/var/lib/pengine/pe-input-100.bz2
Dec 5 20:05:03 ldap3 crmd: [6500]: info: te_rsc_command: Initiating
action 9: stop ClusterIP_stop_0 on ldap4
Dec 5 20:05:03 ldap3 cib: [7044]: info: write_cib_contents: Archived
previous version as /var/lib/heartbeat/crm/cib-64.raw
Dec 5 20:05:03 ldap3 pengine: [6499]: info: process_pe_message:
Transition 3: PEngine Input stored in: /var/lib/pengine/pe-input-100.bz2
Dec 5 20:05:03 ldap3 cib: [7044]: info: write_cib_contents: Wrote
version 0.79.0 of the CIB to disk (digest: 8689b11ceba2dad1a9d93d704ff47580)
Dec 5 20:05:03 ldap3 cib: [7044]: info: retrieveCib: Reading cluster
configuration from: /var/lib/heartbeat/crm/cib.DN94W1 (digest:
/var/lib/heartbeat/crm/cib.vicJ0i)
Dec 5 20:05:03 ldap3 crmd: [6500]: info: match_graph_event: Action
ClusterIP_stop_0 (9) confirmed on ldap4 (rc=0)
Dec 5 20:05:03 ldap3 crmd: [6500]: info: te_rsc_command: Initiating
action 10: start ClusterIP_start_0 on ldap3 (local)
Dec 5 20:05:03 ldap3 crmd: [6500]: info: do_lrm_rsc_op: Performing
key=10:3:0:32629690-f4fb-43b9-a251-8f6b25e60220 op=ClusterIP_start_0 )
Dec 5 20:05:03 ldap3 lrmd: [6497]: info: rsc:ClusterIP:13: start
Dec 5 20:05:03 ldap3 crmd: [6500]: info: te_pseudo_action: Pseudo
action 5 fired and confirmed
Dec 5 20:05:03 ldap3 IPaddr2[7045]: INFO: ip -f inet addr add
10.1.1.163/32 brd 10.1.1.163 dev eth1
Dec 5 20:05:03 ldap3 IPaddr2[7045]: INFO: ip link set eth1 up
Dec 5 20:05:03 ldap3 IPaddr2[7045]: INFO: /usr/lib/heartbeat/send_arp
-i 200 -r 5 -p /var/run/heartbeat/rsctmp/send_arp/send_arp-10.1.1.163
eth1 10.1.1.163 auto not_used not_used
Dec 5 20:05:03 ldap3 crmd: [6500]: info: process_lrm_event: LRM
operation ClusterIP_start_0 (call=13, rc=0, cib-update=64,
confirmed=true) ok
Dec 5 20:05:03 ldap3 crmd: [6500]: info: match_graph_event: Action
ClusterIP_start_0 (10) confirmed on ldap3 (rc=0)
Dec 5 20:05:03 ldap3 crmd: [6500]: info: te_rsc_command: Initiating
action 11: monitor ClusterIP_monitor_10000 on ldap3 (local)
Dec 5 20:05:03 ldap3 crmd: [6500]: info: do_lrm_rsc_op: Performing
key=11:3:0:32629690-f4fb-43b9-a251-8f6b25e60220 op=ClusterIP_monitor_10000 )
Dec 5 20:05:03 ldap3 lrmd: [6497]: info: rsc:ClusterIP:14: monitor
Dec 5 20:05:03 ldap3 crmd: [6500]: info: process_lrm_event: LRM
operation ClusterIP_monitor_10000 (call=14, rc=0, cib-update=65,
confirmed=false) ok
Dec 5 20:05:03 ldap3 crmd: [6500]: info: match_graph_event: Action
ClusterIP_monitor_10000 (11) confirmed on ldap3 (rc=0)
Dec 5 20:05:03 ldap3 crmd: [6500]: info: run_graph:
====================================================
Dec 5 20:05:03 ldap3 crmd: [6500]: notice: run_graph: Transition 3
(Complete=4, Pending=0, Fired=0, Skipped=0, Incomplete=0,
Source=/var/lib/pengine/pe-input-100.bz2): Complete
Dec 5 20:05:03 ldap3 crmd: [6500]: info: te_graph_trigger: Transition 3
is now complete
Dec 5 20:05:03 ldap3 crmd: [6500]: info: notify_crmd: Transition 3
status: done - <null>
Dec 5 20:05:03 ldap3 crmd: [6500]: info: do_state_transition: State
transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
cause=C_FSA_INTERNAL origin=notify_crmd ]
Dec 5 20:05:03 ldap3 crmd: [6500]: info: do_state_transition: Starting
PEngine Recheck Timer
Here is the output of 'crm configure show':
node ldap3
node ldap4
primitive ClusterIP ocf:heartbeat:IPaddr2 \
params ip="10.1.1.163" cidr_netmask="32" \
op monitor interval="10s"
primitive ResourceMonitor ocf:pacemaker:ClusterMon \
params pidfile="/var/run/crm_mon.pid" htmlfile="/var/tmp/crm_mon.html" \
extra_options="-T ops@example.com -F 'Cluster Monitor <ClusterMonitor@example.com>' -H smtp.example.com:25 -P '[LDAP Cluster]: Resource Changes Detected'" \
op monitor interval="10s" timeout="20s"
primitive ldap3-stonith stonith:external/ipmi \
params hostname="ldap3" ipaddr="10.1.0.5" userid="****" passwd="****" interface="lan" \
op monitor interval="60s" timeout="30s"
primitive ldap4-stonith stonith:external/ipmi \
params hostname="ldap4" ipaddr="10.1.0.6" userid="****" passwd="****" interface="lan" \
op monitor interval="60s" timeout="30s"
location cli-prefer-ClusterIP ClusterIP \
rule $id="cli-prefer-rule-ClusterIP" inf: #uname eq ldap3
location ldap3-stonith-cmdsrc ldap3-stonith -inf: ldap3
location ldap4-stonith-cmdsrc ldap4-stonith -inf: ldap4
property $id="cib-bootstrap-options" \
dc-version="1.0.8-042548a451fce8400660f6031f4da6f0223dd5dd" \
cluster-infrastructure="openais" \
expected-quorum-votes="2" \
stonith-enabled="true" \
no-quorum-policy="ignore"
rsc_defaults $id="rsc-options" \
resource-stickiness="100"
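One thing I plan to try next is stripping the spaces (and with them the
nested quotes) out of extra_options, in case the ClusterMon resource
agent hands that string to crm_mon without re-parsing the quoting the
way my interactive shell does. A simplified variant of the primitive
(the -F and -P values here are just placeholders I made up):

primitive ResourceMonitor ocf:pacemaker:ClusterMon \
params pidfile="/var/run/crm_mon.pid" htmlfile="/var/tmp/crm_mon.html" \
extra_options="-T ops@example.com -F ClusterMonitor@example.com -H smtp.example.com:25 -P LDAP-Cluster-Alert" \
op monitor interval="10s" timeout="20s"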
Other than the monitoring, everything seems to work pretty well, but I
don't want to deploy this in production without a good real-time monitor
of the resource changes, so I'd appreciate any suggestions as to why
crm_mon works when run manually, but not when configured in the
cluster. For reference, I'm running on Ubuntu Server 10.04 LTS (Lucid),
and these are the packages I'm using:
cluster-agents 1:1.0.3-2ubuntu1
cluster-glue 1.0.5-1
corosync 1.2.0-0ubuntu1
libcluster-glue 1.0.5-1
libcorosync-dev 1.2.0-0ubuntu1
libcorosync4 1.2.0-0ubuntu1
libopenais3 1.1.2-0ubuntu1
openais 1.1.2-0ubuntu1
pacemaker 1.0.8+hg15494-2ubuntu2
pacemaker-dev 1.0.8+hg15494-2ubuntu2
Thanks!