[Pacemaker] stonith sbd problem
philipp.achmueller at arz.at
Tue Aug 10 08:16:05 UTC 2010
Hi,
I have the following configuration:
node lnx0047a
node lnx0047b
primitive lnx0101a ocf:heartbeat:KVM \
params name="lnx0101a" \
meta allow-migrate="1" target-role="Started" \
op migrate_from interval="0" timeout="3600s" \
op migrate_to interval="0" timeout="3600s" \
op monitor interval="10s" \
op stop interval="0" timeout="360s"
primitive lnx0102a ocf:heartbeat:KVM \
params name="lnx0102a" \
meta allow-migrate="1" target-role="Started" \
op migrate_from interval="0" timeout="3600s" \
op migrate_to interval="0" timeout="3600s" \
op monitor interval="10s" \
op stop interval="0" timeout="360s"
primitive pingd ocf:pacemaker:pingd \
params host_list="192.168.136.100" multiplier="100" \
op monitor interval="15s" timeout="5s"
primitive sbd_fence stonith:external/sbd \
params sbd_device="/dev/hdisk-4652-38b5" stonith-timeout="60s"
clone fence sbd_fence \
meta target-role="Started"
clone pingdclone pingd \
meta globally-unique="false" target-role="Started"
location lnx0101a_ip lnx0101a \
rule $id="lnx0101a_ip-rule" -inf: not_defined pingd or pingd lte 0
location lnx0102a_ip lnx0102a \
rule $id="lnx0102a_ip-rule" -inf: not_defined pingd or pingd lte 0
property $id="cib-bootstrap-options" \
dc-version="1.1.2-2e096a41a5f9e184a1c1537c82c6da1093698eb5" \
cluster-infrastructure="openais" \
expected-quorum-votes="2" \
stonith-enabled="true" \
stonith-action="reboot" \
no-quorum-policy="ignore" \
default-resource-stickiness="1000" \
last-lrm-refresh="1281364675"
-------------------------------
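As an aside, the two location rules above can be read as a simple predicate on the pingd node attribute. The Python sketch below is only a simplified model of that reading, not Pacemaker's rule engine; the function name location_score and the attribute dictionary are made up for illustration. It assumes pingd publishes (number of reachable hosts x multiplier), so with one address in host_list and multiplier="100" the attribute is either 100 or 0.

```python
from typing import Mapping, Optional

NEG_INFINITY = float("-inf")


def location_score(attrs: Mapping[str, str], attr: str = "pingd") -> float:
    """Model of the rule '-inf: not_defined pingd or pingd lte 0'.

    attrs is a hypothetical snapshot of one node's attributes.
    Returns -inf when the rule matches (node is banned for the resource),
    otherwise 0.0 (the rule contributes nothing).
    """
    raw: Optional[str] = attrs.get(attr)
    if raw is None:        # not_defined pingd
        return NEG_INFINITY
    if int(raw) <= 0:      # pingd lte 0
        return NEG_INFINITY
    return 0.0


if __name__ == "__main__":
    print(location_score({"pingd": "100"}))  # ping target reachable
    print(location_score({"pingd": "0"}))    # ping target unreachable
    print(location_score({}))                # attribute not set yet
```

Under this reading, a node that loses its ping target can no longer run lnx0101a or lnx0102a, independent of whether it is also fenced.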
During a cluster test I disabled the interface on node lnx0047a that pingd is
listening on. lnx0047b then shows "Node lnx0047a: UNCLEAN (offline)" and the
stonith command is executed:
/var/log/messages:
...
Aug 9 16:25:05 lnx0047b pengine: [22211]: WARN: pe_fence_node: Node
lnx0047a will be fenced because it is un-expectedly down
...
Aug 9 16:25:05 lnx0047b pengine: [22211]: WARN: custom_action: Action
lnx0102a_stop_0 on lnx0047a is unrunnable (offline)
Aug 9 16:25:05 lnx0047b pengine: [22211]: WARN: custom_action: Marking
node lnx0047a unclean
Aug 9 16:25:05 lnx0047b pengine: [22211]: notice: RecurringOp: Start
recurring monitor (10s) for lnx0102a on lnx0047b
Aug 9 16:25:05 lnx0047b pengine: [22211]: WARN: custom_action: Action
pingd:0_stop_0 on lnx0047a is unrunnable (offline)
Aug 9 16:25:05 lnx0047b pengine: [22211]: WARN: custom_action: Marking
node lnx0047a unclean
Aug 9 16:25:05 lnx0047b pengine: [22211]: WARN: custom_action: Action
sbd_fence:0_stop_0 on lnx0047a is unrunnable (offline)
Aug 9 16:25:05 lnx0047b pengine: [22211]: WARN: custom_action: Marking
node lnx0047a unclean
Aug 9 16:25:05 lnx0047b pengine: [22211]: WARN: stage6: Scheduling Node
lnx0047a for STONITH
Aug 9 16:25:05 lnx0047b pengine: [22211]: info: native_stop_constraints:
lnx0102a_stop_0 is implicit after lnx0047a is fenced
Aug 9 16:25:05 lnx0047b pengine: [22211]: info: native_stop_constraints:
pingd:0_stop_0 is implicit after lnx0047a is fenced
....
Aug 9 16:25:26 lnx0047b stonith-ng: [22207]: info:
initiate_remote_stonith_op: Initiating remote operation reboot for
lnx0047a: ee3d0c69-067a-423b-88bc-6d661a2b3254
Aug 9 16:25:26 lnx0047b stonith-ng: [22207]: info: log_data_element:
stonith_query: Query <stonith_command t="stonith-ng"
st_async_id="ee3d0c69-067a-423b-88bc-6d661a2b3254" st_op="st_query"
st_callid="0" st_callopt="0"
st_remote_op="ee3d0c69-067a-423b-88bc-6d661a2b3254" st_target="lnx0047a"
st_device_action="reboot"
st_clientid="eba960fb-ef44-4ffb-a017-d5e01177b4ec" src="lnx0047b" seq="32"
/>
Aug 9 16:25:26 lnx0047b stonith-ng: [22207]: info:
can_fence_host_with_device: sbd_fence:1 can fence lnx0047a: dynamic-list
Aug 9 16:25:26 lnx0047b stonith-ng: [22207]: info: stonith_query: Found 1
matching devices for 'lnx0047a'
Aug 9 16:25:26 lnx0047b stonith-ng: [22207]: info: stonith_command:
Processed st_query from lnx0047b: rc=1
Aug 9 16:25:26 lnx0047b stonith-ng: [22207]: info: call_remote_stonith:
Requesting that lnx0047b perform op reboot lnx0047a
Aug 9 16:25:26 lnx0047b stonith-ng: [22207]: info: log_data_element:
stonith_fence: Exec <stonith_command t="stonith-ng"
st_async_id="ee3d0c69-067a-423b-88bc-6d661a2b3254" st_op="st_fence"
st_callid="0" st_callopt="0"
st_remote_op="ee3d0c69-067a-423b-88bc-6d661a2b3254" st_target="lnx0047a"
st_device_action="reboot" src="lnx0047b" seq="34" />
Aug 9 16:25:26 lnx0047b stonith-ng: [22207]: info:
can_fence_host_with_device: sbd_fence:1 can fence lnx0047a: dynamic-list
Aug 9 16:25:26 lnx0047b stonith-ng: [22207]: info: stonith_fence: Found 1
matching devices for 'lnx0047a'
Aug 9 16:25:26 lnx0047b pengine: [22211]: WARN: process_pe_message:
Transition 6: WARNINGs found during PE processing. PEngine Input stored
in: /var/lib/pengine/pe-warn-102.bz2
Aug 9 16:25:26 lnx0047b pengine: [22211]: info: process_pe_message:
Configuration WARNINGs found during PE processing. Please run "crm_verify
-L" to identify issues.
Aug 9 16:25:26 lnx0047b sbd: [23278]: info: reset successfully delivered
to lnx0047a
Aug 9 16:25:27 lnx0047b sbd: [23845]: info: lnx0047a owns slot 1
Aug 9 16:25:27 lnx0047b sbd: [23845]: info: Writing reset to node slot
lnx0047a
....
-------
ps -eaf:
...
root 24002 24001 0 16:25 ? 00:00:00 stonith -t external/sbd
sbd_device /dev/hdisk-4652-38b5 stonith-timeout 60s nodename lnx0047a -T
reset lnx0047a
root 24007 24002 0 16:25 ? 00:00:00 /bin/bash
/usr/lib64/stonith/plugins/external/sbd reset lnx0047a
root 24035 22192 0 16:25 ? 00:00:00
/usr/lib64/heartbeat/stonithd
...
lnx0047a reboots successfully, but while it is starting up, several stonith
commands are being executed on the online cluster node:
$ ps -eaf|grep ston
root 22207 22192 0 16:15 ? 00:00:00
/usr/lib64/heartbeat/stonithd
root 23272 23271 0 16:25 ? 00:00:00 stonith -t external/sbd
sbd_device /dev/hdisk-4652-38b5 stonith-timeout 60s nodename lnx0047a -T
reset lnx0047a
root 23277 23272 0 16:25 ? 00:00:00 /bin/bash
/usr/lib64/stonith/plugins/external/sbd reset lnx0047a
root 23340 23339 0 16:26 ? 00:00:00 stonith -t external/sbd
sbd_device /dev/hdisk-4652-38b5 stonith-timeout 60s nodename lnx0047a -T
reset lnx0047a
root 23345 23340 0 16:26 ? 00:00:00 /bin/bash
/usr/lib64/stonith/plugins/external/sbd reset lnx0047a
root 23438 23437 0 16:26 ? 00:00:00 stonith -t external/sbd
sbd_device /dev/hdisk-4652-38b5 stonith-timeout 60s nodename lnx0047a -T
reset lnx0047a
root 23443 23438 0 16:26 ? 00:00:00 /bin/bash
/usr/lib64/stonith/plugins/external/sbd reset lnx0047a
After lnx0047a is up again it gets stonithed automatically by lnx0047b,
although the cluster isn't up and running (autostart watchdog).
-----------------
So I'm unable to start lnx0047a until I manually kill all the stonith
processes on lnx0047b.
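To see which reset requests are still pending before killing them by hand, the ps listing can be filtered for the stonith wrapper processes. This is a minimal Python sketch, assuming the "ps -eaf" column layout shown above (UID PID PPID C STIME TTY TIME CMD) with each process on one unwrapped line; pending_resets is a hypothetical helper, and the sample text is copied from the listing in this mail.

```python
from typing import List

# Sample taken from the ps output above, with the mail's line wrapping undone.
PS_SAMPLE = """\
root 22207 22192 0 16:15 ? 00:00:00 /usr/lib64/heartbeat/stonithd
root 23272 23271 0 16:25 ? 00:00:00 stonith -t external/sbd sbd_device /dev/hdisk-4652-38b5 stonith-timeout 60s nodename lnx0047a -T reset lnx0047a
root 23277 23272 0 16:25 ? 00:00:00 /bin/bash /usr/lib64/stonith/plugins/external/sbd reset lnx0047a
root 23340 23339 0 16:26 ? 00:00:00 stonith -t external/sbd sbd_device /dev/hdisk-4652-38b5 stonith-timeout 60s nodename lnx0047a -T reset lnx0047a
root 23345 23340 0 16:26 ? 00:00:00 /bin/bash /usr/lib64/stonith/plugins/external/sbd reset lnx0047a
root 23438 23437 0 16:26 ? 00:00:00 stonith -t external/sbd sbd_device /dev/hdisk-4652-38b5 stonith-timeout 60s nodename lnx0047a -T reset lnx0047a
root 23443 23438 0 16:26 ? 00:00:00 /bin/bash /usr/lib64/stonith/plugins/external/sbd reset lnx0047a
"""


def pending_resets(ps_text: str, target: str) -> List[int]:
    """Return PIDs of 'stonith -t external/sbd ... -T reset <target>' processes."""
    pids: List[int] = []
    for line in ps_text.splitlines():
        fields = line.split(None, 7)  # 7 fixed columns, the rest is CMD
        if len(fields) < 8 or not fields[1].isdigit():
            continue  # header line or malformed row
        cmd = fields[7].strip()
        if cmd.startswith("stonith -t external/sbd") and cmd.endswith(f"-T reset {target}"):
            pids.append(int(fields[1]))
    return pids


if __name__ == "__main__":
    print(pending_resets(PS_SAMPLE, "lnx0047a"))  # [23272, 23340, 23438]
```

The /bin/bash children are deliberately not listed; they are the plugin scripts started by the wrapper processes whose PIDs are returned.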
During the reboot cycle on lnx0047a the resources aren't able to start on
lnx0047b:
$ crm_verify -LV
crm_verify[27816]: 2010/08/09_16:25:41 WARN: pe_fence_node: Node lnx0047a
will be fenced because it is un-expectedly down
crm_verify[27816]: 2010/08/09_16:25:41 WARN: determine_online_status: Node
lnx0047a is unclean
crm_verify[27816]: 2010/08/09_16:25:41 WARN: custom_action: Action
lnx0101a_stop_0 on lnx0047a is unrunnable (offline)
crm_verify[27816]: 2010/08/09_16:25:41 WARN: custom_action: Marking node
lnx0047a unclean
crm_verify[27816]: 2010/08/09_16:25:41 WARN: custom_action: Action
lnx0102a_stop_0 on lnx0047a is unrunnable (offline)
crm_verify[27816]: 2010/08/09_16:25:41 WARN: custom_action: Marking node
lnx0047a unclean
crm_verify[27816]: 2010/08/09_16:25:41 WARN: custom_action: Action
pingd:0_stop_0 on lnx0047a is unrunnable (offline)
crm_verify[27816]: 2010/08/09_16:25:41 WARN: custom_action: Marking node
lnx0047a unclean
crm_verify[27816]: 2010/08/09_16:25:41 WARN: custom_action: Action
sbd_fence:1_stop_0 on lnx0047a is unrunnable (offline)
crm_verify[27816]: 2010/08/09_16:25:41 WARN: custom_action: Marking node
lnx0047a unclean
crm_verify[27816]: 2010/08/09_16:25:41 WARN: stage6: Scheduling Node
lnx0047a for STONITH
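The wrapped crm_verify output above can be condensed to show which actions are reported as unrunnable and on which node. The following Python sketch is an illustration only; unrunnable_by_node is a hypothetical helper, and the regular expression is derived from the message wording in this mail, not from any documented log format.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Pattern taken from the wording of the warnings quoted above (an assumption).
UNRUNNABLE = re.compile(r"Action (\S+) on (\S+) is unrunnable \((\w+)\)")

SAMPLE = """\
crm_verify[27816]: 2010/08/09_16:25:41 WARN: custom_action: Action
lnx0101a_stop_0 on lnx0047a is unrunnable (offline)
crm_verify[27816]: 2010/08/09_16:25:41 WARN: custom_action: Action
lnx0102a_stop_0 on lnx0047a is unrunnable (offline)
crm_verify[27816]: 2010/08/09_16:25:41 WARN: custom_action: Action
pingd:0_stop_0 on lnx0047a is unrunnable (offline)
crm_verify[27816]: 2010/08/09_16:25:41 WARN: custom_action: Action
sbd_fence:1_stop_0 on lnx0047a is unrunnable (offline)
"""


def unrunnable_by_node(text: str) -> Dict[str, List[str]]:
    """Group unrunnable actions by node, tolerating mail-style line wrapping."""
    flat = " ".join(text.split())  # undo the wrapping before matching
    grouped: Dict[str, List[str]] = defaultdict(list)
    for action, node, _reason in UNRUNNABLE.findall(flat):
        grouped[node].append(action)
    return dict(grouped)


if __name__ == "__main__":
    for node, actions in unrunnable_by_node(SAMPLE).items():
        print(node, actions)
```

For the output quoted here, every match is a stop action on lnx0047a. Together with the "is implicit after lnx0047a is fenced" lines in the syslog excerpt, this suggests the warnings follow from the pending fencing of that node, although nothing in this mail confirms that.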
###############
Any ideas on the stonith problem?
Any ideas on the "unrunnable" problem?
Regards