[Pacemaker] stonith sbd problem
Dejan Muhamedagic
dejanmm at fastmail.fm
Tue Aug 10 10:56:17 UTC 2010
Hi,
On Tue, Aug 10, 2010 at 10:16:05AM +0200, philipp.achmueller at arz.at wrote:
> hi,
>
> following configuration:
>
> node lnx0047a
> node lnx0047b
> primitive lnx0101a ocf:heartbeat:KVM \
> params name="lnx0101a" \
> meta allow-migrate="1" target-role="Started" \
> op migrate_from interval="0" timeout="3600s" \
> op migrate_to interval="0" timeout="3600s" \
> op monitor interval="10s" \
> op stop interval="0" timeout="360s"
> primitive lnx0102a ocf:heartbeat:KVM \
> params name="lnx0102a" \
> meta allow-migrate="1" target-role="Started" \
> op migrate_from interval="0" timeout="3600s" \
> op migrate_to interval="0" timeout="3600s" \
> op monitor interval="10s" \
> op stop interval="0" timeout="360s"
> primitive pingd ocf:pacemaker:pingd \
> params host_list="192.168.136.100" multiplier="100" \
> op monitor interval="15s" timeout="5s"
> primitive sbd_fence stonith:external/sbd \
> params sbd_device="/dev/hdisk-4652-38b5" stonith-timeout="60s"
> clone fence sbd_fence \
> meta target-role="Started"
You shouldn't run sbd as a clone.
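One instance of the stonith resource is enough for a disk-based agent like
sbd. A minimal sketch of the non-cloned variant, reusing the device path and
timeout from your config (whether stonith-timeout belongs in the params or in
the cluster properties is an assumption here, check the docs for your
version):

  primitive sbd_fence stonith:external/sbd \
          params sbd_device="/dev/hdisk-4652-38b5" stonith-timeout="60s"

Then simply drop the "clone fence sbd_fence" definition above.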
> clone pingdclone pingd \
> meta globally-unique="false" target-role="Started"
> location lnx0101a_ip lnx0101a \
> rule $id="lnx0101a_ip-rule" -inf: not_defined pingd or pingd lte 0
> location lnx0102a_ip lnx0102a \
> rule $id="lnx0102a_ip-rule" -inf: not_defined pingd or pingd lte 0
> property $id="cib-bootstrap-options" \
> dc-version="1.1.2-2e096a41a5f9e184a1c1537c82c6da1093698eb5" \
> cluster-infrastructure="openais" \
> expected-quorum-votes="2" \
> stonith-enabled="true" \
> stonith-action="reboot" \
> no-quorum-policy="ignore" \
> default-resource-stickiness="1000" \
> last-lrm-refresh="1281364675"
>
> -------------------------------
> during a cluster test I disabled the interface that pingd listens on, on
> node lnx0047a. I get "Node lnx0047a: UNCLEAN (offline)" on lnx0047b, and the
> stonith command is executed:
>
> /var/log/messages:
> ...
> Aug 9 16:25:05 lnx0047b pengine: [22211]: WARN: pe_fence_node: Node
> lnx0047a will be fenced because it is un-expectedly down
> ...
> Aug 9 16:25:05 lnx0047b pengine: [22211]: WARN: custom_action: Action
> lnx0102a_stop_0 on lnx0047a is unrunnable (offline)
> Aug 9 16:25:05 lnx0047b pengine: [22211]: WARN: custom_action: Marking
> node lnx0047a unclean
> Aug 9 16:25:05 lnx0047b pengine: [22211]: notice: RecurringOp: Start
> recurring monitor (10s) for lnx0102a on lnx0047b
> Aug 9 16:25:05 lnx0047b pengine: [22211]: WARN: custom_action: Action
> pingd:0_stop_0 on lnx0047a is unrunnable (offline)
> Aug 9 16:25:05 lnx0047b pengine: [22211]: WARN: custom_action: Marking
> node lnx0047a unclean
> Aug 9 16:25:05 lnx0047b pengine: [22211]: WARN: custom_action: Action
> sbd_fence:0_stop_0 on lnx0047a is unrunnable (offline)
> Aug 9 16:25:05 lnx0047b pengine: [22211]: WARN: custom_action: Marking
> node lnx0047a unclean
> Aug 9 16:25:05 lnx0047b pengine: [22211]: WARN: stage6: Scheduling Node
> lnx0047a for STONITH
> Aug 9 16:25:05 lnx0047b pengine: [22211]: info: native_stop_constraints:
> lnx0102a_stop_0 is implicit after lnx0047a is fenced
> Aug 9 16:25:05 lnx0047b pengine: [22211]: info: native_stop_constraints:
> pingd:0_stop_0 is implicit after lnx0047a is fenced
> ....
> Aug 9 16:25:26 lnx0047b stonith-ng: [22207]: info:
> initiate_remote_stonith_op: Initiating remote operation reboot for
> lnx0047a: ee3d0c69-067a-423b-88bc-6d661a2b3254
> Aug 9 16:25:26 lnx0047b stonith-ng: [22207]: info: log_data_element:
> stonith_query: Query <stonith_command t="stonith-ng"
> st_async_id="ee3d0c69-067a-423b-88bc-6d661a2b3254" st_op="st_query"
> st_callid="0" st_callopt="0"
> st_remote_op="ee3d0c69-067a-423b-88bc-6d661a2b3254" st_target="lnx0047a"
> st_device_action="reboot"
> st_clientid="eba960fb-ef44-4ffb-a017-d5e01177b4ec" src="lnx0047b" seq="32"
> />
> Aug 9 16:25:26 lnx0047b stonith-ng: [22207]: info:
> can_fence_host_with_device: sbd_fence:1 can fence lnx0047a: dynamic-list
> Aug 9 16:25:26 lnx0047b stonith-ng: [22207]: info: stonith_query: Found 1
> matching devices for 'lnx0047a'
> Aug 9 16:25:26 lnx0047b stonith-ng: [22207]: info: stonith_command:
> Processed st_query from lnx0047b: rc=1
> Aug 9 16:25:26 lnx0047b stonith-ng: [22207]: info: call_remote_stonith:
> Requesting that lnx0047b perform op reboot lnx0047a
> Aug 9 16:25:26 lnx0047b stonith-ng: [22207]: info: log_data_element:
> stonith_fence: Exec <stonith_command t="stonith-ng"
> st_async_id="ee3d0c69-067a-423b-88bc-6d661a2b3254" st_op="st_fence"
> st_callid="0" st_callopt="0"
> st_remote_op="ee3d0c69-067a-423b-88bc-6d661a2b3254" st_target="lnx0047a"
> st_device_action="reboot" src="lnx0047b" seq="34" />
> Aug 9 16:25:26 lnx0047b stonith-ng: [22207]: info:
> can_fence_host_with_device: sbd_fence:1 can fence lnx0047a: dynamic-list
> Aug 9 16:25:26 lnx0047b stonith-ng: [22207]: info: stonith_fence: Found 1
> matching devices for 'lnx0047a'
> Aug 9 16:25:26 lnx0047b pengine: [22211]: WARN: process_pe_message:
> Transition 6: WARNINGs found during PE processing. PEngine Input stored
> in: /var/lib/pengine/pe-warn-102.bz2
> Aug 9 16:25:26 lnx0047b pengine: [22211]: info: process_pe_message:
> Configuration WARNINGs found during PE processing. Please run "crm_verify
> -L" to identify issues.
> Aug 9 16:25:26 lnx0047b sbd: [23278]: info: reset successfully delivered
> to lnx0047a
> Aug 9 16:25:27 lnx0047b sbd: [23845]: info: lnx0047a owns slot 1
> Aug 9 16:25:27 lnx0047b sbd: [23845]: info: Writing reset to node slot
> lnx0047a
> ....
> -------
> ps -eaf:
> ...
> root 24002 24001 0 16:25 ? 00:00:00 stonith -t external/sbd
> sbd_device /dev/hdisk-4652-38b5 stonith-timeout 60s nodename lnx0047a -T
> reset lnx0047a
> root 24007 24002 0 16:25 ? 00:00:00 /bin/bash
> /usr/lib64/stonith/plugins/external/sbd reset lnx0047a
> root 24035 22192 0 16:25 ? 00:00:00
> /usr/lib64/heartbeat/stonithd
> ...
So far it looks normal.
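If you want to see exactly what the PE decided for that transition, you can
replay the input it saved (a sketch; I'm assuming ptest on your version reads
the bzip2-compressed file directly):

  # replay the stored transition and show what would be scheduled
  ptest -x /var/lib/pengine/pe-warn-102.bz2 -VV
  # add -s to also print the resource allocation scores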
> lnx0047a reboots successfully, but while lnx0047a is booting up again,
> several stonith commands are executed on the online cluster node:
>
> $ ps -eaf|grep ston
> root 22207 22192 0 16:15 ? 00:00:00
> /usr/lib64/heartbeat/stonithd
> root 23272 23271 0 16:25 ? 00:00:00 stonith -t external/sbd
> sbd_device /dev/hdisk-4652-38b5 stonith-timeout 60s nodename lnx0047a -T
> reset lnx0047a
> root 23277 23272 0 16:25 ? 00:00:00 /bin/bash
> /usr/lib64/stonith/plugins/external/sbd reset lnx0047a
> root 23340 23339 0 16:26 ? 00:00:00 stonith -t external/sbd
> sbd_device /dev/hdisk-4652-38b5 stonith-timeout 60s nodename lnx0047a -T
> reset lnx0047a
> root 23345 23340 0 16:26 ? 00:00:00 /bin/bash
> /usr/lib64/stonith/plugins/external/sbd reset lnx0047a
> root 23438 23437 0 16:26 ? 00:00:00 stonith -t external/sbd
> sbd_device /dev/hdisk-4652-38b5 stonith-timeout 60s nodename lnx0047a -T
> reset lnx0047a
> root 23443 23438 0 16:26 ? 00:00:00 /bin/bash
> /usr/lib64/stonith/plugins/external/sbd reset lnx0047a
This looks strange.
> after lnx0047a is up again it gets stonithed automatically by lnx0047b,
> although the cluster isn't up and running yet (autostart watchdog)
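When a node gets reset again as soon as it comes back, it is worth checking
whether a reset message is still sitting in its slot on the sbd device (and
whether those leftover stonith processes on lnx0047b keep rewriting it). A
sketch, assuming the sbd binary is in the PATH, with the device path from
your config:

  # list the slots and any pending messages on the shared device
  sbd -d /dev/hdisk-4652-38b5 list
  # clear a leftover message for lnx0047a, but only once you are sure
  # no fencing operation is still in flight
  sbd -d /dev/hdisk-4652-38b5 message lnx0047a clear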
>
> -----------------
> so, I'm unable to start lnx0047a until I manually kill all the stonith
> processes on lnx0047b.
>
> during the reboot cycle of lnx0047a the resources aren't able to start on
> lnx0047b:
>
> $ crm_verify -LV
> crm_verify[27816]: 2010/08/09_16:25:41 WARN: pe_fence_node: Node lnx0047a
> will be fenced because it is un-expectedly down
> crm_verify[27816]: 2010/08/09_16:25:41 WARN: determine_online_status: Node
> lnx0047a is unclean
> crm_verify[27816]: 2010/08/09_16:25:41 WARN: custom_action: Action
> lnx0101a_stop_0 on lnx0047a is unrunnable (offline)
> crm_verify[27816]: 2010/08/09_16:25:41 WARN: custom_action: Marking node
> lnx0047a unclean
> crm_verify[27816]: 2010/08/09_16:25:41 WARN: custom_action: Action
> lnx0102a_stop_0 on lnx0047a is unrunnable (offline)
> crm_verify[27816]: 2010/08/09_16:25:41 WARN: custom_action: Marking node
> lnx0047a unclean
> crm_verify[27816]: 2010/08/09_16:25:41 WARN: custom_action: Action
> pingd:0_stop_0 on lnx0047a is unrunnable (offline)
> crm_verify[27816]: 2010/08/09_16:25:41 WARN: custom_action: Marking node
> lnx0047a unclean
> crm_verify[27816]: 2010/08/09_16:25:41 WARN: custom_action: Action
> sbd_fence:1_stop_0 on lnx0047a is unrunnable (offline)
> crm_verify[27816]: 2010/08/09_16:25:41 WARN: custom_action: Marking node
> lnx0047a unclean
> crm_verify[27816]: 2010/08/09_16:25:41 WARN: stage6: Scheduling Node
> lnx0047a for STONITH
>
> ###############
> any ideas on the stonith problem?
We'd need full logs. Can you please open a bugzilla and attach a
report generated by hb_report for the incident?
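Something along these lines should produce a tarball you can attach (the time
window is just an example, and the exact time format may differ; see
hb_report(8)):

  hb_report -f "2010-08-09 16:20" -t "2010-08-09 16:40" /tmp/sbd-incident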
> any ideas on the "unrunnable" problem?
That's expected: one can't run operations on a node which is
offline.
Thanks,
Dejan
> regards
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker