[Pacemaker] stonith sbd problem

Tue Aug 10 10:56:17 UTC 2010

Hi,

On Tue, Aug 10, 2010 at 10:16:05AM +0200, philipp.achmueller at arz.at wrote:
> hi,
> 
> following configuration:
> 
> node lnx0047a
> node lnx0047b
> primitive lnx0101a ocf:heartbeat:KVM \
>         params name="lnx0101a" \
>         meta allow-migrate="1" target-role="Started" \
>         op migrate_from interval="0" timeout="3600s" \
>         op migrate_to interval="0" timeout="3600s" \
>         op monitor interval="10s" \
>         op stop interval="0" timeout="360s"
> primitive lnx0102a ocf:heartbeat:KVM \
>         params name="lnx0102a" \
>         meta allow-migrate="1" target-role="Started" \
>         op migrate_from interval="0" timeout="3600s" \
>         op migrate_to interval="0" timeout="3600s" \
>         op monitor interval="10s" \
>         op stop interval="0" timeout="360s"
> primitive pingd ocf:pacemaker:pingd \
>         params host_list="192.168.136.100" multiplier="100" \
>         op monitor interval="15s" timeout="5s"
> primitive sbd_fence stonith:external/sbd \
>         params sbd_device="/dev/hdisk-4652-38b5" stonith-timeout="60s"
> clone fence sbd_fence \
>         meta target-role="Started"

You shouldn't run sbd as a clone.
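
A single instance of the resource should be enough. As a minimal sketch,
reusing only the names and parameters from your configuration and not
tested here, that would mean keeping the primitive as it is and dropping
the "clone fence sbd_fence" stanza:

```
primitive sbd_fence stonith:external/sbd \
        params sbd_device="/dev/hdisk-4652-38b5" stonith-timeout="60s"
```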

> clone pingdclone pingd \
>         meta globally-unique="false" target-role="Started"
> location lnx0101a_ip lnx0101a \
>         rule $id="lnx0101a_ip-rule" -inf: not_defined pingd or pingd lte 0
> location lnx0102a_ip lnx0102a \
>         rule $id="lnx0102a_ip-rule" -inf: not_defined pingd or pingd lte 0
> property $id="cib-bootstrap-options" \
>         dc-version="1.1.2-2e096a41a5f9e184a1c1537c82c6da1093698eb5" \
>         cluster-infrastructure="openais" \
>         expected-quorum-votes="2" \
>         stonith-enabled="true" \
>         stonith-action="reboot" \
>         no-quorum-policy="ignore" \
>         default-resource-stickiness="1000" \
>         last-lrm-refresh="1281364675"
> 
> -------------------------------
> During a cluster test I disabled the interface that pingd is listening on 
> on node lnx0047a. I get "Node lnx0047a: UNCLEAN (offline)" on lnx0047b, and 
> the stonith command is executed:
> 
> /var/log/messages:
> ...
> Aug  9 16:25:05 lnx0047b pengine: [22211]: WARN: pe_fence_node: Node 
> lnx0047a will be fenced because it is un-expectedly down
> ...
> Aug  9 16:25:05 lnx0047b pengine: [22211]: WARN: custom_action: Action 
> lnx0102a_stop_0 on lnx0047a is unrunnable (offline)
> Aug  9 16:25:05 lnx0047b pengine: [22211]: WARN: custom_action: Marking 
> node lnx0047a unclean
> Aug  9 16:25:05 lnx0047b pengine: [22211]: notice: RecurringOp:  Start 
> recurring monitor (10s) for lnx0102a on lnx0047b
> Aug  9 16:25:05 lnx0047b pengine: [22211]: WARN: custom_action: Action 
> pingd:0_stop_0 on lnx0047a is unrunnable (offline)
> Aug  9 16:25:05 lnx0047b pengine: [22211]: WARN: custom_action: Marking 
> node lnx0047a unclean
> Aug  9 16:25:05 lnx0047b pengine: [22211]: WARN: custom_action: Action 
> sbd_fence:0_stop_0 on lnx0047a is unrunnable (offline)
> Aug  9 16:25:05 lnx0047b pengine: [22211]: WARN: custom_action: Marking 
> node lnx0047a unclean
> Aug  9 16:25:05 lnx0047b pengine: [22211]: WARN: stage6: Scheduling Node 
> lnx0047a for STONITH
> Aug  9 16:25:05 lnx0047b pengine: [22211]: info: native_stop_constraints: 
> lnx0102a_stop_0 is implicit after lnx0047a is fenced
> Aug  9 16:25:05 lnx0047b pengine: [22211]: info: native_stop_constraints: 
> pingd:0_stop_0 is implicit after lnx0047a is fenced
> ....
> Aug  9 16:25:26 lnx0047b stonith-ng: [22207]: info: 
> initiate_remote_stonith_op: Initiating remote operation reboot for 
> lnx0047a: ee3d0c69-067a-423b-88bc-6d661a2b3254
> Aug  9 16:25:26 lnx0047b stonith-ng: [22207]: info: log_data_element: 
> stonith_query: Query <stonith_command t="stonith-ng" 
> st_async_id="ee3d0c69-067a-423b-88bc-6d661a2b3254" st_op="st_query" 
> st_callid="0" st_callopt="0" 
> st_remote_op="ee3d0c69-067a-423b-88bc-6d661a2b3254" st_target="lnx0047a" 
> st_device_action="reboot" 
> st_clientid="eba960fb-ef44-4ffb-a017-d5e01177b4ec" src="lnx0047b" seq="32" 
> />
> Aug  9 16:25:26 lnx0047b stonith-ng: [22207]: info: 
> can_fence_host_with_device: sbd_fence:1 can fence lnx0047a: dynamic-list
> Aug  9 16:25:26 lnx0047b stonith-ng: [22207]: info: stonith_query: Found 1 
> matching devices for 'lnx0047a'
> Aug  9 16:25:26 lnx0047b stonith-ng: [22207]: info: stonith_command: 
> Processed st_query from lnx0047b: rc=1
> Aug  9 16:25:26 lnx0047b stonith-ng: [22207]: info: call_remote_stonith: 
> Requesting that lnx0047b perform op reboot lnx0047a
> Aug  9 16:25:26 lnx0047b stonith-ng: [22207]: info: log_data_element: 
> stonith_fence: Exec <stonith_command t="stonith-ng" 
> st_async_id="ee3d0c69-067a-423b-88bc-6d661a2b3254" st_op="st_fence" 
> st_callid="0" st_callopt="0" 
> st_remote_op="ee3d0c69-067a-423b-88bc-6d661a2b3254" st_target="lnx0047a" 
> st_device_action="reboot" src="lnx0047b" seq="34" />
> Aug  9 16:25:26 lnx0047b stonith-ng: [22207]: info: 
> can_fence_host_with_device: sbd_fence:1 can fence lnx0047a: dynamic-list
> Aug  9 16:25:26 lnx0047b stonith-ng: [22207]: info: stonith_fence: Found 1 
> matching devices for 'lnx0047a'
> Aug  9 16:25:26 lnx0047b pengine: [22211]: WARN: process_pe_message: 
> Transition 6: WARNINGs found during PE processing. PEngine Input stored 
> in: /var/lib/pengine/pe-warn-102.bz2
> Aug  9 16:25:26 lnx0047b pengine: [22211]: info: process_pe_message: 
> Configuration WARNINGs found during PE processing.  Please run "crm_verify 
> -L" to identify issues.
> Aug  9 16:25:26 lnx0047b sbd: [23278]: info: reset successfully delivered 
> to lnx0047a
> Aug  9 16:25:27 lnx0047b sbd: [23845]: info: lnx0047a owns slot 1
> Aug  9 16:25:27 lnx0047b sbd: [23845]: info: Writing reset to node slot 
> lnx0047a
> ....
> -------
> ps -eaf:
> ...
> root     24002 24001  0 16:25 ?        00:00:00 stonith -t external/sbd 
> sbd_device /dev/hdisk-4652-38b5 stonith-timeout 60s nodename lnx0047a -T 
> reset lnx0047a
> root     24007 24002  0 16:25 ?        00:00:00 /bin/bash 
> /usr/lib64/stonith/plugins/external/sbd reset lnx0047a
> root     24035 22192  0 16:25 ?        00:00:00 
> /usr/lib64/heartbeat/stonithd
> ...

So far it looks normal.

> lnx0047a reboots successfully, but while lnx0047a is starting up, several 
> stonith commands are being executed on the online cluster node:
> 
> $ ps -eaf|grep ston
> root     22207 22192  0 16:15 ?        00:00:00 
> /usr/lib64/heartbeat/stonithd
> root     23272 23271  0 16:25 ?        00:00:00 stonith -t external/sbd 
> sbd_device /dev/hdisk-4652-38b5 stonith-timeout 60s nodename lnx0047a -T 
> reset lnx0047a
> root     23277 23272  0 16:25 ?        00:00:00 /bin/bash 
> /usr/lib64/stonith/plugins/external/sbd reset lnx0047a
> root     23340 23339  0 16:26 ?        00:00:00 stonith -t external/sbd 
> sbd_device /dev/hdisk-4652-38b5 stonith-timeout 60s nodename lnx0047a -T 
> reset lnx0047a
> root     23345 23340  0 16:26 ?        00:00:00 /bin/bash 
> /usr/lib64/stonith/plugins/external/sbd reset lnx0047a
> root     23438 23437  0 16:26 ?        00:00:00 stonith -t external/sbd 
> sbd_device /dev/hdisk-4652-38b5 stonith-timeout 60s nodename lnx0047a -T 
> reset lnx0047a
> root     23443 23438  0 16:26 ?        00:00:00 /bin/bash 
> /usr/lib64/stonith/plugins/external/sbd reset lnx0047a

This looks strange.

> After lnx0047a is up again it gets stonithed automatically by lnx0047b, 
> although the cluster isn't up and running (autostart watchdog).
> 
> -----------------
> So I'm unable to start lnx0047a until I manually kill all the stonith 
> processes on lnx0047b. 
> 
> During the reboot cycle on lnx0047a the resources aren't able to start on 
> lnx0047b:
> 
> $ crm_verify -LV
> crm_verify[27816]: 2010/08/09_16:25:41 WARN: pe_fence_node: Node lnx0047a 
> will be fenced because it is un-expectedly down
> crm_verify[27816]: 2010/08/09_16:25:41 WARN: determine_online_status: Node 
> lnx0047a is unclean
> crm_verify[27816]: 2010/08/09_16:25:41 WARN: custom_action: Action 
> lnx0101a_stop_0 on lnx0047a is unrunnable (offline)
> crm_verify[27816]: 2010/08/09_16:25:41 WARN: custom_action: Marking node 
> lnx0047a unclean
> crm_verify[27816]: 2010/08/09_16:25:41 WARN: custom_action: Action 
> lnx0102a_stop_0 on lnx0047a is unrunnable (offline)
> crm_verify[27816]: 2010/08/09_16:25:41 WARN: custom_action: Marking node 
> lnx0047a unclean
> crm_verify[27816]: 2010/08/09_16:25:41 WARN: custom_action: Action 
> pingd:0_stop_0 on lnx0047a is unrunnable (offline)
> crm_verify[27816]: 2010/08/09_16:25:41 WARN: custom_action: Marking node 
> lnx0047a unclean
> crm_verify[27816]: 2010/08/09_16:25:41 WARN: custom_action: Action 
> sbd_fence:1_stop_0 on lnx0047a is unrunnable (offline)
> crm_verify[27816]: 2010/08/09_16:25:41 WARN: custom_action: Marking node 
> lnx0047a unclean
> crm_verify[27816]: 2010/08/09_16:25:41 WARN: stage6: Scheduling Node 
> lnx0047a for STONITH
> 
> ###############
> any ideas on the stonith problem?

We'd need full logs. Could you please open a bugzilla entry and attach
a report generated by hb_report for the incident?
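
Something along these lines should do. The time window is only a guess
based on the timestamps in your logs, and the destination name is a
placeholder, so adjust both:

```
# hb_report -f "2010/8/9 16:00" -t "2010/8/9 17:00" /tmp/sbd-incident
```

The resulting tarball is what should go into the bug report.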

> any ideas on the "unrunnable" problem?

That's expected: one can't run operations on a node which is
offline.

Thanks,

Dejan

> regards