[Pacemaker] Antwort: Re: stonith sbd problem
Dejan Muhamedagic
dejanmm at fastmail.fm
Wed Aug 11 13:01:56 UTC 2010
Hi,
On Wed, Aug 11, 2010 at 11:48:17AM +0200, philipp.achmueller at arz.at wrote:
> I removed the clone and set the global cluster property for stonith-timeout.
>
> The nodes need about 3-5 minutes to start up after they get "shot".
>
> I did some more tests and found that if the node which runs the
> sbd_fence resource gets "shot", the remaining node sees the stonith
> resource online on both nodes (although one of the cluster nodes has
> been stonithed).
You meant to say "is going to be stonithed"? Anyway, this looks
like a bug. A minor one if it doesn't influence the fencing
action. Please file a bugzilla for this and attach hb_report.
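For reference, an hb_report covering the test window can be collected roughly like this (the timestamps and destination path are placeholders to adapt to your own test period):

```shell
# Collect logs, CIB, and PE inputs from the cluster nodes for the period
# around the fencing attempt; -f/-t bracket the time window of interest.
hb_report -f "2010-08-11 11:20" -t "2010-08-11 11:30" /tmp/sbd-fence-report

# The resulting tarball (/tmp/sbd-fence-report.tar.bz2) is what gets
# attached to the bugzilla entry.
```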
> crm_mon:
> sbd_fence (stonith:external/sbd): Started [ lnx0047a lnx0047b ]
>
> looking through /var/log/messages:
>
> Aug 11 11:24:25 lnx0047a pengine: [20618]: info: determine_online_status:
> Node lnx0047a is online
> Aug 11 11:24:25 lnx0047a pengine: [20618]: WARN: pe_fence_node: Node
> lnx0047b will be fenced because it is un-expectedly down
> Aug 11 11:24:25 lnx0047a pengine: [20618]: info:
> determine_online_status_fencing: ha_state=active, ccm_state=false,
> crm_state=online, join_state=pending, expected=member
> Aug 11 11:24:25 lnx0047a pengine: [20618]: WARN: determine_online_status:
> Node lnx0047b is unclean
> Aug 11 11:24:25 lnx0047a pengine: [20618]: ERROR: native_add_running:
> Resource stonith::external/sbd:sbd_fence appears to be active on 2 nodes
> ...
> Aug 11 11:24:26 lnx0047a sbd: [22315]: info: lnx0047b owns slot 0
> Aug 11 11:24:26 lnx0047a sbd: [22315]: info: Writing reset to node slot
> lnx0047b
> Aug 11 11:24:26 lnx0047a sbd: [22318]: info: lnx0047b owns slot 0
> Aug 11 11:24:26 lnx0047a sbd: [22318]: info: Writing reset to node slot
> lnx0047b
Was the node fenced at this point? If not, are you sure that sbd
was functional?
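One way to sanity-check that sbd can reach the shared device is along these lines (the device path is a placeholder for your actual sbd disk):

```shell
# Show the node slots and any pending messages on the sbd device;
# both cluster nodes should own a slot and show "clear" when idle.
sbd -d /dev/mapper/sbd_device list

# Write a harmless test message to the peer's slot; if the peer's sbd
# daemon is watching the device, it should log receipt of the message.
sbd -d /dev/mapper/sbd_device message lnx0047b test
```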
> Aug 11 11:24:28 lnx0047a stonith-ng: [20614]: ERROR:
> remote_op_query_timeout: Query 37724c6f-191f-407f-ad24-68028d2b6573 for
> lnx0047b timed out
> Aug 11 11:24:28 lnx0047a stonith-ng: [20614]: ERROR: remote_op_timeout:
> Action reboot (37724c6f-191f-407f-ad24-68028d2b6573) for lnx0047b timed
> out
> Aug 11 11:24:28 lnx0047a stonith-ng: [20614]: info: remote_op_done:
> Notifing clients of 37724c6f-191f-407f-ad24-68028d2b6573 (reboot of
> lnx0047b from 11ea7c1e-6034-48e1-b616-a10c92e53a1d by (null)):
> 0, rc=-7
> Aug 11 11:24:28 lnx0047a crmd: [20619]: info: log_data_element:
> tengine_stonith_callback: StonithOp <remote-op state="0"
> st_target="lnx0047b" st_op="reboot" />
> Aug 11 11:24:28 lnx0047a stonith-ng: [20614]: info: stonith_notify_client:
> Sending st_fence-notification to client
> 20619/15310d8c-6640-4799-8655-10d125b467bd
> Aug 11 11:24:28 lnx0047a crmd: [20619]: info: tengine_stonith_callback:
> Stonith operation 75/17:74:0:40ea951f-0c79-43af-8adb-adf8d6defe63:
> Operation timed out (-7)
This timeout seems to be just a few seconds; do you know why?
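Given the 3-5 minute reboot time mentioned above, the stonith-timeout would need to comfortably exceed the time sbd takes to confirm the fence. In crmsh syntax that would look roughly like this (the value is illustrative, not a recommendation):

```shell
# Allow the fencing operation ample time to complete before it is
# declared failed; must cover the sbd msgwait interval with headroom.
crm configure property stonith-timeout=300s
```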
> Aug 11 11:24:28 lnx0047a crmd: [20619]: ERROR: tengine_stonith_callback:
> Stonith of lnx0047b failed (-7)... aborting transition.
> Aug 11 11:24:28 lnx0047a crmd: [20619]: info: abort_transition_graph:
> tengine_stonith_callback:402 - Triggered transition abort (complete=0) :
> Stonith failed
> Aug 11 11:24:28 lnx0047a crmd: [20619]: info: update_abort_priority: Abort
> priority upgraded from 0 to 1000000
> Aug 11 11:24:28 lnx0047a crmd: [20619]: info: update_abort_priority: Abort
> action done superceeded by restart
> Aug 11 11:24:28 lnx0047a crmd: [20619]: info: tengine_stonith_notify: Peer
> lnx0047b was terminated (reboot) by (null) for lnx0047a
> (ref=37724c6f-191f-407f-ad24-68028d2b6573): Operation timed out
> Aug 11 11:24:28 lnx0047a crmd: [20619]: info: run_graph:
> ====================================================
> Aug 11 11:24:28 lnx0047a crmd: [20619]: notice: run_graph: Transition 74
> (Complete=5, Pending=0, Fired=0, Skipped=5, Incomplete=1,
> Source=/var/lib/pengine/pe-error-942.bz2): Stopped
> ...
>
> These entries repeat indefinitely until I manually stop/start the
> sbd_fence resource.
What happened when you did that?
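For the record, the manual stop/start of the resource would be done roughly like this in crmsh (shown here as a sketch of the step being discussed):

```shell
# Stop the sbd fencing resource, then start it again so stonith-ng
# re-registers the device on the surviving node.
crm resource stop sbd_fence
crm resource start sbd_fence
```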
> ------------
> Still not sure why resource lnx0101a will not start on the remaining node...
According to the logs above, the node reboot action failed, which
may be an explanation.
Thanks,
Dejan
> ----------------
> Disclaimer:
> This message is only for informational purposes and is intended solely for
> the use of the addressee.
> ----------------
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker