[Pacemaker] Antwort: Re: stonith sbd problem
Dejan Muhamedagic
dejanmm at fastmail.fm
Wed Aug 11 13:01:56 UTC 2010
Hi,
On Wed, Aug 11, 2010 at 11:48:17AM +0200, philipp.achmueller at arz.at wrote:
> I removed the clone and set the global cluster property for stonith-timeout.
>
> The nodes need about 3-5 minutes to start up after they get "shot".
>
> I did some more tests and found that if the node which runs the
> sbd_fence resource gets "shot", the remaining node sees the stonith
> resource online on both nodes (although one of the cluster nodes has
> been stonithed).
You meant to say "is going to be stonithed"? Anyway, this looks
like a bug. A minor one if it doesn't influence the fencing
action. Please file a bugzilla for this and attach hb_report.
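For reference, an hb_report covering the test window can be collected roughly like this (the timestamps and destination path are placeholders to adapt to your own test period):

```shell
# Collect logs, CIB, and PE inputs from the cluster nodes for the period
# around the fencing attempt; -f/-t bracket the time window of interest.
hb_report -f "2010-08-11 11:20" -t "2010-08-11 11:30" /tmp/sbd-fence-report

# The resulting tarball (/tmp/sbd-fence-report.tar.bz2) is what gets
# attached to the bugzilla entry.
```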
> crm_mon:
> sbd_fence (stonith:external/sbd): Started [ lnx0047a lnx0047b ]
>
> looking through /var/log/messages:
>
> Aug 11 11:24:25 lnx0047a pengine: [20618]: info: determine_online_status:
> Node lnx0047a is online
> Aug 11 11:24:25 lnx0047a pengine: [20618]: WARN: pe_fence_node: Node
> lnx0047b will be fenced because it is un-expectedly down
> Aug 11 11:24:25 lnx0047a pengine: [20618]: info:
> determine_online_status_fencing: ha_state=active, ccm_state=false,
> crm_state=online, join_state=pending, expected=member
> Aug 11 11:24:25 lnx0047a pengine: [20618]: WARN: determine_online_status:
> Node lnx0047b is unclean
> Aug 11 11:24:25 lnx0047a pengine: [20618]: ERROR: native_add_running:
> Resource stonith::external/sbd:sbd_fence appears to be active on 2 nodes
> ...
> Aug 11 11:24:26 lnx0047a sbd: [22315]: info: lnx0047b owns slot 0
> Aug 11 11:24:26 lnx0047a sbd: [22315]: info: Writing reset to node slot
> lnx0047b
> Aug 11 11:24:26 lnx0047a sbd: [22318]: info: lnx0047b owns slot 0
> Aug 11 11:24:26 lnx0047a sbd: [22318]: info: Writing reset to node slot
> lnx0047b
Was the node fenced at this point? If not, are you sure that sbd
was functional?
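One way to sanity-check that sbd can reach the shared device is along these lines (the device path is a placeholder for your actual sbd disk):

```shell
# Show the node slots and any pending messages on the sbd device;
# both cluster nodes should own a slot and show "clear" when idle.
sbd -d /dev/mapper/sbd_device list

# Write a harmless test message to the peer's slot; if the peer's sbd
# daemon is watching the device, it should log receipt of the message.
sbd -d /dev/mapper/sbd_device message lnx0047b test
```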
> Aug 11 11:24:28 lnx0047a stonith-ng: [20614]: ERROR:
> remote_op_query_timeout: Query 37724c6f-191f-407f-ad24-68028d2b6573 for
> lnx0047b timed out
> Aug 11 11:24:28 lnx0047a stonith-ng: [20614]: ERROR: remote_op_timeout:
> Action reboot (37724c6f-191f-407f-ad24-68028d2b6573) for lnx0047b timed
> out
> Aug 11 11:24:28 lnx0047a stonith-ng: [20614]: info: remote_op_done:
> Notifing clients of 37724c6f-191f-407f-ad24-68028d2b6573 (reboot of
> lnx0047b from 11ea7c1e-6034-48e1-b616-a10c92e53a1d by (null)):
> 0, rc=-7
> Aug 11 11:24:28 lnx0047a crmd: [20619]: info: log_data_element:
> tengine_stonith_callback: StonithOp <remote-op state="0"
> st_target="lnx0047b" st_op="reboot" />
> Aug 11 11:24:28 lnx0047a stonith-ng: [20614]: info: stonith_notify_client:
> Sending st_fence-notification to client
> 20619/15310d8c-6640-4799-8655-10d125b467bd
> Aug 11 11:24:28 lnx0047a crmd: [20619]: info: tengine_stonith_callback:
> Stonith operation 75/17:74:0:40ea951f-0c79-43af-8adb-adf8d6defe63:
> Operation timed out (-7)
This timeout seems to be just a few seconds; do you know why?
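Given the 3-5 minute reboot time mentioned above, the stonith-timeout would need to comfortably exceed the time sbd takes to confirm the fence. In crmsh syntax that would look roughly like this (the value is illustrative, not a recommendation):

```shell
# Allow the fencing operation ample time to complete before it is
# declared failed; must cover the sbd msgwait interval with headroom.
crm configure property stonith-timeout=300s
```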
> Aug 11 11:24:28 lnx0047a crmd: [20619]: ERROR: tengine_stonith_callback:
> Stonith of lnx0047b failed (-7)... aborting transition.
> Aug 11 11:24:28 lnx0047a crmd: [20619]: info: abort_transition_graph:
> tengine_stonith_callback:402 - Triggered transition abort (complete=0) :
> Stonith failed
> Aug 11 11:24:28 lnx0047a crmd: [20619]: info: update_abort_priority: Abort
> priority upgraded from 0 to 1000000
> Aug 11 11:24:28 lnx0047a crmd: [20619]: info: update_abort_priority: Abort
> action done superceeded by restart
> Aug 11 11:24:28 lnx0047a crmd: [20619]: info: tengine_stonith_notify: Peer
> lnx0047b was terminated (reboot) by (null) for lnx0047a
> (ref=37724c6f-191f-407f-ad24-68028d2b6573): Operation timed out
> Aug 11 11:24:28 lnx0047a crmd: [20619]: info: run_graph:
> ====================================================
> Aug 11 11:24:28 lnx0047a crmd: [20619]: notice: run_graph: Transition 74
> (Complete=5, Pending=0, Fired=0, Skipped=5, Incomplete=1,
> Source=/var/lib/pengine/pe-error-942.bz2): Stopped
> ...
>
> These entries repeat indefinitely until I manually stop/start the
> sbd_fence resource.
What happened when you did that?
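For the record, the manual stop/start of the resource would be done roughly like this in crmsh (shown here as a sketch of the step being discussed):

```shell
# Stop the sbd fencing resource, then start it again so stonith-ng
# re-registers the device on the surviving node.
crm resource stop sbd_fence
crm resource start sbd_fence
```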
> ------------
> Still not sure why resource lnx0101a will not start on the remaining node...
According to the logs above, the node reboot action failed, which
may be an explanation.
Thanks,
Dejan
> ----------------
> Disclaimer:
> This message is only for informational purposes and is intended solely for
> the use of the addressee.
> ----------------
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker