[Pacemaker] some questions about STONITH
Digimer
lists at alteeve.ca
Tue Nov 19 18:12:19 UTC 2013
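SSH-based fencing isn't fencing. A fence method cannot assume that the target
is in any way functional: if the node is hung or crashed, nothing is left on it
to accept the SSH login. A quick way to see why is to crash a node with
'echo c > /proc/sysrq-trigger' and then try to fence it.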
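
As a rough illustration of the point above, here is a minimal sketch of what an SSH-based agent can actually conclude from ssh's exit status. ssh returns the remote command's status, or 255 if ssh itself failed. The function name and the three labels are made up for this example; they are not part of any Pacemaker or cluster-glue API.

```shell
#!/bin/sh
# Sketch only: classify the exit status of an ssh-delivered "reboot".
# classify_ssh_fence and its labels are hypothetical names for illustration.

classify_ssh_fence() {
    rc=$1
    case "$rc" in
        0)   echo "delivered" ;;  # target was alive enough to accept the command
        255) echo "unknown" ;;    # ssh could not connect: target may be dead, hung, or only partitioned
        *)   echo "refused" ;;    # connected, but the remote command failed
    esac
}

# Hypothetical use against a node held in $node (not executed here):
#   ssh -o BatchMode=yes -o ConnectTimeout=10 "root@$node" reboot
#   classify_ssh_fence $?
```

After a kernel crash the agent can only ever land in the "unknown" branch, which is exactly the case where the cluster most needs a definite answer.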
digimer
On 19/11/13 13:10, Andrey Groshev wrote:
> Hi everyone again.
>
> I have started experimenting with STONITH.
> I wrote a small external STONITH script.
> Its main points:
> * It sends the "reboot" command over SSH, authenticating with a key.
> * The script takes a single argument: the path to the private key.
> * Any node can reboot any other node (even itself).
>
> In the crm config it looks like this:
> property $id="cib-bootstrap-options" \
> stonith-enabled="true"
> primitive st1 stonith:external/sshbykey \
> params path2key="/opt/cluster_tools_2/keys/root@dev-cluster2-master" pcmk_host_check="none"
> clone cloneStonith st1
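
For readers following along: the log lines further down show the plugin being called as 'sshbykey reset dev-cluster2-node1', that is, with the action first and the target second, and the test exports path2key into the environment beforehand. Under those assumptions, a plugin like this might be laid out roughly as follows. This is a hypothetical skeleton, not Andrey's actual script; the function name and the ssh options chosen are illustrative.

```shell
#!/bin/sh
# Hypothetical skeleton of an external STONITH plugin such as "sshbykey".
# Assumptions: the action is $1, the target host is $2, and the parameter
# path2key arrives as an environment variable.

sshbykey_dispatch() {
    action=$1
    target=$2
    case "$action" in
        getconfignames)
            # Names of the parameters this plugin expects.
            echo "path2key"
            ;;
        reset)
            # Ask the target to reboot itself over SSH. This only works
            # while the target is healthy enough to run sshd.
            [ -n "$target" ] || return 1
            ssh -i "$path2key" -o BatchMode=yes -o ConnectTimeout=10 \
                "root@$target" reboot
            ;;
        status)
            # Treat the device as usable only if the key file is readable.
            [ -r "$path2key" ]
            ;;
        *)
            # gethosts, on, off and the getinfo-* actions are omitted here.
            return 1
            ;;
    esac
}
```

A real plugin would end with a call such as sshbykey_dispatch "$@" and would implement the omitted actions as well.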
>
> I ran the first test and it went fine: the node was rebooted and the resources started.
> # export path2key=/opt/cluster_tools_2/keys/root@dev-cluster2-master.unix.tensor.ru
> # stonith -t external/sshbykey -E dev-cluster2-node1
> info: external_run_cmd: '/usr/lib64/stonith/plugins/external/sshbykey reset dev-cluster2-node1' output: Now boot time 1384850888, send reboot
>
> info: external_run_cmd: '/usr/lib64/stonith/plugins/external/sshbykey reset dev-cluster2-node1' output: Daration: 1340 sec.
>
> info: external_run_cmd: '/usr/lib64/stonith/plugins/external/sshbykey reset dev-cluster2-node1' output: GOOD NEWS: dev-cluster2-node1 booted in 1384864288
>
> Don't pay attention to the "Duration" value; it comes from a time jump that happens before the clocks of the virtual machine and the server are synchronized. What matters here is that the boot time changed, not the specific number of seconds. Later reboots take 10 - 20 sec.
>
> Beyond that, though, I have problems and questions. :)
> 1.
> The next test:
> # stonith_admin --reboot=dev-cluster2-node2
> The node reboots, but its resources don't start.
> crm_mon shows: Node dev-cluster2-node2 (172793105): pending.
> And it stays stuck in that state.
> If I then reboot this node again, whether from the console, with stonith, or with stonith_admin (the same command!), the resources start.
>
> Portions of the logs:
> trace: unpack_status: Processing node id=172793105, uname=dev-cluster2-node2
> trace: find_xml_node: Could not find transient_attributes in node_state.
> trace: unpack_instance_attributes: No instance attributes
> trace: unpack_status: determining node state
> trace: determine_online_status_fencing: dev-cluster2-node2: in_cluster=false, is_peer=online, join=down, expected=down, term=0
> info: determine_online_status_fencing: - Node dev-cluster2-node2 is not ready to run resources
> trace: determine_online_status: Node dev-cluster2-node2 is offline
>
> ........
>
> trace: unpack_status: Processing lrm resource entries on healthy node: dev-cluster2-node2
> trace: find_xml_node: Could not find lrm in node_state.
> trace: find_xml_node: Could not find lrm_resources in <NULL>.
> trace: unpack_lrm_resources: Unpacking resources on dev-cluster2-node2
>
> ..............
> trace: can_run_resources: dev-cluster2-node2: online=0, unclean=0, standby=1, maintenance=0
> trace: check_actions: Skipping param check for dev-cluster2-node2: cant run resources
> .......
> trace: native_color: Pre-allloc: VirtualIP allocation score on dev-cluster2-node2: 0
> ...........
>
>
> <node id="172793105" uname="dev-cluster2-node2">
>   <instance_attributes id="nodes-172793105">
>     <nvpair id="nodes-172793105-pgsql-data-status" name="pgsql-data-status" value="DISCONNECT"/>
>     <nvpair id="nodes-172793105-standby" name="standby" value="false"/>
>     <nvpair id="nodes-172793105-thisquorumnode" name="thisquorumnode" value="no"/>
>   </instance_attributes>
> </node>
>
> Why does it behave this way?
>
> 2.
> There is a slight discrepancy between "Pacemaker Explained" and stonith_admin --help:
> stonith_admin --reboot nodename
> One of them uses an equals sign and the other does not.
> It is not very important, because both forms work.
> But when you are just starting out and something goes wrong, everything begins to look suspicious. :)
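
On the equals sign: with GNU-style long options, "--name=value" and "--name value" are normally equivalent whenever the option takes a required argument. The sketch below shows this with getopt(1) from util-linux rather than with stonith_admin itself. That stonith_admin parses its options in the same style is an assumption, though it would match the observation that both forms work.

```shell
#!/bin/sh
# Sketch: GNU-style long options accept "--name=value" and "--name value" alike.
# Uses getopt(1) from util-linux; "reboot" is declared with a required argument.
# normalize_opts is a hypothetical helper name for this example.

normalize_opts() {
    getopt -o '' -l reboot: -- "$@"
}

with_equals=$(normalize_opts --reboot=dev-cluster2-node2)
with_space=$(normalize_opts --reboot dev-cluster2-node2)

if [ "$with_equals" = "$with_space" ]; then
    echo "both forms parse identically"
fi
```

So the difference between the two documents is most likely cosmetic and unrelated to the "pending" state in question 1.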
>
> 3.
> Andrew! You promised a post about debugging STONITH.
>
> 4. (to ALL)
> Also, please tell me the real arguments against using SSH for STONITH.
> I have my own guesses and thoughts, but I would like to hear about your experience.
>
> My environment:
> corosync-2.3.2
> resource-agents-3.9.5
> pacemaker 1.1.11
> ----
> Thanks in advance,
> Andrey Groshev
>
--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?