[Pacemaker] some questions about STONITH

Digimer lists at alteeve.ca
Tue Nov 19 13:12:19 EST 2013


SSH-based fencing isn't really fencing. A fence method cannot assume that
the target is in any way functional. A quick way to see why is to crash a
node with 'echo c > /proc/sysrq-trigger'.
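For illustration (node names are taken from the thread below; the key path
and timeout are placeholders), once the target's kernel has crashed an
SSH-based agent has nothing left to talk to:

  # on the target: simulate a hard crash
  echo c > /proc/sysrq-trigger

  # from a surviving peer: the SSH "fence" just hangs or times out, so the
  # agent can never confirm the target is really dead, and the cluster
  # cannot safely recover its resources elsewhere
  ssh -i /path/to/key -o ConnectTimeout=10 root@dev-cluster2-node2 reboot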

digimer

On 19/11/13 13:10, Andrey Groshev wrote:
> Hi everyone again.
> 
> I started experimenting with STONITH.
> I wrote a small external STONITH script.
> Its main points (a simplified sketch follows this list):
> * It sends the "reboot" command over SSH, authenticating with a key.
> * The script takes a single parameter - the path to the private key.
> * Any node can send a reboot to any node (including itself).
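> 
> Roughly, it follows the standard external/* plugin convention: cluster-glue
> calls the plugin as "<plugin> <action> [<host>]" and exports the configured
> parameters (path2key here) as environment variables. A simplified sketch
> (not the exact script; the host list and timeout are just examples):
> 
>   #!/bin/sh
>   # Called by cluster-glue as: sshbykey <action> [<target-host>]
>   case "$1" in
>       gethosts)
>           # hosts this device claims it can fence (example list)
>           echo "dev-cluster2-node1 dev-cluster2-node2"
>           exit 0 ;;
>       reset|off)
>           # ask the target to reboot itself over SSH; this only works
>           # while the target OS is still alive
>           ssh -i "$path2key" -o ConnectTimeout=10 "root@$2" reboot
>           exit $? ;;
>       on)
>           # a dead machine cannot be powered back on over SSH
>           exit 1 ;;
>       status)
>           exit 0 ;;
>       getconfignames)
>           echo "path2key"
>           exit 0 ;;
>       *)
>           # getinfo-* and other actions omitted from this sketch
>           exit 1 ;;
>   esac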
> 
> In the crm config it looks like this:
> property $id="cib-bootstrap-options" \
>         stonith-enabled="true"
> primitive st1 stonith:external/sshbykey \
>         params path2key="/opt/cluster_tools_2/keys/root@dev-cluster2-master" pcmk_host_check="none"
> clone cloneStonith st1
> 
> I ran the first test - OK: the node was rebooted and the resources started.
> # export path2key=/opt/cluster_tools_2/keys/root@dev-cluster2-master.unix.tensor.ru
> # stonith -t external/sshbykey -E dev-cluster2-node1
> info: external_run_cmd: '/usr/lib64/stonith/plugins/external/sshbykey reset dev-cluster2-node1' output: Now boot time 1384850888, send reboot
> 
> info: external_run_cmd: '/usr/lib64/stonith/plugins/external/sshbykey reset dev-cluster2-node1' output: Duration: 1340 sec.
> 
> info: external_run_cmd: '/usr/lib64/stonith/plugins/external/sshbykey reset dev-cluster2-node1' output: GOOD NEWS: dev-cluster2-node1 booted in 1384864288
> 
> Don't pay attention to the "Duration" value; it comes from a time jump between the virtual machine and the server before time synchronization. What matters here is that the value changed, not the specific number of seconds. Subsequent reboots take 10-20 seconds.
> 
> But further on, there are problems and questions. :)
> 1.
> I ran the next test:
> # stonith_admin --reboot=dev-cluster2-node2
> The node reboots, but the resources don't start.
> In crm_mon the status is: Node dev-cluster2-node2 (172793105): pending.
> And it stays stuck like that.
> However, if I then reboot this node from the console, or with stonith, or with stonith_admin (the same command!), the resources start.
> 
> Portions of the logs:
>    trace: unpack_status:        Processing node id=172793105, uname=dev-cluster2-node2
>    trace: find_xml_node:        Could not find transient_attributes in node_state.
>    trace: unpack_instance_attributes:   No instance attributes
>    trace: unpack_status:        determining node state
>    trace: determine_online_status_fencing:      dev-cluster2-node2: in_cluster=false, is_peer=online, join=down, expected=down, term=0
>     info: determine_online_status_fencing:      - Node dev-cluster2-node2 is not ready to run resources
>    trace: determine_online_status:      Node dev-cluster2-node2 is offline
> 
>    ........
>    
>    trace: unpack_status:        Processing lrm resource entries on healthy node: dev-cluster2-node2
>    trace: find_xml_node:        Could not find lrm in node_state.
>    trace: find_xml_node:        Could not find lrm_resources in <NULL>.
>    trace: unpack_lrm_resources:         Unpacking resources on dev-cluster2-node2
> 
>    ..............
>    trace: can_run_resources:    dev-cluster2-node2: online=0, unclean=0, standby=1, maintenance=0
>    trace: check_actions:        Skipping param check for dev-cluster2-node2: cant run resources
> .......
>    trace: native_color:         Pre-allloc: VirtualIP allocation score on dev-cluster2-node2: 0
> ...........
> 
> 
>       <node id="172793105" uname="dev-cluster2-node2">
>         <instance_attributes id="nodes-172793105">
>           <nvpair id="nodes-172793105-pgsql-data-status" name="pgsql-data-status" value="DISCONNECT"/>
>           <nvpair id="nodes-172793105-standby" name="standby" value="false"/>
>           <nvpair id="nodes-172793105-thisquorumnode" name="thisquorumnode" value="no"/>
>         </instance_attributes>
>       </node>
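> (For reference, a node entry like the one above can be dumped straight from the CIB, for example with:
>   # cibadmin -Q -o nodes
> though any CIB query method shows the same thing.)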
> 
> Why does it behave this way?
> 
> 2.
> There is a slight discrepancy between Pacemaker Explained and stonith_admin --help for
> stonith_admin --reboot nodename.
> In one case there is an equals sign, in the other there is not.
> Not very important, because both forms work (see the example below).
> But when you start working and something goes wrong, you begin to suspect everything. :)
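> For example, both of these are accepted:
> # stonith_admin --reboot dev-cluster2-node2
> # stonith_admin --reboot=dev-cluster2-node2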
> 
> 3.
> Andrew! You promised a post about STONITH debugging.
> 
> 4. (to ALL)
> Also, please tell me the real arguments against using SSH for STONITH.
> I have my own guesses and thoughts, but I would like to hear about your experience.
> 
> My environment:
> corosync-2.3.2
> resource-agents-3.9.5
> pacemaker 1.1.11
> ----
> Thanks in advance,
> Andrey Groshev
> 
> _______________________________________________
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 


-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?



