[Pacemaker] some questions about STONITH

Tue Nov 19 13:10:29 EST 2013

Hi everyone again.

I have started experimenting with STONITH.
I wrote a small external STONITH script.
Its main points:
* It sends the "reboot" command over SSH, authenticating with a key.
* The script takes a single argument: the path to the private key.
* Any node can reboot any other node (even itself).
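
For readers who have not seen an external plugin, the points above could be sketched roughly as follows. This is not the actual sshbykey script, only a minimal illustration. It assumes the usual external-plugin convention that the action (and, for reset, the host) arrive as positional arguments, as in the "sshbykey reset dev-cluster2-node1" call in the log further down, and that parameters such as path2key arrive as environment variables. The names reboot_cmd, plugin_main and DRY_RUN are hypothetical, and logging in as root is an assumption based on the key file name.

```shell
#!/bin/sh
# Hypothetical sketch of an external STONITH plugin in the spirit of "sshbykey".
# Invocation convention assumed: <plugin> <action> [<host>]
# The path2key parameter is read from the environment.

# Print the ssh command line that would reboot the target host.
# $1 = path to the private key, $2 = target host
reboot_cmd() {
    printf 'ssh -i %s -o BatchMode=yes -o ConnectTimeout=10 root@%s reboot\n' "$1" "$2"
}

# Handle one plugin action; returns 0 on success, non-zero on failure.
plugin_main() {
    action=$1
    host=${2:-}
    case "$action" in
        reset)
            # Both the key path and the target host are required.
            [ -n "${path2key:-}" ] && [ -n "$host" ] || return 1
            if [ "${DRY_RUN:-0}" = "1" ]; then
                # DRY_RUN is a made-up switch: only show what would be run.
                reboot_cmd "$path2key" "$host"
            else
                ssh -i "$path2key" -o BatchMode=yes -o ConnectTimeout=10 \
                    "root@$host" reboot
            fi
            ;;
        getconfignames)
            # Names of the parameters this plugin understands.
            echo "path2key"
            ;;
        status)
            # The device is considered usable if the key file is readable.
            [ -r "${path2key:-}" ]
            ;;
        *)
            # Remaining actions (gethosts, getinfo-*, on, off) are omitted here.
            return 1
            ;;
    esac
}

# Dispatch only when called with arguments, as the cluster would do.
if [ "$#" -gt 0 ]; then
    plugin_main "$@"
fi
```

With DRY_RUN=1 and path2key exported, calling the script with "reset dev-cluster2-node1" only prints the ssh command instead of running it, which is a safe way to check the argument handling before letting the cluster use the plugin.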

In the crm config it looks like this:
property $id="cib-bootstrap-options" \
        stonith-enabled="true"
primitive st1 stonith:external/sshbykey \
        params path2key="/opt/cluster_tools_2/keys/root@dev-cluster2-master" pcmk_host_check="none"
clone cloneStonith st1

The first test went fine: the node was rebooted and the resources started.
# export path2key=/opt/cluster_tools_2/keys/root@dev-cluster2-master.unix.tensor.ru
# stonith -t external/sshbykey -E dev-cluster2-node1
info: external_run_cmd: '/usr/lib64/stonith/plugins/external/sshbykey reset dev-cluster2-node1' output: Now boot time 1384850888, send reboot

info: external_run_cmd: '/usr/lib64/stonith/plugins/external/sshbykey reset dev-cluster2-node1' output: Daration: 1340 sec.

info: external_run_cmd: '/usr/lib64/stonith/plugins/external/sshbykey reset dev-cluster2-node1' output: GOOD NEWS: dev-cluster2-node1 booted in 1384864288

Do not pay attention to the "Duration" value: it is caused by a time jump before the clocks of the virtual machine and the server are synchronized. What matters here is that the boot time changed, not the specific number of seconds. Subsequent reboots take 10 - 20 sec.
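
The idea of checking for a changed boot time rather than a specific duration can be sketched like this. It is only an illustration, not code from the real script: the function names boot_time and has_rebooted are hypothetical, and reading btime from /proc/stat assumes a Linux target (in a real check, boot_time would be run on the remote node over SSH).

```shell
#!/bin/sh
# Hypothetical helpers; not taken from the real sshbykey script.

# Print this machine's boot time in seconds since the epoch (Linux only).
boot_time() {
    awk '/^btime/ { print $2 }' /proc/stat
}

# Succeed if the boot time differs between two samples.
# $1 = boot time before the reboot command, $2 = boot time afterwards
has_rebooted() {
    [ "$1" -ne "$2" ]
}

# Example with the two boot times from the log above
if has_rebooted 1384850888 1384864288; then
    echo "node has rebooted"
fi
```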

But further on, there are problems and questions. :)
1. 
The next test:
#stonith_admin --reboot=dev-cluster2-node2
The node reboots, but the resources don't start.
The crm_mon status shows: Node dev-cluster2-node2 (172793105): pending.
And it stays hung in that state.
Then, if I reboot this node from the console, or with stonith, or with stonith_admin (the same command!), the resources start.

Portions of the logs:
   trace: unpack_status:        Processing node id=172793105, uname=dev-cluster2-node2
   trace: find_xml_node:        Could not find transient_attributes in node_state.
   trace: unpack_instance_attributes:   No instance attributes
   trace: unpack_status:        determining node state
   trace: determine_online_status_fencing:      dev-cluster2-node2: in_cluster=false, is_peer=online, join=down, expected=down, term=0
    info: determine_online_status_fencing:      - Node dev-cluster2-node2 is not ready to run resources
   trace: determine_online_status:      Node dev-cluster2-node2 is offline

   ........

   trace: unpack_status:        Processing lrm resource entries on healthy node: dev-cluster2-node2
   trace: find_xml_node:        Could not find lrm in node_state.
   trace: find_xml_node:        Could not find lrm_resources in <NULL>.
   trace: unpack_lrm_resources:         Unpacking resources on dev-cluster2-node2

   ..............
   trace: can_run_resources:    dev-cluster2-node2: online=0, unclean=0, standby=1, maintenance=0
   trace: check_actions:        Skipping param check for dev-cluster2-node2: cant run resources
.......
   trace: native_color:         Pre-allloc: VirtualIP allocation score on dev-cluster2-node2: 0
...........

      <node id="172793105" uname="dev-cluster2-node2">
        <instance_attributes id="nodes-172793105">
          <nvpair id="nodes-172793105-pgsql-data-status" name="pgsql-data-status" value="DISCONNECT"/>
          <nvpair id="nodes-172793105-standby" name="standby" value="false"/>
          <nvpair id="nodes-172793105-thisquorumnode" name="thisquorumnode" value="no"/>
        </instance_attributes>
      </node>

Why does it behave this way?

2. 
There is a slight discrepancy between Pacemaker Explained and stonith_admin --help:
stonith_admin --reboot nodename. 
In one case there is an equals sign, in the other there is not.
It is not very important, because both forms work.
But when you start working and something goes wrong, you begin to suspect everything. :)
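
That both spellings work is consistent with the usual GNU long-option convention, where "--opt value" and "--opt=value" are equivalent for an option that requires an argument. A small sketch with the util-linux getopt utility shows the two forms parsing identically. It assumes a Linux system with util-linux getopt and says nothing about how stonith_admin itself parses its options.

```shell
#!/bin/sh
# Parse the same long option in both spellings with util-linux getopt.
# "reboot:" declares a long option --reboot that requires an argument.
with_equals=$(getopt -o '' -l reboot: -- --reboot=dev-cluster2-node2)
with_space=$(getopt -o '' -l reboot: -- --reboot dev-cluster2-node2)

if [ "$with_equals" = "$with_space" ]; then
    echo "both forms parse identically"
fi
```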

3. 
Andrew! You promised a post about debugging STONITH.

4. (to ALL)
Also, please tell me the real arguments against using SSH for STONITH.
I have my own guesses and thoughts, but I would like to hear about your experience.

My environment:
corosync-2.3.2
resource-agents-3.9.5
pacemaker 1.1.11
----
Thanks in advance,
Andrey Groshev