[Pacemaker] "stonith_admin -F node" results in a pair of reboots
Fabio M. Di Nitto
fdinitto at redhat.com
Tue Jan 7 08:21:54 UTC 2014
On 1/6/2014 6:24 PM, Bob Haxo wrote:
> Hi Fabio,
>
>>> There is an example on how to configure gfs2 also in the rhel6.5
>>> pacemaker documentation, using pcs.
>
> Super! Please share the link to this documentation. I only discovered
> the gfs2+pcs example with the rhel7 beta docs.
You are right, the gfs2 example was not published in Rev 1 of the
pacemaker documentation for RHEL6.5. It's entirely possible I missed it
during doc review, sorry about that!
https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Configuring_the_Red_Hat_High_Availability_Add-On_with_Pacemaker/index.html
Short version is:
chkconfig cman on
chkconfig clvmd on
chkconfig pacemaker on
Use the above doc to setup / start the cluster (stop after stonith config)
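A minimal sketch of that setup phase, assuming the pcs workflow from the doc above; the node names, cluster name, and fence-device parameters here are made up and must be replaced with your own:

```shell
# Hypothetical node names (node1/node2) and IPMI credentials -- substitute yours.
# Create and start the cluster (RHEL 6.5 pcs on the cman stack):
pcs cluster setup --name mycluster node1 node2
pcs cluster start --all

# One IPMI fence device per node, each forbidden from running on the node
# it fences:
pcs stonith create fence_node1 fence_ipmilan ipaddr=10.0.0.1 \
    login=admin passwd=secret lanplus=1 \
    pcmk_host_check=static-list pcmk_host_list=node1
pcs constraint location fence_node1 avoids node1
```

Stop at this point, per the doc, before adding any other resources.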
Set up your clvmd storage (note that in RHEL6.5 neither dlm nor clvmd is
managed by pacemaker, whereas in RHEL7 it's all managed by pacemaker).
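A rough sketch of the clustered-storage step, assuming the init-script-managed dlm/clvmd of RHEL6.5; the device, VG/LV names, and cluster name are hypothetical:

```shell
# Switch LVM to cluster-wide (dlm) locking and start clvmd via its init
# script -- on RHEL6.5 this is *not* a pacemaker resource:
lvmconf --enable-cluster
service clvmd start

# Create a clustered VG (-cy) and a gfs2 filesystem on it.
# "mycluster" must match the cluster name; -j 2 = one journal per node.
vgcreate -cy vg_cluster /dev/sdb
lvcreate -n lv_gfs2 -l 100%FREE vg_cluster
mkfs.gfs2 -p lock_dlm -t mycluster:gfs2fs -j 2 /dev/vg_cluster/lv_gfs2
```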
Start adding your resources/services here etc...
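For example, a gfs2 Filesystem resource cloned across both nodes might look like the following; the device path and mount point are placeholders, and on-fail=fence mirrors the configuration quoted further down this thread:

```shell
# Hypothetical gfs2 mount as a cloned pacemaker resource:
pcs resource create clusterfs ocf:heartbeat:Filesystem \
    device=/dev/vg_cluster/lv_gfs2 directory=/mnt/gfs2 fstype=gfs2 \
    options=noatime op monitor interval=30s on-fail=fence
pcs resource clone clusterfs interleave=true
```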
Also, make absolutely sure you have all the latest updates from the 6.5
errata installed.
Fabio
>
> Bob Haxo
>
>
>
> On Sat, 2014-01-04 at 16:56 +0100, Fabio M. Di Nitto wrote:
>> On 01/01/2014 01:57 AM, Bob Haxo wrote:
>> > Greetings ... Happy New Year!
>> >
>> > I am testing a configuration that is created from example in "Chapter 6.
>> > Configuring a GFS2 File System in a Cluster" of the "Red Hat Enterprise
>> > Linux 7.0 Beta Global File System 2" document. Only addition is
>> > stonith:fence_ipmilan. After encountering this issue when I configured
>> > with "crm", I re-configured using "pcs". I've included the configuration
>> > below.
>>
>> Hold on a second here.. why are you using RHEL7 documentation to
>> configure RHEL6.5? Please don't mix :) There are some differences, and we
>> definitely never tested mixing those up.
>>
>> There is an example on how to configure gfs2 also in the rhel6.5
>> pacemaker documentation, using pcs.
>>
>> I personally never saw this behaviour, so it's entirely possible that
>> mixing things up will result in an unpredictable state.
>>
>> Fabio
>>
>> >
>> > I'm thinking that, in a 2-node cluster, if I run "stonith_admin -F
>> > <peer-node>", then <peer-node> should reboot and cleanly rejoin the
>> > cluster. This is not happening.
>> >
>> > What ultimately happens is that after the initially fenced node reboots,
>> > the system from which the stonith_admin -F command was run is fenced and
>> > reboots. The fencing stops there, leaving the cluster in an appropriate
>> > state.
>> >
>> > The issue seems to reside with clvmd/lvm. When the initially fenced
>> > node reboots, the clvmd resource fails on the surviving node with a
>> > flood of errors. I hypothesize there is an issue with locks, but I
>> > have insufficient knowledge of clvmd/lvm locking to prove or disprove
>> > this hypothesis.
>> >
>> > Have I missed something ...
>> >
>> > 1) Is this expected behavior, i.e., does the fencing node always get
>> > rebooted in turn?
>> >
>> > 2) Or, maybe I didn't correctly duplicate the Chapter 6 example?
>> >
>> > 3) Or, perhaps something is wrong or omitted from the Chapter 6 example?
>> >
>> > Suggestions will be much appreciated.
>> >
>> > Thanks,
>> > Bob Haxo
>> >
>> > RHEL6.5
>> > pacemaker-cli-1.1.10-14.el6_5.1.x86_64
>> > crmsh-1.2.5-55.1sgi709r3.rhel6.x86_64
>> > pacemaker-libs-1.1.10-14.el6_5.1.x86_64
>> > cman-3.0.12.1-59.el6_5.1.x86_64
>> > pacemaker-1.1.10-14.el6_5.1.x86_64
>> > corosynclib-1.4.1-17.el6.x86_64
>> > corosync-1.4.1-17.el6.x86_64
>> > pacemaker-cluster-libs-1.1.10-14.el6_5.1.x86_64
>> >
>> > Cluster Name: mici
>> > Corosync Nodes:
>> >
>> > Pacemaker Nodes:
>> > mici-admin mici-admin2
>> >
>> > Resources:
>> > Clone: clusterfs-clone
>> > Meta Attrs: interleave=true target-role=Started
>> > Resource: clusterfs (class=ocf provider=heartbeat type=Filesystem)
>> > Attributes: device=/dev/vgha2/lv_clust2 directory=/images fstype=gfs2
>> > options=defaults,noatime,nodiratime
>> > Operations: monitor on-fail=fence interval=30s
>> > (clusterfs-monitor-interval-30s)
>> > Clone: clvmd-clone
>> > Meta Attrs: interleave=true ordered=true target-role=Started
>> > Resource: clvmd (class=lsb type=clvmd)
>> > Operations: monitor on-fail=fence interval=30s
>> > (clvmd-monitor-interval-30s)
>> > Clone: dlm-clone
>> > Meta Attrs: interleave=true ordered=true
>> > Resource: dlm (class=ocf provider=pacemaker type=controld)
>> > Operations: monitor on-fail=fence interval=30s (dlm-monitor-interval-30s)
>> >
>> > Stonith Devices:
>> > Resource: p_ipmi_fencing_1 (class=stonith type=fence_ipmilan)
>> > Attributes: ipaddr=128.##.##.78 login=XXXXX passwd=XXXXX lanplus=1
>> > action=reboot pcmk_host_check=static-list pcmk_host_list=mici-admin
>> > Meta Attrs: target-role=Started
>> > Operations: monitor start-delay=30 interval=60s timeout=30
>> > (p_ipmi_fencing_1-monitor-60s)
>> > Resource: p_ipmi_fencing_2 (class=stonith type=fence_ipmilan)
>> > Attributes: ipaddr=128.##.##.220 login=XXXXX passwd=XXXXX lanplus=1
>> > action=reboot pcmk_host_check=static-list pcmk_host_list=mici-admin2
>> > Meta Attrs: target-role=Started
>> > Operations: monitor start-delay=30 interval=60s timeout=30
>> > (p_ipmi_fencing_2-monitor-60s)
>> > Fencing Levels:
>> >
>> > Location Constraints:
>> > Resource: p_ipmi_fencing_1
>> > Disabled on: mici-admin (score:-INFINITY)
>> > (id:location-p_ipmi_fencing_1-mici-admin--INFINITY)
>> > Resource: p_ipmi_fencing_2
>> > Disabled on: mici-admin2 (score:-INFINITY)
>> > (id:location-p_ipmi_fencing_2-mici-admin2--INFINITY)
>> > Ordering Constraints:
>> > start dlm-clone then start clvmd-clone (Mandatory)
>> > (id:order-dlm-clone-clvmd-clone-mandatory)
>> > start clvmd-clone then start clusterfs-clone (Mandatory)
>> > (id:order-clvmd-clone-clusterfs-clone-mandatory)
>> > Colocation Constraints:
>> > clusterfs-clone with clvmd-clone (INFINITY)
>> > (id:colocation-clusterfs-clone-clvmd-clone-INFINITY)
>> > clvmd-clone with dlm-clone (INFINITY)
>> > (id:colocation-clvmd-clone-dlm-clone-INFINITY)
>> >
>> > Cluster Properties:
>> > cluster-infrastructure: cman
>> > dc-version: 1.1.10-14.el6_5.1-368c726
>> > last-lrm-refresh: 1388530552
>> > no-quorum-policy: ignore
>> > stonith-enabled: true
>> > Node Attributes:
>> > mici-admin: standby=off
>> > mici-admin2: standby=off
>> >
>> >
>> > Last updated: Tue Dec 31 17:15:55 2013
>> > Last change: Tue Dec 31 16:57:37 2013 via cibadmin on mici-admin
>> > Stack: cman
>> > Current DC: mici-admin2 - partition with quorum
>> > Version: 1.1.10-14.el6_5.1-368c726
>> > 2 Nodes configured
>> > 8 Resources configured
>> >
>> > Online: [ mici-admin mici-admin2 ]
>> >
>> > Full list of resources:
>> >
>> > p_ipmi_fencing_1 (stonith:fence_ipmilan): Started mici-admin2
>> > p_ipmi_fencing_2 (stonith:fence_ipmilan): Started mici-admin
>> > Clone Set: clusterfs-clone [clusterfs]
>> > Started: [ mici-admin mici-admin2 ]
>> > Clone Set: clvmd-clone [clvmd]
>> > Started: [ mici-admin mici-admin2 ]
>> > Clone Set: dlm-clone [dlm]
>> > Started: [ mici-admin mici-admin2 ]
>> >
>> > Migration summary:
>> > * Node mici-admin:
>> > * Node mici-admin2:
>> >
>> > =====================================================
>> > crm_mon after the fenced node reboots, showing the clvmd failure that
>> > then occurs, which in turn triggers fencing of that node
>> >
>> > Last updated: Tue Dec 31 17:06:55 2013
>> > Last change: Tue Dec 31 16:57:37 2013 via cibadmin on mici-admin
>> > Stack: cman
>> > Current DC: mici-admin - partition with quorum
>> > Version: 1.1.10-14.el6_5.1-368c726
>> > 2 Nodes configured
>> > 8 Resources configured
>> >
>> > Node mici-admin: UNCLEAN (online)
>> > Online: [ mici-admin2 ]
>> >
>> > Full list of resources:
>> >
>> > p_ipmi_fencing_1 (stonith:fence_ipmilan): Stopped
>> > p_ipmi_fencing_2 (stonith:fence_ipmilan): Started mici-admin
>> > Clone Set: clusterfs-clone [clusterfs]
>> > Started: [ mici-admin ]
>> > Stopped: [ mici-admin2 ]
>> > Clone Set: clvmd-clone [clvmd]
>> > clvmd (lsb:clvmd): FAILED mici-admin
>> > Stopped: [ mici-admin2 ]
>> > Clone Set: dlm-clone [dlm]
>> > Started: [ mici-admin mici-admin2 ]
>> >
>> > Migration summary:
>> > * Node mici-admin:
>> > clvmd: migration-threshold=1000000 fail-count=1 last-failure='Tue Dec
>> > 31 17:04:29 2013'
>> > * Node mici-admin2:
>> >
>> > Failed actions:
>> > clvmd_monitor_30000 on mici-admin 'unknown error' (1): call=60,
>> > status=Timed Out, last-rc-change='Tue Dec 31 17:04:29 2013',
>> > queued=0ms, exec=0ms
>> >
>> > _______________________________________________
>> > Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>> > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>> >
>> > Project Home: http://www.clusterlabs.org
>> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> > Bugs: http://bugs.clusterlabs.org
>> >
>>
>>
>
>
>