[Pacemaker] GFS2 with Pacemaker on RHEL6.3 restarts with reboot

Andrew Beekhof andrew at beekhof.net
Sun Aug 12 22:04:20 EDT 2012


On Mon, Aug 13, 2012 at 11:27 AM, Bob Haxo <bhaxo at sgi.com> wrote:
>
> On Fri, 2012-08-10 at 12:21 +1000, Andrew Beekhof wrote:
>> On Thu, Aug 9, 2012 at 12:14 PM, Bob Haxo <bhaxo at sgi.com> wrote:
>> > Greetings.
>> >
>> > I have followed the setup instructions of Clusters from Scratch:
>> > Creating Active/Passive and Active/Active Clusters on Fedora, Edition 5,
>> > including locating the new cman pages that do not seem to be linked
>> > into the main document, for example,
>> >
>> > http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch/ch08s02s02.html
>>
>> The 1.1 document was updated for corosync 2.x
>> I kept the cman/plugin version around but moved it to:
>>
>> http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1-plugin/html/Clusters_from_Scratch/index.html
>>
>> Look for "Version: 1.1-plugin" on the main docs page.
>
> Andrew, much thanks for the response ... and thanks again here ... I
> had not connected the dots that the cman instructions belong to an
> *earlier* version of the docs (and software stack).
>
>>
>> >
>> > The stack that I'm implementing includes RHEL6.3, drbd, dlm, gfs2,
>> > Pacemaker (RHEL6.3 build), cman, kvm ... hopefully I didn't leave
>> > anybody off the party list.
>> >
>> > I have these all working together to support "live" migration of the
>> > virt client between the two phys hosts, so at that level, all is good.
>> >
>> > Questions: Is there a document that fully covers such an
>> > installation, meaning one that extends Clusters from Scratch (and
>> > replaces the Apache example) to implementing an HA virtual client?
>> > For instance, should libvirtd be handled as a Pacemaker resource, or
>> > should it be started as a system service at boot?  What should be
>> > done with "libvirt-guests"?
>>
>> These things I do not know, sorry.
>>
>> >  Should cman be started as a system service at boot?
>>
>> I prefer not to, but it's just a personal preference.
>> I run potentially broken versions of the cluster and have been hit
>> hard before with processes running amok and putting machines into
>> reboot cycles.
>
> Ah, right.  In my testing I too start cman and pacemaker manually.  I
> was thinking more of the move from testing to production.  I think
> you have answered that.
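
For production, if you do decide to start things at boot on RHEL 6,
the usual chkconfig dance should be all it takes (a sketch, assuming
the stock init scripts; cman has to come up before pacemaker, which
the init scripts' chkconfig priorities should already take care of):

    # enable boot-time start on both nodes
    chkconfig cman on
    chkconfig pacemaker on

    # and to go back to manual starts while testing
    chkconfig cman off
    chkconfig pacemaker off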
>
>>
>> >
>> > Problem: When the non-VM host is rebooted, then once Pacemaker
>> > restarts, the gfs2 filesystem gets restarted on the VM host, which
>> > causes a stop and start of the VirtualDomain. The gfs2 filesystem
>> > also gets restarted even without the VirtualDomain resource included.
>>
>> This sounds like the "starting a clone on A causes a restart of the
>> clone on B" bug.
>> I think we've squashed that one now but not in a released version...
>> how confident are you at creating rpms?
>
> :-)  Well, "how confident" depends upon the precise meaning of
> "creating rpms" ... if this is building an rpm given a working spec
> file, then that I can do.  If it is a matter of making mods to an
> almost-working spec file, that I can do too.  If it involves creating
> the spec file from scratch for a large project, that would be a
> challenge.

Yeah, that would be asking a bit much :)

Depending on how "clean" the machine you're working on is, and if it's
running the same software versions as the machine that the results
will be installed on, you /should/ be able to check out the latest git
and run 'make rpm'.
Otherwise you might need to set up mock and run something like
'make mock-epel-6-x86_64' from the top of the latest pacemaker git tree.
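
Roughly, assuming a box with git and the usual build deps installed
(repo URL from memory -- double-check on clusterlabs.org if it has moved):

    # grab the latest pacemaker source
    git clone git://github.com/ClusterLabs/pacemaker.git
    cd pacemaker

    # build against whatever headers/libraries this machine has
    make rpm

    # or, if this box doesn't match the target machines, build in a
    # clean chroot instead (mock itself comes from EPEL)
    yum install mock
    make mock-epel-6-x86_64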

>
> FYI, I'm trying to get Pacemaker accepted for use in a product rather
> than rgmanager.
>
> Thanks, Andrew.
> Bob Haxo
> bhaxo at sgi.com
>
>>
>> > This behavior does not seem correct ... I think I would have flagged
>> > it in my memory if I'd encountered it when working with the SLES HAE
>> > product.  I've been doing a lot of fumbling this past week trying to
>> > get the colocation and order statements correct, without any of it
>> > affecting this behavior.
>> >
>> > What am I missing?
>> >
>> > Here are the first indications of this restart issue, seen as
>> > Pacemaker and friends restart with the boot.  I have attached more
>> > messages.
>> >
>> > Aug  8 20:00:57 hikari crmd[2734]:     info: abort_transition_graph: te_update_diff:176 - Triggered transition abort (complete=1, tag=nvpair, id=status-hikari2-master-drbd_r0.1, name=master-drbd_r0:1, value=5, magic=NA, cib=0.474.170) : Transient attribute: update
>> > Aug  8 20:00:57 hikari crmd[2734]:   notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
>> > Aug  8 20:00:57 hikari pengine[2733]:   notice: unpack_config: On loss of CCM Quorum: Ignore
>> > Aug  8 20:00:57 hikari pengine[2733]:   notice: LogActions: Promote drbd_r0:1#011(Slave -> Master hikari2)
>> > Aug  8 20:00:57 hikari pengine[2733]:   notice: LogActions: Restart virt#011(Started hikari) <<<<<<<<<<<<<<<<<<
>> > Aug  8 20:00:57 hikari pengine[2733]:   notice: LogActions: Restart shared-gfs2:0#011(Started hikari)  <<<<<<<<
>> > Aug  8 20:00:57 hikari pengine[2733]:   notice: LogActions: Start   shared-gfs2:1#011(hikari2)
>> > Aug  8 20:00:57 hikari crmd[2734]:     info: abort_transition_graph: te_update_diff:176 - Triggered transition abort (complete=1, tag=nvpair, id=status-hikari2-master-drbd_r1.1, name=master-drbd_r1:1, value=5, magic=NA, cib=0.474.171) : Transient attribute: update
>> >
>> > Here are the current constraints resulting from the fumbling
>> > (actually, from trying to make sense of all of the information
>> > obtained from Google searches):
>> >
>> > colocation co-gfs-on-drbd inf: c_shared-gfs2 drbd_r0_clone:Master
>> > order o-drbd_r0-then-gfs inf: drbd_r0_clone:promote c_shared-gfs2:start
>> > order o-drbd_r1_clone-then-virt inf: drbd_r1_clone virt
>> > order o-gfs-then-virt inf: c_shared-gfs2 virt
>> >
>> > Full config file attached.
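
One thing that might be worth trying in the meantime (very much a
sketch, untested against your config): make sure both clones carry
interleave="true" in their meta attributes, so that an instance
starting on one node doesn't drag the instance on the other node
along with it.  Something like:

    ms drbd_r0_clone drbd_r0 \
        meta master-max="2" notify="true" interleave="true"
    clone c_shared-gfs2 shared-gfs2 \
        meta interleave="true"

No promises it dodges the bug, but interleaving is the intended knob
for decoupling clone instances across nodes.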
>> >
>> > For reference, here is "service blah status" for the set of services:
>> >
>> > [root at hikari2 ~]# ha-status
>> > ------- service corosync status -------
>> > corosync (pid  1996) is running...
>> > ------- service cman status -------
>> > cluster is running.
>> > ------- service drbd status -------
>> > drbd driver loaded OK; device status:
>> > version: 8.4.1 (api:1/proto:86-100)
>> > GIT-hash: 91b4c048c1a0e06777b5f65d312b38d47abaea80 build by
>> > phil at Build64R6, 2012-04-17 11:28:08
>> > m:res  cs         ro               ds                 p  mounted  fstype
>> > 1:r0   Connected  Primary/Primary  UpToDate/UpToDate  C  /shared  gfs2
>> > 2:r1   Connected  Primary/Primary  UpToDate/UpToDate  C
>> > 3:r2   Connected  Primary/Primary  UpToDate/UpToDate  C
>> > ------- service pacemaker status -------
>> > pacemakerd (pid  8912) is running...
>> > ------- service gfs2 status -------
>> > Configured GFS2 mountpoints:
>> > /shared
>> > Active GFS2 mountpoints:
>> > /shared
>> > ------- service libvirtd status -------
>> > libvirtd (pid  2510) is running...
>> >
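
(Side note: ha-status isn't a stock tool as far as I know -- from the
output it looks like a simple wrapper along these lines, which is a
guess rather than your actual script:

    #!/bin/sh
    # loop over the cluster-related services and show each one's status
    for svc in corosync cman drbd pacemaker gfs2 libvirtd; do
        echo "------- service $svc status -------"
        service "$svc" status
    done
)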
>> > [root at hikari ~]# crm_mon -1ro
>> > ============
>> > Last updated: Wed Aug  8 21:01:47 2012
>> > Last change: Wed Aug  8 20:48:49 2012 via cibadmin on hikari
>> > Stack: cman
>> > Current DC: hikari - partition with quorum
>> > Version: 1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14
>> > 2 Nodes configured, 2 expected votes
>> > 11 Resources configured.
>> > ============
>> >
>> > Online: [ hikari hikari2 ]
>> >
>> > Full list of resources:
>> >
>> >  Master/Slave Set: drbd_r0_clone [drbd_r0]
>> >      Masters: [ hikari hikari2 ]
>> >  Master/Slave Set: drbd_r1_clone [drbd_r1]
>> >      Masters: [ hikari hikari2 ]
>> >  Master/Slave Set: drbd_r2_clone [drbd_r2]
>> >      Masters: [ hikari hikari2 ]
>> >  ipmi-fencing-1 (stonith:fence_ipmilan):        Started hikari
>> >  ipmi-fencing-2 (stonith:fence_ipmilan):        Started hikari2
>> >  virt   (ocf::heartbeat:VirtualDomain): Started hikari
>> >  Clone Set: c_shared-gfs2 [shared-gfs2]
>> >      Started: [ hikari hikari2 ]
>> >
>> > Operations:
>> > * Node hikari2:
>> >    drbd_r1:1: migration-threshold=1000000
>> >     + (17) monitor: interval=60000ms rc=0 (ok)
>> >     + (26) promote: rc=0 (ok)
>> >    drbd_r0:1: migration-threshold=1000000
>> >     + (21) promote: rc=0 (ok)
>> >    drbd_r2:1: migration-threshold=1000000
>> >     + (19) monitor: interval=60000ms rc=0 (ok)
>> >     + (27) promote: rc=0 (ok)
>> >    ipmi-fencing-2: migration-threshold=1000000
>> >     + (12) start: rc=0 (ok)
>> >     + (13) monitor: interval=240000ms rc=0 (ok)
>> >    shared-gfs2:1: migration-threshold=1000000
>> >     + (25) start: rc=0 (ok)
>> > * Node hikari:
>> >    drbd_r1:0: migration-threshold=1000000
>> >     + (24) promote: rc=0 (ok)
>> >    drbd_r2:0: migration-threshold=1000000
>> >     + (25) promote: rc=0 (ok)
>> >    shared-gfs2:0: migration-threshold=1000000
>> >     + (92) start: rc=0 (ok)
>> >    drbd_r0:0: migration-threshold=1000000
>> >     + (23) promote: rc=0 (ok)
>> >    ipmi-fencing-1: migration-threshold=1000000
>> >     + (12) start: rc=0 (ok)
>> >     + (13) monitor: interval=240000ms rc=0 (ok)
>> >    virt: migration-threshold=1000000
>> >     + (120) start: rc=0 (ok)
>> >     + (121) monitor: interval=10000ms rc=0 (ok)
>> >
>> > Thanks for reading ...
>> > Bob Haxo
>> > bhaxo @ sgi.com
>> >