[Pacemaker] no actions on lost CIB access

Tue May 6 17:27:52 UTC 2014

On Mon, May 5, 2014 at 7:32 PM, Andrew Beekhof <andrew at beekhof.net> wrote:

>
> On 3 May 2014, at 6:20 am, Radoslaw Garbacz <
> radoslaw.garbacz at xtremedatainc.com> wrote:
>
> > Hi,
> >
> > I have a strange situation and would like to ask whether it is a bug,
> > a misconfiguration, or intended behavior.
>
> Short version: That's not a valid test.
> Medium version: That's not a valid test, and there are updates available
> for pacemaker in el6.
> Long version: Using iptables in this way not only stops the cluster from
> seeing its peer, but also stops the cluster from talking to itself on the
> same node, at which point nothing will work.
>
> Did you configure fencing?
>

Yes.
Thank you for your suggestions and help.
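
A variant of the quoted ruleset that keeps lo open while still cutting the
node off from its peers might look like the following. This is only a
sketch: the two lo rules are an addition that is not in the original
ruleset, and whether this makes for a sufficient isolation test is an
assumption. <my_machine> is the same placeholder as in the original.

```
*filter
:INPUT ACCEPT [0:0]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
# Assumption: accept all loopback traffic so that local daemons can still
# talk to each other on the same node
-A INPUT -i lo -j ACCEPT
-A OUTPUT -o lo -j ACCEPT
# Unchanged from the original ruleset: SSH from/to one machine, drop the rest
-A INPUT -p tcp -m tcp -s <my_machine> --dport 22 -j ACCEPT
-A INPUT -j DROP
-A OUTPUT -p tcp -m tcp -d <my_machine> --sport 22 -j ACCEPT
-A OUTPUT -j DROP
COMMIT
```

The lo rules come before the DROP rules because iptables evaluates a chain
in order and stops at the first matching terminating target.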

>
> >
> > A disconnected node does not detect that it is lost and takes no action
> > to stop its resources, even though the resource agents report errors
> > when monitored; only the number of processes (from hung resource
> > agents) keeps growing.
> >
> > It seems that pacemaker ignores timeouts when trying to update the CIB.
> >
> > The situation arises because corosync does not detect the lost quorum
> > when the firewall blocks lo. As far as I can tell, this prevents
> > corosync from detecting problems with the cluster, and once lo access is
> > restored everything should be fine. But shouldn't pacemaker detect that
> > the CIB service is lost and do something about it? Maybe there is a
> > configuration parameter to control this?
> >
> > Technical details:
> >
> > 1)
> > 1.1) machine: Amazon Linux: Linux ... 3.10.35-43.137.amzn1.x86_64 #1 SMP Wed Apr 2 09:36:59 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
> > 1.2) Pacemaker:  Pacemaker 1.1.9-1512.el6
> > 1.3) corosync: Corosync Cluster Engine, version '2.3.2'
> >
> >
> > 2) Net: basic: ethx, lo
> >
> > 3) iptables:
> > *filter
> > :INPUT ACCEPT [0:0]
> > :FORWARD ACCEPT [0:0]
> > :OUTPUT ACCEPT [0:0]
> > -A INPUT -p tcp -m tcp -s <my_machine> --dport 22 -j ACCEPT
> > -A INPUT -j DROP
> > -A OUTPUT -p tcp -m tcp -d <my_machine> --sport 22 -j ACCEPT
> > -A OUTPUT -j DROP
> > COMMIT
> >
> > 4) crm config:
> > <crm_config>
> >   <cluster_property_set id="cib-bootstrap-options">
> >     <nvpair id="cib-bootstrap-options-stonith-enabled" name="stonith-enabled" value="false"/>
> >     <nvpair id="cib-bootstrap-options-no-quorum-policy" name="no-quorum-policy" value="stop"/>
> >     <nvpair id="cib-bootstrap-options-stop-orphan-resources" name="stop-orphan-resources" value="true"/>
> >     <nvpair id="cib-bootstrap-options-start-failure-is-fatal" name="start-failure-is-fatal" value="true"/>
> >     <nvpair id="cib-bootstrap-options-expected-quorum-votes" name="expected-quorum-votes" value="3"/>
> >     <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="1.1.9-1512.el6-2a917dd"/>
> >     <nvpair id="cib-bootstrap-options-cluster-infrastructure" name="cluster-infrastructure" value="corosync"/>
> >   </cluster_property_set>
> > </crm_config>
> >
> >
> > 5) Example resource config:
> >     <primitive class="ocf" id="dbx_ready_nodes" provider="dbxcl" type="ready.ocf.sh">
> >       <instance_attributes id="dbx_ready_nodes-instance_attributes">
> >         <nvpair id="dbx_ready_nodes-instance_attributes-dbxclrole" name="dbxclrole" value="''"/>
> >       </instance_attributes>
> >       <operations>
> >         <op id="dbx_ready_nodes-start-timeout-1min-on-fail-stop" interval="0s" name="start" on-fail="stop" timeout="1min"/>
> >         <op id="dbx_ready_nodes-stop-timeout-8min" interval="0s" name="stop" timeout="8min"/>
> >         <op id="dbx_ready_nodes-monitor-interval-83s" interval="83s" name="monitor" on-fail="stop" timeout="60s"/>
> >         <op id="dbx_ready_nodes-validate-all-interval-29s" interval="29s" name="validate-all" on-fail="stop" timeout="60s"/>
> >       </operations>
> >     </primitive>
> >
> >
> > 6) Logs:
> > Below, the monitor action of the resource "dbx_ready_nodes" returns an
> > error, but nothing happens: the resource is not asked to stop (even
> > though it should be, given the on-fail="stop" setting shown above).
> >
> > May 02 20:04:13 [16191] ip-10-116-169-85       lrmd:    debug: operation_finished:      dbx_ready_nodes_monitor_83000:8669 - exited with rc=1
> > May 02 20:04:13 [16191] ip-10-116-169-85       lrmd:    debug: log_finished:    finished - rsc:dbx_ready_nodes action:monitor call_id:142 pid:8669 exit-code:1 exec-time:0ms queue-time:0ms
> > May 02 20:04:13 [16154] ip-10-116-169-85 corosync debug   [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1)
> > May 02 20:04:13 [16154] ip-10-116-169-85 corosync debug   [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1)
> > May 02 20:04:13 [16154] ip-10-116-169-85 corosync debug   [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1)
> > May 02 20:04:13 [16154] ip-10-116-169-85 corosync debug   [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1)
> > May 02 20:04:13 [16154] ip-10-116-169-85 corosync debug   [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1)
> > May 02 20:04:13 [16154] ip-10-116-169-85 corosync debug   [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1)
> > May 02 20:04:13 [16154] ip-10-116-169-85 corosync warning [MAIN  ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly.
> >
> >
> > Thanks in advance
> >
> > --
> > Best Regards,
> >
> > Radoslaw Garbacz
> > XtremeData Incorporation
> > _______________________________________________
> > Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> >
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
>

-- 
Best Regards,

Radoslaw Garbacz
XtremeData Incorporation