[Pacemaker] no actions on lost CIB access
Andrew Beekhof
andrew at beekhof.net
Tue May 6 00:32:58 UTC 2014
On 3 May 2014, at 6:20 am, Radoslaw Garbacz <radoslaw.garbacz at xtremedatainc.com> wrote:
> Hi,
>
> I have a strange situation, which I would like to ask about, whether it is a bug, misconfiguration or an intended behavior.
Short version: That's not a valid test.
Medium version: That's not a valid test, and there are updates available for pacemaker in el6.
Long version: Using iptables in this way not only stops the cluster from seeing its peer, but also stops the cluster from talking to itself on the same node, at which point nothing will work.
Did you configure fencing?
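With stonith-enabled=false (as in the crm_config below), a node that stops responding can never be safely recovered by the rest of the cluster. A minimal sketch of turning fencing on, assuming pcs is available and the nodes have IPMI-capable management boards (the device name, address, and credentials here are placeholders, not values from this cluster):

```shell
# Enable fencing cluster-wide (illustrative; run once on any node)
pcs property set stonith-enabled=true

# Register an IPMI fence device for one node
# (fence_node1, node1, the address, and the credentials are all placeholders)
pcs stonith create fence_node1 fence_ipmilan \
    pcmk_host_list="node1" ipaddr="10.0.0.1" \
    login="admin" passwd="secret" \
    op monitor interval=60s
```

Without a working fence device, Pacemaker has no safe way to act on a node it can no longer reach.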
>
> A disconnected node does not detect that it is lost and takes no action to stop its resources, even though the resource agents report errors when monitored; the number of processes (from hung resource agents) just keeps growing.
>
> It seems that pacemaker ignores timeouts when trying to update the CIB.
>
> The situation is caused by corosync not detecting the lost quorum, because the firewall blocks lo. As far as I can tell, this prevents corosync from detecting problems with the cluster, and once lo access is restored everything should be fine. But shouldn't pacemaker detect that the CIB service is lost and do something about it? Is there a configuration parameter to control this?
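One way to see whether the local cib daemon is still answering is to query it directly; a non-zero exit status means the CIB could not be reached (illustrative, to be run on the affected node):

```shell
# Query the live CIB; the exit status is non-zero if the cib
# daemon cannot be contacted or the query times out
cibadmin --query > /dev/null
echo $?
```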
>
> Technical details:
>
> 1)
> 1.1) machine: Amazon Linux: Linux ... 3.10.35-43.137.amzn1.x86_64 #1 SMP Wed Apr 2 09:36:59 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
> 1.2) Pacemaker: Pacemaker 1.1.9-1512.el6
> 1.3) corosync: Corosync Cluster Engine, version '2.3.2'
>
>
> 2) Net: basic: ethx, lo
>
> 3) iptables:
> *filter
> :INPUT ACCEPT [0:0]
> :FORWARD ACCEPT [0:0]
> :OUTPUT ACCEPT [0:0]
> -A INPUT -p tcp -m tcp -s <my_machine> --dport 22 -j ACCEPT
> -A INPUT -j DROP
> -A OUTPUT -p tcp -m tcp -d <my_machine> --sport 22 -j ACCEPT
> -A OUTPUT -j DROP
> COMMIT
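The rules above drop loopback traffic along with everything else, which is why corosync cannot even talk to itself. A test that isolates the node from its peers while leaving the node internally functional would need explicit ACCEPT rules for lo ahead of the DROPs, e.g. (a sketch only, keeping the original placeholder <my_machine>):

```shell
*filter
:INPUT ACCEPT [0:0]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
# Allow the node to talk to itself before any DROP rules match
-A INPUT -i lo -j ACCEPT
-A OUTPUT -o lo -j ACCEPT
-A INPUT -p tcp -m tcp -s <my_machine> --dport 22 -j ACCEPT
-A INPUT -j DROP
-A OUTPUT -p tcp -m tcp -d <my_machine> --sport 22 -j ACCEPT
-A OUTPUT -j DROP
COMMIT
```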
>
> 4) crm config:
> <crm_config>
> <cluster_property_set id="cib-bootstrap-options">
> <nvpair id="cib-bootstrap-options-stonith-enabled" name="stonith-enabled" value="false"/>
> <nvpair id="cib-bootstrap-options-no-quorum-policy" name="no-quorum-policy" value="stop"/>
> <nvpair id="cib-bootstrap-options-stop-orphan-resources" name="stop-orphan-resources" value="true"/>
> <nvpair id="cib-bootstrap-options-start-failure-is-fatal" name="start-failure-is-fatal" value="true"/>
> <nvpair id="cib-bootstrap-options-expected-quorum-votes" name="expected-quorum-votes" value="3"/>
> <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="1.1.9-1512.el6-2a917dd"/>
> <nvpair id="cib-bootstrap-options-cluster-infrastructure" name="cluster-infrastructure" value="corosync"/>
> </cluster_property_set>
> </crm_config>
>
>
> 5) Example resource config:
> <primitive class="ocf" id="dbx_ready_nodes" provider="dbxcl" type="ready.ocf.sh">
> <instance_attributes id="dbx_ready_nodes-instance_attributes">
> <nvpair id="dbx_ready_nodes-instance_attributes-dbxclrole" name="dbxclrole" value="''"/>
> </instance_attributes>
> <operations>
> <op id="dbx_ready_nodes-start-timeout-1min-on-fail-stop" interval="0s" name="start" on-fail="stop" timeout="1min"/>
> <op id="dbx_ready_nodes-stop-timeout-8min" interval="0s" name="stop" timeout="8min"/>
> <op id="dbx_ready_nodes-monitor-interval-83s" interval="83s" name="monitor" on-fail="stop" timeout="60s"/>
> <op id="dbx_ready_nodes-validate-all-interval-29s" interval="29s" name="validate-all" on-fail="stop" timeout="60s"/>
> </operations>
> </primitive>
>
>
> 6) Logs:
> Below, the monitor action for the resource "dbx_ready_nodes" returns an error, but nothing happens; the resource is not asked to stop (even though it should be, given the configuration above):
>
> May 02 20:04:13 [16191] ip-10-116-169-85 lrmd: debug: operation_finished: dbx_ready_nodes_monitor_83000:8669 - exited with rc=1
> May 02 20:04:13 [16191] ip-10-116-169-85 lrmd: debug: log_finished: finished - rsc:dbx_ready_nodes action:monitor call_id:142 pid:8669 exit-code:1 exec-time:0ms queue-time:0ms
> May 02 20:04:13 [16154] ip-10-116-169-85 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1)
> May 02 20:04:13 [16154] ip-10-116-169-85 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1)
> May 02 20:04:13 [16154] ip-10-116-169-85 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1)
> May 02 20:04:13 [16154] ip-10-116-169-85 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1)
> May 02 20:04:13 [16154] ip-10-116-169-85 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1)
> May 02 20:04:13 [16154] ip-10-116-169-85 corosync debug [TOTEM ] sendmsg(mcast) failed (non-critical): Operation not permitted (1)
> May 02 20:04:13 [16154] ip-10-116-169-85 corosync warning [MAIN ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly.
>
>
> Thanks in advance
>
> --
> Best Regards,
>
> Radoslaw Garbacz
> XtremeData Incorporation
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org