[Pacemaker] Network outage debugging
Jan Friesse
jfriesse at redhat.com
Wed Nov 13 08:15:31 UTC 2013
Andrew Beekhof wrote:
>
> On 13 Nov 2013, at 11:49 am, Sean Lutner <sean at rentul.net> wrote:
>
>>
>>
>>> On Nov 12, 2013, at 7:33 PM, Andrew Beekhof
>>> <andrew at beekhof.net> wrote:
>>>
>>>
>>>> On 13 Nov 2013, at 11:22 am, Sean Lutner <sean at rentul.net>
>>>> wrote:
>>>>
>>>>
>>>>
>>>>> On Nov 12, 2013, at 6:01 PM, Andrew Beekhof
>>>>> <andrew at beekhof.net> wrote:
>>>>>
>>>>>
>>>>>> On 13 Nov 2013, at 6:10 am, Sean Lutner <sean at rentul.net>
>>>>>> wrote:
>>>>>>
>>>>>> The folks testing the cluster I've been building have run
>>>>>> a script which blocks all traffic except SSH on one node
>>>>>> of the cluster for 15 seconds to mimic a network failure.
>>>>>> During this time, the network being "down" seems to cause
>>>>>> some odd behavior in pacemaker, resulting in it dying.
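For context, a blackout test of this sort is usually done with iptables. The testers' actual script was not posted; this is a hypothetical sketch of the technique, with DRY_RUN=1 (the default) printing each command instead of executing it, since the real thing needs root:

```shell
#!/bin/sh
# Hypothetical sketch of a 15-second network blackout that spares SSH
# (NOT the original poster's script, which was not shared).
DRY_RUN=${DRY_RUN:-1}
run() { [ "$DRY_RUN" = 1 ] && echo "$@" || "$@"; }

run iptables -I INPUT  -p tcp --dport 22 -j ACCEPT
run iptables -I OUTPUT -p tcp --sport 22 -j ACCEPT
run iptables -A INPUT  -j DROP    # blackhole everything else, inbound
run iptables -A OUTPUT -j DROP    # ...and outbound
run sleep 15                      # the simulated outage window
run iptables -F INPUT             # restore connectivity
run iptables -F OUTPUT
```

Note that dropping OUTPUT as well as INPUT also severs corosync's own totem traffic, which is what makes the node lose cluster membership during the test.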
>>>>>>
>>>>>> The cluster is two nodes and running four custom
>>>>>> resources on EC2 instances. The OS is CentOS 6.4 with the
>>>>>> config below:
>>>>>>
>>>>>> I've attached the /var/log/messages and
>>>>>> /var/log/cluster/corosync.log from the time period during
>>>>>> the test. I'm having some difficulty piecing together
>>>>>> what happened and am hoping someone can shed some light
>>>>>> on the problem. Any indication as to why pacemaker is
>>>>>> dying on that node?
>>>>>
>>>>> Because corosync is dying underneath it:
>>>>>
>>>>> Nov 09 14:51:49 [942] ip-10-50-3-251 cib: error: send_ais_text: Sending message 28 via cpg: FAILED (rc=2): Library error: Connection timed out (110)
>>>>> Nov 09 14:51:49 [942] ip-10-50-3-251 cib: error: pcmk_cpg_dispatch: Connection to the CPG API failed: 2
>>>>> Nov 09 14:51:49 [942] ip-10-50-3-251 cib: error: cib_ais_destroy: Corosync connection lost! Exiting.
>>>>> Nov 09 14:51:49 [942] ip-10-50-3-251 cib: info: terminate_cib: cib_ais_destroy: Exiting fast...
>>>>
>>>> Is that the expected behavior?
>>>
>>> It is expected behaviour when corosync dies. Ideally corosync
>>> wouldn't die though.
>>
>> What other debugging can I do to try to find out why corosync
>> died?
>
> There are various logging settings that may help. CC'ing Jan to see
> if he has any suggestions.
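For reference, on a cman-based stack like this one (note `cluster-infrastructure: cman` in the config below), corosync's logging is normally driven from /etc/cluster/cluster.conf rather than corosync.conf. A sketch of raising it to debug level; attribute names follow cman's cluster.conf schema, and the change should be checked with ccs_config_validate before restarting:

```
<logging debug="on" to_logfile="yes"
         logfile="/var/log/cluster/corosync.log"/>
```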
>
If corosync really died, then the corosync-fplay output (captured right
after corosync's death) and a coredump are the most useful things to provide.
Regards,
Honza
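To expand on Honza's suggestion, here is a minimal collection sketch, assuming the corosync 1.x tooling shipped with CentOS 6; the file locations and the CORE_DIR/FPLAY_BIN names are illustrative, not prescriptive:

```shell
#!/bin/sh
# Post-mortem collection sketch for a corosync death.
# CORE_DIR and FPLAY_BIN are illustrative defaults, overridable for testing.
CORE_DIR=${CORE_DIR:-/var/lib/corosync}
FPLAY_BIN=${FPLAY_BIN:-corosync-fplay}

collect() {
    # 1. Dump the flight recorder immediately: the in-memory trace that
    #    corosync-fplay replays does not survive a reboot.
    "$FPLAY_BIN" > /tmp/corosync-fplay.txt 2>&1 || true
    # 2. Look for a coredump; one only appears if 'ulimit -c unlimited'
    #    was in effect before corosync started.
    ls "$CORE_DIR"/core.* 2>/dev/null || echo "no coredump found"
}
collect
```

If a core does exist, opening it with gdb against /usr/sbin/corosync (with the corosync debuginfo packages installed) yields the backtrace developers will typically ask for.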
>>
>> Thanks
>>
>>>
>>>> Is it because the DC was the other node?
>>>
>>> No.
>>>
>>>>
>>>> I did notice that there was an attempted fence operation but
>>>> it didn't look successful.
>>>>
>>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>> [root at ip-10-50-3-122 ~]# pcs config
>>>>>> Corosync Nodes:
>>>>>>
>>>>>> Pacemaker Nodes:
>>>>>>  ip-10-50-3-122 ip-10-50-3-251
>>>>>>
>>>>>> Resources:
>>>>>>  Resource: ClusterEIP_54.215.143.166 (provider=pacemaker type=EIP class=ocf)
>>>>>>   Attributes: first_network_interface_id=eni-e4e0b68c second_network_interface_id=eni-35f9af5d first_private_ip=10.50.3.191 second_private_ip=10.50.3.91 eip=54.215.143.166 alloc_id=eipalloc-376c3c5f interval=5s
>>>>>>   Operations: monitor interval=5s
>>>>>>  Clone: EIP-AND-VARNISH-clone
>>>>>>   Group: EIP-AND-VARNISH
>>>>>>    Resource: Varnish (provider=redhat type=varnish.sh class=ocf)
>>>>>>     Operations: monitor interval=5s
>>>>>>    Resource: Varnishlog (provider=redhat type=varnishlog.sh class=ocf)
>>>>>>     Operations: monitor interval=5s
>>>>>>    Resource: Varnishncsa (provider=redhat type=varnishncsa.sh class=ocf)
>>>>>>     Operations: monitor interval=5s
>>>>>>  Resource: ec2-fencing (type=fence_ec2 class=stonith)
>>>>>>   Attributes: ec2-home=/opt/ec2-api-tools pcmk_host_check=static-list pcmk_host_list=HA01 HA02
>>>>>>   Operations: monitor start-delay=30s interval=0 timeout=150s
>>>>>>
>>>>>> Location Constraints:
>>>>>> Ordering Constraints:
>>>>>>   ClusterEIP_54.215.143.166 then Varnish
>>>>>>   Varnish then Varnishlog
>>>>>>   Varnishlog then Varnishncsa
>>>>>> Colocation Constraints:
>>>>>>   Varnish with ClusterEIP_54.215.143.166
>>>>>>   Varnishlog with Varnish
>>>>>>   Varnishncsa with Varnishlog
>>>>>>
>>>>>> Cluster Properties:
>>>>>>  dc-version: 1.1.8-7.el6-394e906
>>>>>>  cluster-infrastructure: cman
>>>>>>  last-lrm-refresh: 1384196963
>>>>>>  no-quorum-policy: ignore
>>>>>>  stonith-enabled: true
>>>>>>
>>>>>> <net-failure-messages-110913.out><net-failure-corosync-110913.out>
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>>>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>>>>
>>>>>> Project Home: http://www.clusterlabs.org Getting started:
>>>>>> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>>>> Bugs: http://bugs.clusterlabs.org