[Pacemaker] Network outage debugging

Wed Nov 13 15:24:09 UTC 2013

Sean Lutner wrote:
> 
> On Nov 13, 2013, at 3:15 AM, Jan Friesse <jfriesse at redhat.com> wrote:
> 
>> Andrew Beekhof wrote:
>>>
>>> On 13 Nov 2013, at 11:49 am, Sean Lutner <sean at rentul.net> wrote:
>>>
>>>>
>>>>
>>>>> On Nov 12, 2013, at 7:33 PM, Andrew Beekhof
>>>>> <andrew at beekhof.net> wrote:
>>>>>
>>>>>
>>>>>> On 13 Nov 2013, at 11:22 am, Sean Lutner <sean at rentul.net>
>>>>>> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>> On Nov 12, 2013, at 6:01 PM, Andrew Beekhof
>>>>>>> <andrew at beekhof.net> wrote:
>>>>>>>
>>>>>>>
>>>>>>>> On 13 Nov 2013, at 6:10 am, Sean Lutner <sean at rentul.net>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> The folks testing the cluster I've been building have run
>>>>>>>> a script which blocks all traffic except SSH on one node
>>>>>>>> of the cluster for 15 seconds to mimic a network failure.
>>>>>>>> During this time, the network being "down" seems to cause
>>>>>>>> some odd behavior from pacemaker resulting in it dying.
>>>>>>>>
>>>>>>>> The cluster is two nodes and running four custom
>>>>>>>> resources on EC2 instances. The OS is CentOS 6.4 with the
>>>>>>>> config below:
>>>>>>>>
>>>>>>>> I've attached the /var/log/messages and
>>>>>>>> /var/log/cluster/corosync.log from the time period during
>>>>>>>> the test. I'm having some difficulty in piecing together
>>>>>>>> what happened and am hoping someone can shed some light
>>>>>>>> on the problem. Any indications why pacemaker is dying on
>>>>>>>> that node?
>>>>>>>
>>>>>>> Because corosync is dying underneath it:
>>>>>>>
>>>>>>> Nov 09 14:51:49 [942] ip-10-50-3-251        cib:    error: send_ais_text:    Sending message 28 via cpg: FAILED (rc=2): Library error: Connection timed out (110)
>>>>>>> Nov 09 14:51:49 [942] ip-10-50-3-251        cib:    error: pcmk_cpg_dispatch:    Connection to the CPG API failed: 2
>>>>>>> Nov 09 14:51:49 [942] ip-10-50-3-251        cib:    error: cib_ais_destroy:    Corosync connection lost!  Exiting.
>>>>>>> Nov 09 14:51:49 [942] ip-10-50-3-251        cib:     info: terminate_cib:    cib_ais_destroy: Exiting fast...
>>>>>>
>>>>>> Is that the expected behavior?
>>>>>
>>>>> It is expected behaviour when corosync dies.  Ideally corosync
>>>>> wouldn't die though.
>>>>
>>>> What other debugging can I do to try to find out why corosync
>>>> died?
>>>
>>> There are various logging settings that may help. CC'ing Jan to see
>>> if he has any suggestions.
>>>
>>
>> If corosync really died, the corosync-fplay output (taken right after
>> corosync's death) and a core dump are the most useful.
>>
>> Regards,
>>  Honza
> 
> So the process to collect this would be:
> 
> - Run the test
> - Watch the logs for corosync to die
> - Run corosync-fplay and capture the output (will corosync-fplay > file.out suffice?)

Yes. The file is usually quite large, so compressing it with gzip or xz
is a good idea.
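
A minimal sketch of the capture step, compressing the output on the fly
instead of writing a large intermediate file. The function name, the
default output file name and the FPLAY_CMD override are illustrative
assumptions, not part of corosync:

```shell
# Hypothetical helper: dump the corosync flight recorder and gzip it.
# $1        output file (assumed default: corosync-fplay.out.gz)
# FPLAY_CMD assumed override, useful for a dry run on a box without corosync
capture_fplay() {
    out="${1:-corosync-fplay.out.gz}"
    # corosync-fplay writes the decoded flight data to stdout
    ${FPLAY_CMD:-corosync-fplay} | gzip -9 > "$out"
}

# Usage, right after corosync has died:
#   capture_fplay /root/corosync-fplay.out.gz
```

Swapping gzip for xz should give a smaller file at the cost of more CPU
time.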

> - Capture a core dump from corosync 
> 
> How do I capture the core dump? Is it something that has to be enabled in the /etc/corosync/corosync.conf file first and then run the tests? I've not done this in the past.

That really depends. Do you have abrt enabled? If so, the core is
processed by abrt. (One way to find out whether abrt is handling cores is
to look at the kernel.core_pattern sysctl: it will contain something other
than the classic value "core".)
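
The check can be sketched as a small shell function. The function name
and the three labels it prints are made up for illustration; the only
assumption about the kernel is that a core_pattern starting with "|"
means cores are piped to a helper program, which is how abrt hooks in:

```shell
# Hypothetical helper: classify a kernel.core_pattern value.
#   abrt  - cores are piped to an abrt helper
#   piped - cores are piped to some other program
#   file  - cores are written as plain files (e.g. the classic "core")
core_pattern_kind() {
    case "$1" in
        "|"*abrt*) echo "abrt" ;;
        "|"*)      echo "piped" ;;
        *)         echo "file" ;;
    esac
}

# Usage on a live system:
#   core_pattern_kind "$(sysctl -n kernel.core_pattern)"
```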

If you do not have abrt enabled, you must make sure core dumps are
enabled. When corosync is started via cman, this should happen
automatically (the start_global function runs ulimit -c unlimited). If you
are running corosync on its own, create the file /etc/default/corosync
with the content "ulimit -c unlimited".
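
For the standalone case, a rough sketch of creating that file without
adding the line twice. The function name is hypothetical, and the optional
path argument exists only so the helper can be tried against a scratch
file first:

```shell
# Hypothetical helper: ensure the defaults file raises the core size limit.
# $1  file to edit (defaults to the path named above)
enable_corosync_cores() {
    conf="${1:-/etc/default/corosync}"
    # Append the line only if an identical line is not already present
    grep -qxF 'ulimit -c unlimited' "$conf" 2>/dev/null ||
        echo 'ulimit -c unlimited' >> "$conf"
}

# Usage (as root), followed by a corosync restart:
#   enable_corosync_cores
```

After the restart, the "Max core file size" row in
/proc/<corosync pid>/limits should read "unlimited" if the setting took
effect.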

Core dumps are stored in /var/lib/corosync/core.* (you may already have
some there, so take a look).

Now, please install the corosynclib-devel package and follow
http://stackoverflow.com/questions/5115613/core-dump-file-analysis

The important part is to run bt (or, even better, thread apply all bt) in
gdb and send the output of that command.
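
A sketch of collecting that backtrace non-interactively with gdb's batch
mode. The function name and the GDB_CMD override are illustrative, and
/usr/sbin/corosync is an assumed location for the binary; pass the real
path as the second argument if it differs:

```shell
# Hypothetical helper: print a backtrace of every thread in a core file.
# $1       core file, e.g. one of /var/lib/corosync/core.*
# $2       binary that produced it (assumed default: /usr/sbin/corosync)
# GDB_CMD  assumed override, useful for a dry run without gdb installed
corosync_backtrace() {
    core="$1"
    bin="${2:-/usr/sbin/corosync}"
    # -batch exits after running the -ex command instead of opening a prompt
    ${GDB_CMD:-gdb} -batch -ex 'thread apply all bt' "$bin" "$core"
}

# Usage:
#   corosync_backtrace /var/lib/corosync/core.<pid> > corosync-bt.txt
```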

Regards,
  Honza

> Thanks
> 
>>
>>>>
>>>> Thanks
>>>>
>>>>>
>>>>>> Is it because the DC was the other node?
>>>>>
>>>>> No.
>>>>>
>>>>>>
>>>>>> I did notice that there was an attempted fence operation but
>>>>>> it didn't look successful.
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> [root@ip-10-50-3-122 ~]# pcs config
>>>>>>>> Corosync Nodes:
>>>>>>>>
>>>>>>>> Pacemaker Nodes:
>>>>>>>>  ip-10-50-3-122 ip-10-50-3-251
>>>>>>>>
>>>>>>>> Resources:
>>>>>>>>  Resource: ClusterEIP_54.215.143.166 (provider=pacemaker type=EIP class=ocf)
>>>>>>>>   Attributes: first_network_interface_id=eni-e4e0b68c second_network_interface_id=eni-35f9af5d first_private_ip=10.50.3.191 second_private_ip=10.50.3.91 eip=54.215.143.166 alloc_id=eipalloc-376c3c5f interval=5s
>>>>>>>>   Operations: monitor interval=5s
>>>>>>>>  Clone: EIP-AND-VARNISH-clone
>>>>>>>>   Group: EIP-AND-VARNISH
>>>>>>>>    Resource: Varnish (provider=redhat type=varnish.sh class=ocf)
>>>>>>>>     Operations: monitor interval=5s
>>>>>>>>    Resource: Varnishlog (provider=redhat type=varnishlog.sh class=ocf)
>>>>>>>>     Operations: monitor interval=5s
>>>>>>>>    Resource: Varnishncsa (provider=redhat type=varnishncsa.sh class=ocf)
>>>>>>>>     Operations: monitor interval=5s
>>>>>>>>  Resource: ec2-fencing (type=fence_ec2 class=stonith)
>>>>>>>>   Attributes: ec2-home=/opt/ec2-api-tools pcmk_host_check=static-list pcmk_host_list=HA01 HA02
>>>>>>>>   Operations: monitor start-delay=30s interval=0 timeout=150s
>>>>>>>>
>>>>>>>> Location Constraints:
>>>>>>>> Ordering Constraints:
>>>>>>>>  ClusterEIP_54.215.143.166 then Varnish
>>>>>>>>  Varnish then Varnishlog
>>>>>>>>  Varnishlog then Varnishncsa
>>>>>>>> Colocation Constraints:
>>>>>>>>  Varnish with ClusterEIP_54.215.143.166
>>>>>>>>  Varnishlog with Varnish
>>>>>>>>  Varnishncsa with Varnishlog
>>>>>>>>
>>>>>>>> Cluster Properties:
>>>>>>>>  dc-version: 1.1.8-7.el6-394e906
>>>>>>>>  cluster-infrastructure: cman
>>>>>>>>  last-lrm-refresh: 1384196963
>>>>>>>>  no-quorum-policy: ignore
>>>>>>>>  stonith-enabled: true
>>>>>>>>
>>>>>>>> <net-failure-messages-110913.out><net-failure-corosync-110913.out>
>>>>>>>>
>>>>>>>>
>> _______________________________________________
>>>>>>>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org 
>>>>>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>>>>>>
>>>>>>>> Project Home: http://www.clusterlabs.org Getting started:
>>>>>>>> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf 
>>>>>>>> Bugs: http://bugs.clusterlabs.org