[Pacemaker] Network outage debugging
Sean Lutner
sean at rentul.net
Wed Nov 13 18:42:07 UTC 2013
On Nov 13, 2013, at 10:24 AM, Jan Friesse <jfriesse at redhat.com> wrote:
> Sean Lutner wrote:
>>
>> On Nov 13, 2013, at 3:15 AM, Jan Friesse <jfriesse at redhat.com> wrote:
>>
>>> Andrew Beekhof wrote:
>>>>
>>>> On 13 Nov 2013, at 11:49 am, Sean Lutner <sean at rentul.net> wrote:
>>>>
>>>>>
>>>>>
>>>>>> On Nov 12, 2013, at 7:33 PM, Andrew Beekhof
>>>>>> <andrew at beekhof.net> wrote:
>>>>>>
>>>>>>
>>>>>>> On 13 Nov 2013, at 11:22 am, Sean Lutner <sean at rentul.net>
>>>>>>> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> On Nov 12, 2013, at 6:01 PM, Andrew Beekhof
>>>>>>>> <andrew at beekhof.net> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>> On 13 Nov 2013, at 6:10 am, Sean Lutner <sean at rentul.net>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> The folks testing the cluster I've been building ran a
>>>>>>>>> script that blocks all traffic except SSH on one node of
>>>>>>>>> the cluster for 15 seconds, to mimic a network failure.
>>>>>>>>> While the network was "down", pacemaker exhibited some odd
>>>>>>>>> behavior that resulted in it dying.
>>>>>>>>>
>>>>>>>>> The cluster has two nodes running four custom resources
>>>>>>>>> on EC2 instances. The OS is CentOS 6.4, with the config
>>>>>>>>> below:
>>>>>>>>>
>>>>>>>>> I've attached the /var/log/messages and
>>>>>>>>> /var/log/cluster/corosync.log from the time period during
>>>>>>>>> the test. I'm having some difficulty piecing together
>>>>>>>>> what happened and am hoping someone can shed some light
>>>>>>>>> on the problem. Any indication why pacemaker is dying on
>>>>>>>>> that node?
>>>>>>>>
>>>>>>>> Because corosync is dying underneath it:
>>>>>>>>
>>>>>>>> Nov 09 14:51:49 [942] ip-10-50-3-251 cib: error: send_ais_text: Sending message 28 via cpg: FAILED (rc=2): Library error: Connection timed out (110)
>>>>>>>> Nov 09 14:51:49 [942] ip-10-50-3-251 cib: error: pcmk_cpg_dispatch: Connection to the CPG API failed: 2
>>>>>>>> Nov 09 14:51:49 [942] ip-10-50-3-251 cib: error: cib_ais_destroy: Corosync connection lost! Exiting.
>>>>>>>> Nov 09 14:51:49 [942] ip-10-50-3-251 cib: info: terminate_cib: cib_ais_destroy: Exiting fast...
>>>>>>>
>>>>>>> Is that the expected behavior?
>>>>>>
>>>>>> It is expected behaviour when corosync dies. Ideally corosync
>>>>>> wouldn't die though.
>>>>>
>>>>> What other debugging can I do to try to find out why corosync
>>>>> died?
>>>>
>>>> There are various logging settings that may help. CC'ing Jan to see
>>>> if he has any suggestions.
>>>>
>>>
>>> If corosync really died, the corosync-fplay output (captured right
>>> after corosync's death) and a coredump are the most useful things.
>>>
>>> Regards,
>>> Honza
>>
>> So the process to collect this would be:
>>
>> - Run the test
>> - Watch the logs for corosync to die
>
>> - Run corosync-fplay and capture the output (will corosync-fplay > file.out suffice?)
>
> Yes. Usually the file is quite large, so gzip/xz is a good idea.
Thanks, will do.
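For reference, here's roughly what I plan to run right after corosync dies (the output path is just my choice):

# corosync-fplay > /var/tmp/fplay.out
# gzip /var/tmp/fplay.out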
>
>> - Capture a core dump from corosync
>>
>> How do I capture the core dump? Is it something that has to be enabled in the /etc/corosync/corosync.conf file before running the tests? I haven't done this before.
>
> This really depends. Do you have abrt enabled? If so, the core is
> processed via abrt. (One way to find out whether abrt is running is to
> look at the kernel.core_pattern sysctl; if abrt is handling cores, the
> value will be something other than the classic "core".)
# sysctl -A |grep core_pattern
kernel.core_pattern = /var/tmp/%e-%t-%s.core
I looked in that directory and there are some core files, but nothing from the day this failure happened. I'm skeptical that one will be created if I run the test again.
Is it accurate to say that whenever corosync dies in the manner seen in the logs, there should be a core file?
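Before the next run I'll also sanity-check the core size limit of the running corosync process, reading /proc directly since the daemon's limit can differ from my shell's (this check is my own addition):

# grep "core file" /proc/$(pidof corosync)/limits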
>
> If you do not have abrt enabled, you must make sure core dumps are
> enabled. When corosync is executed via cman, this should happen
> automatically (the start_global function does ulimit -c unlimited). If
> you are running corosync by itself, create the file
> /etc/default/corosync with the content "ulimit -c unlimited".
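Got it. Since we start corosync via cman that should already be covered, but as a belt-and-braces step I'll create the file anyway, with exactly the content you describe:

# echo "ulimit -c unlimited" > /etc/default/corosync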
>
> Coredumps are stored in /var/lib/corosync/core.* (maybe you already
> have some of them there, so just take a look).
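I'll take a look there after the next test:

# ls -l /var/lib/corosync/core.*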
>
> Now, please install the corosynclib-devel package and follow
> http://stackoverflow.com/questions/5115613/core-dump-file-analysis
Thanks, I'll install that package.
>
> The important part is to execute bt (or, even better, thread apply all
> bt) and send the output of this command.
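Will do. Following that Stack Overflow link, I expect the analysis to look roughly like this (core file path per your note above; the PID suffix is hypothetical):

# gdb /usr/sbin/corosync /var/lib/corosync/core.12345
(gdb) thread apply all bt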
>
> Regards,
> Honza
>
>
>> Thanks
>>
>>>
>>>>>
>>>>> Thanks
>>>>>
>>>>>>
>>>>>>> Is it because the DC was the other node?
>>>>>>
>>>>>> No.
>>>>>>
>>>>>>>
>>>>>>> I did notice that there was an attempted fence operation but
>>>>>>> it didn't look successful.
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> [root at ip-10-50-3-122 ~]# pcs config
>>>>>>>>> Corosync Nodes:
>>>>>>>>>
>>>>>>>>> Pacemaker Nodes:
>>>>>>>>>  ip-10-50-3-122 ip-10-50-3-251
>>>>>>>>>
>>>>>>>>> Resources:
>>>>>>>>>  Resource: ClusterEIP_54.215.143.166 (provider=pacemaker type=EIP class=ocf)
>>>>>>>>>   Attributes: first_network_interface_id=eni-e4e0b68c second_network_interface_id=eni-35f9af5d first_private_ip=10.50.3.191 second_private_ip=10.50.3.91 eip=54.215.143.166 alloc_id=eipalloc-376c3c5f interval=5s
>>>>>>>>>   Operations: monitor interval=5s
>>>>>>>>>  Clone: EIP-AND-VARNISH-clone
>>>>>>>>>   Group: EIP-AND-VARNISH
>>>>>>>>>    Resource: Varnish (provider=redhat type=varnish.sh class=ocf)
>>>>>>>>>     Operations: monitor interval=5s
>>>>>>>>>    Resource: Varnishlog (provider=redhat type=varnishlog.sh class=ocf)
>>>>>>>>>     Operations: monitor interval=5s
>>>>>>>>>    Resource: Varnishncsa (provider=redhat type=varnishncsa.sh class=ocf)
>>>>>>>>>     Operations: monitor interval=5s
>>>>>>>>>  Resource: ec2-fencing (type=fence_ec2 class=stonith)
>>>>>>>>>   Attributes: ec2-home=/opt/ec2-api-tools pcmk_host_check=static-list pcmk_host_list=HA01 HA02
>>>>>>>>>   Operations: monitor start-delay=30s interval=0 timeout=150s
>>>>>>>>>
>>>>>>>>> Location Constraints:
>>>>>>>>> Ordering Constraints:
>>>>>>>>>   ClusterEIP_54.215.143.166 then Varnish
>>>>>>>>>   Varnish then Varnishlog
>>>>>>>>>   Varnishlog then Varnishncsa
>>>>>>>>> Colocation Constraints:
>>>>>>>>>   Varnish with ClusterEIP_54.215.143.166
>>>>>>>>>   Varnishlog with Varnish
>>>>>>>>>   Varnishncsa with Varnishlog
>>>>>>>>>
>>>>>>>>> Cluster Properties:
>>>>>>>>>  dc-version: 1.1.8-7.el6-394e906
>>>>>>>>>  cluster-infrastructure: cman
>>>>>>>>>  last-lrm-refresh: 1384196963
>>>>>>>>>  no-quorum-policy: ignore
>>>>>>>>>  stonith-enabled: true
>>>>>>>>>
>>>>>>>>> <net-failure-messages-110913.out><net-failure-corosync-110913.out>