[Pacemaker] Network outage debugging
Sean Lutner
sean at rentul.net
Wed Nov 13 18:42:07 UTC 2013
On Nov 13, 2013, at 10:24 AM, Jan Friesse <jfriesse at redhat.com> wrote:
> Sean Lutner wrote:
>>
>> On Nov 13, 2013, at 3:15 AM, Jan Friesse <jfriesse at redhat.com> wrote:
>>
>>> Andrew Beekhof wrote:
>>>>
>>>> On 13 Nov 2013, at 11:49 am, Sean Lutner <sean at rentul.net> wrote:
>>>>
>>>>>
>>>>>
>>>>>> On Nov 12, 2013, at 7:33 PM, Andrew Beekhof
>>>>>> <andrew at beekhof.net> wrote:
>>>>>>
>>>>>>
>>>>>>> On 13 Nov 2013, at 11:22 am, Sean Lutner <sean at rentul.net>
>>>>>>> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> On Nov 12, 2013, at 6:01 PM, Andrew Beekhof
>>>>>>>> <andrew at beekhof.net> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>> On 13 Nov 2013, at 6:10 am, Sean Lutner <sean at rentul.net>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> The folks testing the cluster I've been building ran a
>>>>>>>>> script that blocks all traffic except SSH on one node of
>>>>>>>>> the cluster for 15 seconds, to mimic a network failure.
>>>>>>>>> While the network was "down", pacemaker exhibited some odd
>>>>>>>>> behavior that resulted in it dying.
>>>>>>>>>
>>>>>>>>> The cluster has two nodes running four custom resources
>>>>>>>>> on EC2 instances. The OS is CentOS 6.4, with the config
>>>>>>>>> below:
>>>>>>>>>
>>>>>>>>> I've attached the /var/log/messages and
>>>>>>>>> /var/log/cluster/corosync.log from the time period during
>>>>>>>>> the test. I'm having some difficulty piecing together
>>>>>>>>> what happened and am hoping someone can shed some light
>>>>>>>>> on the problem. Any indication why pacemaker is dying on
>>>>>>>>> that node?
>>>>>>>>
>>>>>>>> Because corosync is dying underneath it:
>>>>>>>>
>>>>>>>> Nov 09 14:51:49 [942] ip-10-50-3-251 cib: error: send_ais_text: Sending message 28 via cpg: FAILED (rc=2): Library error: Connection timed out (110)
>>>>>>>> Nov 09 14:51:49 [942] ip-10-50-3-251 cib: error: pcmk_cpg_dispatch: Connection to the CPG API failed: 2
>>>>>>>> Nov 09 14:51:49 [942] ip-10-50-3-251 cib: error: cib_ais_destroy: Corosync connection lost! Exiting.
>>>>>>>> Nov 09 14:51:49 [942] ip-10-50-3-251 cib: info: terminate_cib: cib_ais_destroy: Exiting fast...
>>>>>>>
>>>>>>> Is that the expected behavior?
>>>>>>
>>>>>> It is expected behaviour when corosync dies. Ideally corosync
>>>>>> wouldn't die though.
>>>>>
>>>>> What other debugging can I do to try to find out why corosync
>>>>> died?
>>>>
>>>> There are various logging settings that may help. CC'ing Jan to see
>>>> if he has any suggestions.
>>>>
>>>
>>> If corosync really died, the corosync-fplay output (captured right
>>> after corosync's death) and a coredump are the most useful things.
>>>
>>> Regards,
>>> Honza
>>
>> So the process to collect this would be:
>>
>> - Run the test
>> - Watch the logs for corosync to die
>
>> - Run corosync-fplay and capture the output (will corosync-fplay > file.out suffice?)
>
> Yes. Usually the file is quite large, so gzip/xz is a good idea.
Thanks, will do.
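For reference, here's roughly what I plan to run right after corosync dies (the output path is just my choice):

# corosync-fplay > /var/tmp/fplay.out
# gzip /var/tmp/fplay.out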
>
>> - Capture a core dump from corosync
>>
>> How do I capture the core dump? Is it something that has to be enabled in the /etc/corosync/corosync.conf file before running the tests? I haven't done this before.
>
> This really depends. Do you have abrt enabled? If so, the core is
> processed via abrt. (One way to find out whether abrt is running is to
> look at the kernel.core_pattern sysctl; if abrt is handling cores, the
> value will be something other than the classic "core".)
# sysctl -A |grep core_pattern
kernel.core_pattern = /var/tmp/%e-%t-%s.core
I looked in that directory and there are some core files, but nothing from the day this failure happened. I'm skeptical that one will be created if I run the test again.
Is it accurate to say that whenever corosync dies in the manner seen in the logs, there should be a core file?
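Before the next run I'll also sanity-check the core size limit of the running corosync process, reading /proc directly since the daemon's limit can differ from my shell's (this check is my own addition):

# grep "core file" /proc/$(pidof corosync)/limits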
>
> If you do not have abrt enabled, you must make sure core dumps are
> enabled. When corosync is executed via cman, this should happen
> automatically (the start_global function does ulimit -c unlimited). If
> you are running corosync by itself, create the file
> /etc/default/corosync with the content "ulimit -c unlimited".
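Got it. Since we start corosync via cman that should already be covered, but as a belt-and-braces step I'll create the file anyway, with exactly the content you describe:

# echo "ulimit -c unlimited" > /etc/default/corosync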
>
> Coredumps are stored in /var/lib/corosync/core.* (maybe you already
> have some of them there, so just take a look).
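I'll take a look there after the next test:

# ls -l /var/lib/corosync/core.*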
>
> Now, please install the corosynclib-devel package and follow
> http://stackoverflow.com/questions/5115613/core-dump-file-analysis
Thanks, I'll install that package.
>
> The important part is to execute bt (or, even better, thread apply all
> bt) and send the output of this command.
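Will do. Following that Stack Overflow link, I expect the analysis to look roughly like this (core file path per your note above; the PID suffix is hypothetical):

# gdb /usr/sbin/corosync /var/lib/corosync/core.12345
(gdb) thread apply all bt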
>
> Regards,
> Honza
>
>
>> Thanks
>>
>>>
>>>>>
>>>>> Thanks
>>>>>
>>>>>>
>>>>>>> Is it because the DC was the other node?
>>>>>>
>>>>>> No.
>>>>>>
>>>>>>>
>>>>>>> I did notice that there was an attempted fence operation but
>>>>>>> it didn't look successful.
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> [root at ip-10-50-3-122 ~]# pcs config
>>>>>>>>> Corosync Nodes:
>>>>>>>>>
>>>>>>>>> Pacemaker Nodes:
>>>>>>>>>  ip-10-50-3-122 ip-10-50-3-251
>>>>>>>>>
>>>>>>>>> Resources:
>>>>>>>>>  Resource: ClusterEIP_54.215.143.166 (provider=pacemaker type=EIP class=ocf)
>>>>>>>>>   Attributes: first_network_interface_id=eni-e4e0b68c second_network_interface_id=eni-35f9af5d first_private_ip=10.50.3.191 second_private_ip=10.50.3.91 eip=54.215.143.166 alloc_id=eipalloc-376c3c5f interval=5s
>>>>>>>>>   Operations: monitor interval=5s
>>>>>>>>>  Clone: EIP-AND-VARNISH-clone
>>>>>>>>>   Group: EIP-AND-VARNISH
>>>>>>>>>    Resource: Varnish (provider=redhat type=varnish.sh class=ocf)
>>>>>>>>>     Operations: monitor interval=5s
>>>>>>>>>    Resource: Varnishlog (provider=redhat type=varnishlog.sh class=ocf)
>>>>>>>>>     Operations: monitor interval=5s
>>>>>>>>>    Resource: Varnishncsa (provider=redhat type=varnishncsa.sh class=ocf)
>>>>>>>>>     Operations: monitor interval=5s
>>>>>>>>>  Resource: ec2-fencing (type=fence_ec2 class=stonith)
>>>>>>>>>   Attributes: ec2-home=/opt/ec2-api-tools pcmk_host_check=static-list pcmk_host_list=HA01 HA02
>>>>>>>>>   Operations: monitor start-delay=30s interval=0 timeout=150s
>>>>>>>>>
>>>>>>>>> Location Constraints:
>>>>>>>>> Ordering Constraints:
>>>>>>>>>   ClusterEIP_54.215.143.166 then Varnish
>>>>>>>>>   Varnish then Varnishlog
>>>>>>>>>   Varnishlog then Varnishncsa
>>>>>>>>> Colocation Constraints:
>>>>>>>>>   Varnish with ClusterEIP_54.215.143.166
>>>>>>>>>   Varnishlog with Varnish
>>>>>>>>>   Varnishncsa with Varnishlog
>>>>>>>>>
>>>>>>>>> Cluster Properties:
>>>>>>>>>  dc-version: 1.1.8-7.el6-394e906
>>>>>>>>>  cluster-infrastructure: cman
>>>>>>>>>  last-lrm-refresh: 1384196963
>>>>>>>>>  no-quorum-policy: ignore
>>>>>>>>>  stonith-enabled: true
>>>>>>>>>
>>>>>>>>> <net-failure-messages-110913.out><net-failure-corosync-110913.out>