[Pacemaker] Network outage debugging
Sean Lutner
sean at rentul.net
Wed Nov 13 14:24:46 UTC 2013
On Nov 13, 2013, at 3:15 AM, Jan Friesse <jfriesse at redhat.com> wrote:
> Andrew Beekhof wrote:
>>
>> On 13 Nov 2013, at 11:49 am, Sean Lutner <sean at rentul.net> wrote:
>>
>>>
>>>
>>>> On Nov 12, 2013, at 7:33 PM, Andrew Beekhof
>>>> <andrew at beekhof.net> wrote:
>>>>
>>>>
>>>>> On 13 Nov 2013, at 11:22 am, Sean Lutner <sean at rentul.net>
>>>>> wrote:
>>>>>
>>>>>
>>>>>
>>>>>> On Nov 12, 2013, at 6:01 PM, Andrew Beekhof
>>>>>> <andrew at beekhof.net> wrote:
>>>>>>
>>>>>>
>>>>>>> On 13 Nov 2013, at 6:10 am, Sean Lutner <sean at rentul.net>
>>>>>>> wrote:
>>>>>>>
>>>>>>> The folks testing the cluster I've been building have run
>>>>>>> a script which blocks all traffic except SSH on one node
>>>>>>> of the cluster for 15 seconds to mimic a network failure.
>>>>>>> During this time, the network being "down" seems to cause
>>>>>>> some odd behavior from pacemaker resulting in it dying.
>>>>>>>
>>>>>>> The cluster is two nodes and running four custom
>>>>>>> resources on EC2 instances. The OS is CentOS 6.4 with the
>>>>>>> config below:
>>>>>>>
>>>>>>> I've attached the /var/log/messages and
>>>>>>> /var/log/cluster/corosync.log from the time period during
>>>>>>> the test. I'm having some difficulty piecing together
>>>>>>> what happened and am hoping someone can shed some light
>>>>>>> on the problem. Any indications why pacemaker is dying on
>>>>>>> that node?
>>>>>>
>>>>>> Because corosync is dying underneath it:
>>>>>>
>>>>>> Nov 09 14:51:49 [942] ip-10-50-3-251 cib: error: send_ais_text: Sending message 28 via cpg: FAILED (rc=2): Library error: Connection timed out (110)
>>>>>> Nov 09 14:51:49 [942] ip-10-50-3-251 cib: error: pcmk_cpg_dispatch: Connection to the CPG API failed: 2
>>>>>> Nov 09 14:51:49 [942] ip-10-50-3-251 cib: error: cib_ais_destroy: Corosync connection lost! Exiting.
>>>>>> Nov 09 14:51:49 [942] ip-10-50-3-251 cib: info: terminate_cib: cib_ais_destroy: Exiting fast...
>>>>>
>>>>> Is that the expected behavior?
>>>>
>>>> It is expected behaviour when corosync dies. Ideally corosync
>>>> wouldn't die though.
>>>
>>> What other debugging can I do to try to find out why corosync
>>> died?
>>
>> There are various logging settings that may help. CC'ing Jan to see
>> if he has any suggestions.
>>
>
> If corosync really died, the corosync-fplay output (taken right after
> corosync's death) and a coredump are the most useful.
>
> Regards,
> Honza
So the process to collect this would be:
- Run the test
- Watch the logs for corosync to die
- Run corosync-fplay and capture the output (will corosync-fplay > file.out suffice?)
- Capture a core dump from corosync
How do I capture the core dump? Is it something that has to be enabled in /etc/corosync/corosync.conf before running the tests? I've not done this before.
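For the "watch the logs" and "capture the output" steps above, a rough Python sketch might look like the following. It assumes the cib error messages quoted earlier in this thread are a good enough sign that corosync has gone away, and that corosync-fplay is on the PATH. The names LOSS_MARKERS, find_corosync_loss and capture_output are made up for illustration, and the core dump step is not covered.

```python
import subprocess

# Assumption: these substrings, taken from the cib log lines quoted earlier
# in this thread, indicate that the connection to corosync was lost.
LOSS_MARKERS = (
    "Corosync connection lost",
    "Connection to the CPG API failed",
)


def find_corosync_loss(lines):
    """Return the first line containing a loss marker, or None if there is none."""
    for line in lines:
        if any(marker in line for marker in LOSS_MARKERS):
            return line.rstrip("\n")
    return None


def capture_output(command, out_path):
    """Run command, write its stdout and stderr to out_path, return its exit code."""
    with open(out_path, "wb") as out:
        result = subprocess.run(command, stdout=out, stderr=subprocess.STDOUT)
    return result.returncode


# Hypothetical usage on the affected node (paths are examples only):
#
#   with open("/var/log/cluster/corosync.log", errors="replace") as log:
#       if find_corosync_loss(log) is not None:
#           capture_output(["corosync-fplay"], "file.out")
```

With corosync installed, capture_output(["corosync-fplay"], "file.out") should be roughly equivalent to corosync-fplay > file.out, except that stderr is kept in the file as well.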
Thanks
>
>>>
>>> Thanks
>>>
>>>>
>>>>> Is it because the DC was the other node?
>>>>
>>>> No.
>>>>
>>>>>
>>>>> I did notice that there was an attempted fence operation but
>>>>> it didn't look successful.
>>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> [root at ip-10-50-3-122 ~]# pcs config
>>>>>>> Corosync Nodes:
>>>>>>>
>>>>>>> Pacemaker Nodes:
>>>>>>>  ip-10-50-3-122 ip-10-50-3-251
>>>>>>>
>>>>>>> Resources:
>>>>>>>  Resource: ClusterEIP_54.215.143.166 (provider=pacemaker type=EIP class=ocf)
>>>>>>>   Attributes: first_network_interface_id=eni-e4e0b68c second_network_interface_id=eni-35f9af5d first_private_ip=10.50.3.191 second_private_ip=10.50.3.91 eip=54.215.143.166 alloc_id=eipalloc-376c3c5f interval=5s
>>>>>>>   Operations: monitor interval=5s
>>>>>>>  Clone: EIP-AND-VARNISH-clone
>>>>>>>   Group: EIP-AND-VARNISH
>>>>>>>    Resource: Varnish (provider=redhat type=varnish.sh class=ocf)
>>>>>>>     Operations: monitor interval=5s
>>>>>>>    Resource: Varnishlog (provider=redhat type=varnishlog.sh class=ocf)
>>>>>>>     Operations: monitor interval=5s
>>>>>>>    Resource: Varnishncsa (provider=redhat type=varnishncsa.sh class=ocf)
>>>>>>>     Operations: monitor interval=5s
>>>>>>>  Resource: ec2-fencing (type=fence_ec2 class=stonith)
>>>>>>>   Attributes: ec2-home=/opt/ec2-api-tools pcmk_host_check=static-list pcmk_host_list=HA01 HA02
>>>>>>>   Operations: monitor start-delay=30s interval=0 timeout=150s
>>>>>>>
>>>>>>> Location Constraints:
>>>>>>> Ordering Constraints:
>>>>>>>   ClusterEIP_54.215.143.166 then Varnish
>>>>>>>   Varnish then Varnishlog
>>>>>>>   Varnishlog then Varnishncsa
>>>>>>> Colocation Constraints:
>>>>>>>   Varnish with ClusterEIP_54.215.143.166
>>>>>>>   Varnishlog with Varnish
>>>>>>>   Varnishncsa with Varnishlog
>>>>>>>
>>>>>>> Cluster Properties:
>>>>>>>  dc-version: 1.1.8-7.el6-394e906
>>>>>>>  cluster-infrastructure: cman
>>>>>>>  last-lrm-refresh: 1384196963
>>>>>>>  no-quorum-policy: ignore
>>>>>>>  stonith-enabled: true
>>>>>>>
>>>>>>> <net-failure-messages-110913.out><net-failure-corosync-110913.out>
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>>>>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>>>>>
>>>>>>> Project Home: http://www.clusterlabs.org
>>>>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>>>>> Bugs: http://bugs.clusterlabs.org