[Pacemaker] faq / howto needed for cib troubleshooting

Fri Dec 9 00:28:59 CET 2011

On Fri, Nov 25, 2011 at 8:44 AM, Attila Megyeri
<amegyeri at minerva-soft.com> wrote:
> Hi Gents,
>
> I see from time to time that you are asking for "cibadmin -Ql" type outputs to help people troubleshoot their problems.
>
> Currently I have an issue promoting an MS resource (the PSQL issue in the previous mail) - and I would like to start troubleshooting the problem, but did not find any howtos or documentation on this topic.
> Could you provide me any details on how to troubleshoot cib states?

Start with crm_mon -o
Then check what crm_simulate -L says.
Try adding additional -V arguments and grepping for your resource name.
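
Roughly, that sequence could look like the sketch below. The resource name msPostgresql comes from your configuration; the filter_rsc helper and the number of -V flags are only illustrative choices, and the -s (show scores) run is an extra I find useful rather than a required step.

```shell
#!/bin/sh
# Resource name taken from the quoted configuration; adjust to your own.
RSC="${RSC:-msPostgresql}"

# Hypothetical helper: keep only the lines on stdin that mention the resource.
filter_rsc() {
    grep -- "$1"
}

# Only query the cluster if the Pacemaker tools are installed on this node.
if command -v crm_mon >/dev/null 2>&1; then
    # One-shot status, including the operation history of each resource.
    crm_mon -1 -o
fi

if command -v crm_simulate >/dev/null 2>&1; then
    # What the policy engine would do with the live CIB.
    crm_simulate -L
    # The same run with allocation scores printed.
    crm_simulate -s -L
    # More verbosity, filtered for the resource of interest.
    crm_simulate -L -VVV 2>&1 | filter_rsc "$RSC" \
        || echo "no mention of $RSC in the verbose output"
fi
```

Add or remove -V flags until the output is detailed enough to show why the promote is or is not scheduled.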

> My current issue is that I have a MS resource that is started in slave/slave mode, and the "promote" is never even called by the cib. I'd like to start the research but have no idea how to do it.

Are you sure the promote doesn't happen?  No mention of it in the logs?
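
A quick way to check is to grep the logs for the primitive's name together with "promote". This is only a sketch: the log path below is an assumption (it depends on how corosync and syslog are configured on your nodes), and find_promote is a made-up helper name.

```shell
#!/bin/sh
# Assumption: cluster messages end up in syslog; adjust to your logging setup.
LOGFILE="${LOGFILE:-/var/log/syslog}"
# Primitive name taken from the quoted configuration.
RSC="${RSC:-postgresql}"

# Hypothetical helper: lines on stdin that mention both the resource and "promote".
find_promote() {
    grep -- "$1" | grep -i -- "promote"
}

if [ -r "$LOGFILE" ]; then
    find_promote "$RSC" < "$LOGFILE" \
        || echo "no promote mentioned for $RSC in $LOGFILE"
fi
```

Run it on both nodes, since the promote would only be logged on the node where it was attempted.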

>
> I have read the Pacemaker doc, as well as the Cluster from Scratch doc, but there are no troubleshooting hints.
>
> Thank you in advance,
>
> Attila
>
> -----Original Message-----
> From: Attila Megyeri [mailto:amegyeri at minerva-soft.com]
> Sent: 2011. november 23. 16:53
> To: The Pacemaker cluster resource manager
> Subject: Re: [Pacemaker] Postgresql streaming replication failover - RA needed
>
> Hi Takatoshi, All,
>
> Thanks for your reply.
> I see that you have invested significant effort in the development of the RA. I spent the last day trying to set up the RA, but without much success.
>
> My infrastructure is very similar to yours, except for the fact that currently I am testing with a single network adapter.
>
> Replication works nicely when I start the databases manually, not using corosync.
>
> When I try to start using corosync, I see that the ping resources start normally, but the msPostgresql starts on both nodes in slave mode, and I see "HS:alone"
>
> In the Wiki you state that if I start on a single node only, PSQL should start in Master mode (PRI), but this is not the case.
>
> The recovery.conf file is created immediately, and from the logs I see no attempt at all to promote the node.
> In the postgres logs I see that node1, which is supposed to be a master, tries to connect to the vip-rep IP address, which is NOT brought up, because it depends on the Master role...
>
> Do you have any idea?
>
>
> My environment:
> Debian Squeeze, with backported pacemaker (Version: 1.1.5) - official pacemaker in debian is rather old and buggy
> Postgres 9.1, streaming replication, sync mode
> Node1: psql1, 10.12.1.21
> Node2: psql2, 10.12.1.22
>
> Crm config:
>
> node psql1 \
>        attributes standby="off"
> node psql2 \
>        attributes standby="off"
> primitive pingCheck ocf:pacemaker:ping \
>        params name="default_ping_set" host_list="10.12.1.1" multiplier="100" \
>        op start interval="0s" timeout="60s" on-fail="restart" \
>        op monitor interval="10s" timeout="60s" on-fail="restart" \
>        op stop interval="0s" timeout="60s" on-fail="ignore"
> primitive postgresql ocf:heartbeat:pgsql \
>        params pgctl="/usr/lib/postgresql/9.1/bin/pg_ctl" psql="/usr/bin/psql" pgdata="/var/lib/postgresql/9.1/main" config="/etc/postgresql/9.1/main/postgresql.conf" pgctldata="/usr/lib/postgresql/9.1/bin/pg_controldata" rep_mode="sync" node_list="psql1 psql2" restore_command="cp /var/lib/postgresql/9.1/main/pg_archive/%f %p" master_ip="10.12.1.28" \
>        op start interval="0s" timeout="60s" on-fail="restart" \
>        op monitor interval="7s" timeout="60s" on-fail="restart" \
>        op monitor interval="2s" role="Master" timeout="60s" on-fail="restart" \
>        op promote interval="0s" timeout="60s" on-fail="restart" \
>        op demote interval="0s" timeout="60s" on-fail="block" \
>        op stop interval="0s" timeout="60s" on-fail="block" \
>        op notify interval="0s" timeout="60s"
> primitive vip-master ocf:heartbeat:IPaddr2 \
>        params ip="10.12.1.20" nic="eth0" cidr_netmask="24" \
>        op start interval="0s" timeout="60s" on-fail="restart" \
>        op monitor interval="10s" timeout="60s" on-fail="restart" \
>        op stop interval="0s" timeout="60s" on-fail="block" \
>        meta target-role="Started"
> primitive vip-rep ocf:heartbeat:IPaddr2 \
>        params ip="10.12.1.28" nic="eth0" cidr_netmask="24" \
>        op start interval="0s" timeout="60s" on-fail="restart" \
>        op monitor interval="10s" timeout="60s" on-fail="restart" \
>        op stop interval="0s" timeout="60s" on-fail="block" \
>        meta target-role="Started"
> primitive vip-slave ocf:heartbeat:IPaddr2 \
>        params ip="10.12.1.27" nic="eth0" cidr_netmask="24" \
>        meta resource-stickiness="1" \
>        op start interval="0s" timeout="60s" on-fail="restart" \
>        op monitor interval="10s" timeout="60s" on-fail="restart" \
>        op stop interval="0s" timeout="60s" on-fail="block"
> group master-group vip-master vip-rep
> ms msPostgresql postgresql \
>        meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" target-role="Master"
> clone clnPingCheck pingCheck
> location rsc_location-1 vip-slave \
>        rule $id="rsc_location-1-rule" 200: pgsql-status eq HS:sync \
>        rule $id="rsc_location-1-rule-0" 100: pgsql-status eq PRI \
>        rule $id="rsc_location-1-rule-1" -inf: not_defined pgsql-status \
>        rule $id="rsc_location-1-rule-2" -inf: pgsql-status ne HS:sync and pgsql-status ne PRI
> location rsc_location-2 msPostgresql \
>        rule $id="rsc_location-2-rule" $role="master" 200: #uname eq psql1 \
>        rule $id="rsc_location-2-rule-0" $role="master" 100: #uname eq psql2 \
>        rule $id="rsc_location-2-rule-1" $role="master" -inf: defined fail-count-vip-master \
>        rule $id="rsc_location-2-rule-2" $role="master" -inf: defined fail-count-vip-rep \
>        rule $id="rsc_location-2-rule-3" -inf: not_defined default_ping_set or default_ping_set lt 100
> colocation rsc_colocation-1 inf: msPostgresql clnPingCheck
> colocation rsc_colocation-2 inf: master-group msPostgresql:Master
> order rsc_order-1 0: clnPingCheck msPostgresql
> order rsc_order-2 0: msPostgresql:promote master-group:start symmetrical=false
> order rsc_order-3 0: msPostgresql:demote master-group:stop symmetrical=false
> property $id="cib-bootstrap-options" \
>        dc-version="1.1.5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f" \
>        cluster-infrastructure="openais" \
>        expected-quorum-votes="2" \
>        stonith-enabled="false" \
>        no-quorum-policy="ignore"
> rsc_defaults $id="rsc-options" \
>        resource-stickiness="INFINITY" \
>        migration-threshold="1"
>
>
>
> Regards,
> Attila
>
>
>
> -----Original Message-----
> From: Takatoshi MATSUO [mailto:matsuo.tak at gmail.com]
> Sent: 2011. november 17. 8:04
> To: The Pacemaker cluster resource manager
> Subject: Re: [Pacemaker] Postgresql streaming replication failover - RA needed
>
> Hi All,
>
> I created an RA for PostgreSQL 9.1 Streaming Replication based on pgsql.
>
> RA
>  https://github.com/t-matsuo/resource-agents/blob/pgsql91/heartbeat/pgsql
> Documents
>  https://github.com/t-matsuo/resource-agents/wiki
>
> It is almost completely changed from the previous patch:
>  http://lists.linux-ha.org/pipermail/linux-ha-dev/2011-February/018193.html
> It creates recovery.conf and promotes PostgreSQL automatically.
> Additionally, it can switch between synchronous and asynchronous replication automatically.
>
> Please try them and comment.
>
> Regards,
> Takatoshi MATSUO
>
> 2011/11/17 Serge Dubrouski <sergeyfd at gmail.com>:
>>
>>
>> On Wed, Nov 16, 2011 at 12:55 PM, Attila Megyeri
>> <amegyeri at minerva-soft.com>
>> wrote:
>>>
>>> Hi Florian,
>>>
>>> -----Original Message-----
>>> From: Florian Haas [mailto:florian at hastexo.com]
>>> Sent: 2011. november 16. 11:49
>>> To: The Pacemaker cluster resource manager
>>> Subject: Re: [Pacemaker] Postgresql streaming replication failover -
>>> RA needed
>>>
>>> Hi Attila,
>>>
>>> On 2011-11-16 10:27, Attila Megyeri wrote:
>>> > Hi All,
>>> >
>>> >
>>> >
>>> > We have a two-node postgresql 9.1 system configured using streaming
>>> > replication (active/active with a read-only slave).
>>> >
>>> > We want to automate the failover process and I couldn't really find
>>> > a resource agent that could do the job.
>>>
>>> That is correct; the pgsql resource agent (unlike its mysql
>>> counterpart) does not support streaming replication. We've had a
>>> contributor submit a patch at one point, but it was somewhat
>>> ill-conceived and thus did not make it into the upstream repo. The relevant thread is here:
>>>
>>> http://lists.linux-ha.org/pipermail/linux-ha-dev/2011-February/018195.html
>>>
>>> Would you feel comfortable modifying the pgsql resource agent to
>>> support replication? If so, we could revisit this issue and
>>> potentially add streaming replication support to pgsql.
>>>
>>>
>>> Well I'm not sure I would be able to do that change. Failover is
>>> relatively easy to do but I really have no idea how to do the failback part.
>>
>> And that's exactly the reason why I haven't implemented it yet. With
>> the way replication is currently done in PostgreSQL there is no easy
>> way to switch between roles, or at least I don't know of such a way.
>> Implementing just fail-over functionality by creating a trigger file
>> on a slave server in the case of failure on master side doesn't create
>> a full master-slave implementation in my opinion.
>>
>>>
>>> I will definitely have to sort this out somehow, I am just unsure
>>> whether I will try to use the repmgr mentioned in the video, or
>>> pacemaker with some level of customization...
>>>
>>> Is the resource agent that you mentioned available somewhere?
>>>
>>> Thanks.
>>> Attila
>>>
>>>
>>>
>>> _______________________________________________
>>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
>>
>>
>>
>> --
>> Serge Dubrouski.
>>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org