[Pacemaker] why pacemaker does not control the resources
Andrey Groshev
greenx at yandex.ru
Wed Nov 13 19:13:03 UTC 2013
13.11.2013, 03:22, "Andrew Beekhof" <andrew at beekhof.net>:
> On 12 Nov 2013, at 4:42 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>
>> 11.11.2013, 03:44, "Andrew Beekhof" <andrew at beekhof.net>:
>>> On 8 Nov 2013, at 7:49 am, Andrey Groshev <greenx at yandex.ru> wrote:
>>>> Hi, all!
>>>> I need help. I do not understand why this has stopped working.
>>>> The same configuration works on another cluster, but that one runs corosync 1.
>>>>
>>>> So: a PostgreSQL cluster with master/slave.
>>>> Classic configuration, as in the wiki.
>>>> I build the cluster and start it, and it works.
>>>> Next I kill postgres on the master with signal 6, as if there were no disk space left:
>>>>
>>>> # pkill -6 postgres
>>>> # ps axuww|grep postgres
>>>> root 9032 0.0 0.1 103236 860 pts/0 S+ 00:37 0:00 grep postgres
>>>>
>>>> PostgreSQL dies, but crm_mon shows that the master is still running.
>>>>
>>>> Last updated: Fri Nov 8 00:42:08 2013
>>>> Last change: Fri Nov 8 00:37:05 2013 via crm_attribute on dev-cluster2-node4
>>>> Stack: corosync
>>>> Current DC: dev-cluster2-node4 (172793107) - partition with quorum
>>>> Version: 1.1.10-1.el6-368c726
>>>> 3 Nodes configured
>>>> 7 Resources configured
>>>>
>>>> Node dev-cluster2-node2 (172793105): online
>>>> pingCheck (ocf::pacemaker:ping): Started
>>>> pgsql (ocf::heartbeat:pgsql): Started
>>>> Node dev-cluster2-node3 (172793106): online
>>>> pingCheck (ocf::pacemaker:ping): Started
>>>> pgsql (ocf::heartbeat:pgsql): Started
>>>> Node dev-cluster2-node4 (172793107): online
>>>> pgsql (ocf::heartbeat:pgsql): Master
>>>> pingCheck (ocf::pacemaker:ping): Started
>>>> VirtualIP (ocf::heartbeat:IPaddr2): Started
>>>>
>>>> Node Attributes:
>>>> * Node dev-cluster2-node2:
>>>> + default_ping_set : 100
>>>> + master-pgsql : -INFINITY
>>>> + pgsql-data-status : STREAMING|ASYNC
>>>> + pgsql-status : HS:async
>>>> * Node dev-cluster2-node3:
>>>> + default_ping_set : 100
>>>> + master-pgsql : -INFINITY
>>>> + pgsql-data-status : STREAMING|ASYNC
>>>> + pgsql-status : HS:async
>>>> * Node dev-cluster2-node4:
>>>> + default_ping_set : 100
>>>> + master-pgsql : 1000
>>>> + pgsql-data-status : LATEST
>>>> + pgsql-master-baseline : 0000000002000078
>>>> + pgsql-status : PRI
>>>>
>>>> Migration summary:
>>>> * Node dev-cluster2-node4:
>>>> * Node dev-cluster2-node2:
>>>> * Node dev-cluster2-node3:
>>>>
>>>> Tickets:
>>>>
>>>> CONFIG:
>>>> node $id="172793105" dev-cluster2-node2. \
>>>> attributes pgsql-data-status="STREAMING|ASYNC" standby="false"
>>>> node $id="172793106" dev-cluster2-node3. \
>>>> attributes pgsql-data-status="STREAMING|ASYNC" standby="false"
>>>> node $id="172793107" dev-cluster2-node4. \
>>>> attributes pgsql-data-status="LATEST"
>>>> primitive VirtualIP ocf:heartbeat:IPaddr2 \
>>>> params ip="10.76.157.194" \
>>>> op start interval="0" timeout="60s" on-fail="stop" \
>>>> op monitor interval="10s" timeout="60s" on-fail="restart" \
>>>> op stop interval="0" timeout="60s" on-fail="block"
>>>> primitive pgsql ocf:heartbeat:pgsql \
>>>> params pgctl="/usr/pgsql-9.1/bin/pg_ctl" psql="/usr/pgsql-9.1/bin/psql" pgdata="/var/lib/pgsql/9.1/data" tmpdir="/tmp/pg" start_opt="-p 5432" logfile="/var/lib/pgsql/9.1//pgstartup.log" rep_mode="async" node_list=" dev-cluster2-node2. dev-cluster2-node3. dev-cluster2-node4. " restore_command="gzip -cd /var/backup/pitr/dev-cluster2-master#5432/xlog/%f.gz > %p" primary_conninfo_opt="keepalives_idle=60 keepalives_interval=5 keepalives_count=5" master_ip="10.76.157.194" \
>>>> op start interval="0" timeout="60s" on-fail="restart" \
>>>> op monitor interval="5s" timeout="61s" on-fail="restart" \
>>>> op monitor interval="1s" role="Master" timeout="62s" on-fail="restart" \
>>>> op promote interval="0" timeout="63s" on-fail="restart" \
>>>> op demote interval="0" timeout="64s" on-fail="stop" \
>>>> op stop interval="0" timeout="65s" on-fail="block" \
>>>> op notify interval="0" timeout="66s"
>>>> primitive pingCheck ocf:pacemaker:ping \
>>>> params name="default_ping_set" host_list="10.76.156.1" multiplier="100" \
>>>> op start interval="0" timeout="60s" on-fail="restart" \
>>>> op monitor interval="10s" timeout="60s" on-fail="restart" \
>>>> op stop interval="0" timeout="60s" on-fail="ignore"
>>>> ms msPostgresql pgsql \
>>>> meta master-max="1" master-node-max="1" clone-node-max="1" notify="true" target-role="Master" clone-max="3"
>>>> clone clnPingCheck pingCheck \
>>>> meta clone-max="3"
>>>> location l0_DontRunPgIfNotPingGW msPostgresql \
>>>> rule $id="l0_DontRunPgIfNotPingGW-rule" -inf: not_defined default_ping_set or default_ping_set lt 100
>>>> colocation r0_StartPgIfPingGW inf: msPostgresql clnPingCheck
>>>> colocation r1_MastersGroup inf: VirtualIP msPostgresql:Master
>>>> order rsc_order-1 0: clnPingCheck msPostgresql
>>>> order rsc_order-2 0: msPostgresql:promote VirtualIP:start symmetrical=false
>>>> order rsc_order-3 0: msPostgresql:demote VirtualIP:stop symmetrical=false
>>>> property $id="cib-bootstrap-options" \
>>>> dc-version="1.1.10-1.el6-368c726" \
>>>> cluster-infrastructure="corosync" \
>>>> stonith-enabled="false" \
>>>> no-quorum-policy="stop"
>>>> rsc_defaults $id="rsc-options" \
>>>> resource-stickiness="INFINITY" \
>>>> migration-threshold="1"
>>>>
>>>> Tell me where to look - why does pacemaker not work?
>>> You might want to follow some of the steps at:
>>>
>>> http://blog.clusterlabs.org/blog/2013/debugging-pacemaker/
>>>
>>> under the heading "Resource-level failures".
>> Yes. Thank you.
>> I have seen that article and am now studying it in more detail.
>> There is a lot of information in the logs, so it is hard to tell which messages are the error and which are just its consequences.
>> Now I'm trying to figure it out.
>>
>> BUT...
>> For now I can say with certainty that the monitor action of the RA in the MS resource (pgsql) is called ONLY on the node where Pacemaker was started last.
>
> It looks like you're hitting https://github.com/beekhof/pacemaker/commit/58962338
> Since you appear to be on rhel6 (or a clone of rhel6), can I suggest you use the 1.1.10 packages that come with 6.4?
> They include the above patch.
I already use (built from source two weeks ago):
* pacemaker 1.1.10
* resource-agents 3.9.5
* corosync 2.3.2
* libqb 0.16
on CentOS 6.4.
The same config works on pacemaker 1.1.9/corosync 1.4.5 -
not ideal, but without this problem.
My first idea was to move target-role="Master" from the MS resource to the pgsql primitive,
and that does even work -
but after a crash that kills the main PostgreSQL process, the same thing happens again.
Today's experiments showed that this behavior starts once I add notify="true" to the MS resource,
but the pgsql primitive does not work properly without the "notify" messages.
For now I am stuck and frustrated :(
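
One thing from the debugging article that I still want to try is calling the agent's monitor action by hand on the master node. This is only a rough sketch of the OCF calling convention (the action as the first argument, parameters passed as OCF_RESKEY_* environment variables); the paths are copied from my pgsql primitive, and the agent may need more variables than shown here:

# sketch only: run the pgsql agent's monitor action directly, the way lrmd would
export OCF_ROOT=/usr/lib/ocf
export OCF_RESKEY_pgctl="/usr/pgsql-9.1/bin/pg_ctl"
export OCF_RESKEY_psql="/usr/pgsql-9.1/bin/psql"
export OCF_RESKEY_pgdata="/var/lib/pgsql/9.1/data"
/usr/lib/ocf/resource.d/heartbeat/pgsql monitor ; echo "rc=$?"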
> Also, just to be sure. Are you expecting monitor operations to detect when you started a resource manually?
> If so, you'll need a monitor operation with role=Stopped. We don't do that by default.
I expect the resources to be monitored all the time - otherwise how can the cluster control them?
Or perhaps I do not quite understand the question.
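
If you mean detecting a resource that was started by hand outside the cluster, then (if I understand the documentation correctly) that would mean adding a Stopped-role monitor to the pgsql primitive, alongside the existing monitor ops, something like this - the interval and timeout values here are only illustrative:

op monitor interval="30s" role="Stopped" timeout="60s"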
>>> 'crm_mon -o' might be a good source of information too.
>> There I see that my resources are allegedly functioning normally:
>>
>> # crm_mon -o1
>> Last updated: Tue Nov 12 09:27:16 2013
>> Last change: Tue Nov 12 00:08:35 2013 via crm_attribute on dev-cluster2-node2
>> Stack: corosync
>> Current DC: dev-cluster2-node2 (172793105) - partition with quorum
>> Version: 1.1.10-1.el6-368c726
>> 3 Nodes configured
>> 337 Resources configured
>>
>> Online: [ dev-cluster2-node2 dev-cluster2-node3 dev-cluster2-node4 ]
>>
>> Clone Set: clonePing [pingCheck]
>> Started: [ dev-cluster2-node2 dev-cluster2-node3 dev-cluster2-node4 ]
>> Master/Slave Set: msPgsql [pgsql]
>> Masters: [ dev-cluster2-node2 ]
>> Slaves: [ dev-cluster2-node3 dev-cluster2-node4 ]
>> VirtualIP (ocf::heartbeat:IPaddr2): Started dev-cluster2-node2
>>
>> Operations:
>> * Node dev-cluster2-node2:
>> pingCheck: migration-threshold=1
>> + (20) start: rc=0 (ok)
>> + (23) monitor: interval=10000ms rc=0 (ok)
>> pgsql: migration-threshold=1
>> + (41) promote: rc=0 (ok)
>> + (87) monitor: interval=1000ms rc=8 (master)
>> VirtualIP: migration-threshold=1
>> + (49) start: rc=0 (ok)
>> + (52) monitor: interval=10000ms rc=0 (ok)
>> * Node dev-cluster2-node3:
>> pingCheck: migration-threshold=1
>> + (20) start: rc=0 (ok)
>> + (23) monitor: interval=10000ms rc=0 (ok)
>> pgsql: migration-threshold=1
>> + (26) start: rc=0 (ok)
>> + (32) monitor: interval=10000ms rc=0 (ok)
>> * Node dev-cluster2-node4:
>> pingCheck: migration-threshold=1
>> + (20) start: rc=0 (ok)
>> + (23) monitor: interval=10000ms rc=0 (ok)
>> pgsql: migration-threshold=1
>> + (26) start: rc=0 (ok)
>> + (32) monitor: interval=10000ms rc=0 (ok)
>>
>> In reality, I have by now killed (with signal 4 or 6) the PostgreSQL master and the penultimate slave.
>> IMHO, even if I have something configured incorrectly, the inability to monitor a resource should cause a fatal error.
>> Or is there a reason not to do so?
>>
>> _______________________________________________
>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org