[Pacemaker] why pacemaker does not control the resources

Tue Nov 12 00:42:20 EST 2013

11.11.2013, 03:44, "Andrew Beekhof" <andrew at beekhof.net>:
> On 8 Nov 2013, at 7:49 am, Andrey Groshev <greenx at yandex.ru> wrote:
>
>>  Hi, PPL!
>>  I need help. I do not understand why it has stopped working.
>>  This configuration works on another cluster, but that one is on corosync1.
>>
>>  So... a postgres cluster with master/slave.
>>  Classic config, as in the wiki.
>>  I build the cluster and start it, and it works.
>>  Next I kill postgres on the master with signal 6, as if there were no disk space left.
>>
>>  # pkill -6 postgres
>>  # ps axuww|grep postgres
>>  root      9032  0.0  0.1 103236   860 pts/0    S+   00:37   0:00 grep postgres
>>
>>  PostgreSQL dies, but crm_mon shows that the master is still running.
>>
>>  Last updated: Fri Nov  8 00:42:08 2013
>>  Last change: Fri Nov  8 00:37:05 2013 via crm_attribute on dev-cluster2-node4
>>  Stack: corosync
>>  Current DC: dev-cluster2-node4 (172793107) - partition with quorum
>>  Version: 1.1.10-1.el6-368c726
>>  3 Nodes configured
>>  7 Resources configured
>>
>>  Node dev-cluster2-node2 (172793105): online
>>         pingCheck       (ocf::pacemaker:ping):  Started
>>         pgsql   (ocf::heartbeat:pgsql): Started
>>  Node dev-cluster2-node3 (172793106): online
>>         pingCheck       (ocf::pacemaker:ping):  Started
>>         pgsql   (ocf::heartbeat:pgsql): Started
>>  Node dev-cluster2-node4 (172793107): online
>>         pgsql   (ocf::heartbeat:pgsql): Master
>>         pingCheck       (ocf::pacemaker:ping):  Started
>>         VirtualIP       (ocf::heartbeat:IPaddr2):       Started
>>
>>  Node Attributes:
>>  * Node dev-cluster2-node2:
>>     + default_ping_set                  : 100
>>     + master-pgsql                      : -INFINITY
>>     + pgsql-data-status                 : STREAMING|ASYNC
>>     + pgsql-status                      : HS:async
>>  * Node dev-cluster2-node3:
>>     + default_ping_set                  : 100
>>     + master-pgsql                      : -INFINITY
>>     + pgsql-data-status                 : STREAMING|ASYNC
>>     + pgsql-status                      : HS:async
>>  * Node dev-cluster2-node4:
>>     + default_ping_set                  : 100
>>     + master-pgsql                      : 1000
>>     + pgsql-data-status                 : LATEST
>>     + pgsql-master-baseline             : 0000000002000078
>>     + pgsql-status                      : PRI
>>
>>  Migration summary:
>>  * Node dev-cluster2-node4:
>>  * Node dev-cluster2-node2:
>>  * Node dev-cluster2-node3:
>>
>>  Tickets:
>>
>>  CONFIG:
>>  node $id="172793105" dev-cluster2-node2. \
>>         attributes pgsql-data-status="STREAMING|ASYNC" standby="false"
>>  node $id="172793106" dev-cluster2-node3. \
>>         attributes pgsql-data-status="STREAMING|ASYNC" standby="false"
>>  node $id="172793107" dev-cluster2-node4. \
>>         attributes pgsql-data-status="LATEST"
>>  primitive VirtualIP ocf:heartbeat:IPaddr2 \
>>         params ip="10.76.157.194" \
>>         op start interval="0" timeout="60s" on-fail="stop" \
>>         op monitor interval="10s" timeout="60s" on-fail="restart" \
>>         op stop interval="0" timeout="60s" on-fail="block"
>>  primitive pgsql ocf:heartbeat:pgsql \
>>         params pgctl="/usr/pgsql-9.1/bin/pg_ctl" psql="/usr/pgsql-9.1/bin/psql" pgdata="/var/lib/pgsql/9.1/data" tmpdir="/tmp/pg" start_opt="-p 5432" logfile="/var/lib/pgsql/9.1//pgstartup.log" rep_mode="async" node_list=" dev-cluster2-node2. dev-cluster2-node3. dev-cluster2-node4. " restore_command="gzip -cd /var/backup/pitr/dev-cluster2-master#5432/xlog/%f.gz > %p" primary_conninfo_opt="keepalives_idle=60 keepalives_interval=5 keepalives_count=5" master_ip="10.76.157.194" \
>>         op start interval="0" timeout="60s" on-fail="restart" \
>>         op monitor interval="5s" timeout="61s" on-fail="restart" \
>>         op monitor interval="1s" role="Master" timeout="62s" on-fail="restart" \
>>         op promote interval="0" timeout="63s" on-fail="restart" \
>>         op demote interval="0" timeout="64s" on-fail="stop" \
>>         op stop interval="0" timeout="65s" on-fail="block" \
>>         op notify interval="0" timeout="66s"
>>  primitive pingCheck ocf:pacemaker:ping \
>>         params name="default_ping_set" host_list="10.76.156.1" multiplier="100" \
>>         op start interval="0" timeout="60s" on-fail="restart" \
>>         op monitor interval="10s" timeout="60s" on-fail="restart" \
>>         op stop interval="0" timeout="60s" on-fail="ignore"
>>  ms msPostgresql pgsql \
>>         meta master-max="1" master-node-max="1" clone-node-max="1" notify="true" target-role="Master" clone-max="3"
>>  clone clnPingCheck pingCheck \
>>         meta clone-max="3"
>>  location l0_DontRunPgIfNotPingGW msPostgresql \
>>         rule $id="l0_DontRunPgIfNotPingGW-rule" -inf: not_defined default_ping_set or default_ping_set lt 100
>>  colocation r0_StartPgIfPingGW inf: msPostgresql clnPingCheck
>>  colocation r1_MastersGroup inf: VirtualIP msPostgresql:Master
>>  order rsc_order-1 0: clnPingCheck msPostgresql
>>  order rsc_order-2 0: msPostgresql:promote VirtualIP:start symmetrical=false
>>  order rsc_order-3 0: msPostgresql:demote VirtualIP:stop symmetrical=false
>>  property $id="cib-bootstrap-options" \
>>         dc-version="1.1.10-1.el6-368c726" \
>>         cluster-infrastructure="corosync" \
>>         stonith-enabled="false" \
>>         no-quorum-policy="stop"
>>  rsc_defaults $id="rsc-options" \
>>         resource-stickiness="INFINITY" \
>>         migration-threshold="1"
>>
>>  Please tell me where to look. Why does pacemaker not work?
>
> You might want to follow some of the steps at:
>
>    http://blog.clusterlabs.org/blog/2013/debugging-pacemaker/
>
> under the heading "Resource-level failures".

Yes. Thank you.
I had seen this article, and I am now studying it in more detail.
There is a lot of information in the logs, so it is difficult to tell which messages are the error itself and which are only consequences of it.
I am trying to figure that out now.

BUT...
For now I can say with certainty that the RA's monitor operation for the MS resource (pgsql) is called ONLY on the node where PACEMAKER was started last.
>
> 'crm_mon -o' might be a good source of information too.
That is why it shows my resources as supposedly functioning normally.

# crm_mon -o1 
Last updated: Tue Nov 12 09:27:16 2013
Last change: Tue Nov 12 00:08:35 2013 via crm_attribute on dev-cluster2-node2
Stack: corosync
Current DC: dev-cluster2-node2 (172793105) - partition with quorum
Version: 1.1.10-1.el6-368c726
3 Nodes configured
337 Resources configured

Online: [ dev-cluster2-node2 dev-cluster2-node3 dev-cluster2-node4 ]

 Clone Set: clonePing [pingCheck]
     Started: [ dev-cluster2-node2 dev-cluster2-node3 dev-cluster2-node4 ]
 Master/Slave Set: msPgsql [pgsql]
     Masters: [ dev-cluster2-node2 ]
     Slaves: [ dev-cluster2-node3 dev-cluster2-node4 ]
 VirtualIP      (ocf::heartbeat:IPaddr2):       Started dev-cluster2-node2

Operations:
* Node dev-cluster2-node2:
   pingCheck: migration-threshold=1
    + (20) start: rc=0 (ok)
    + (23) monitor: interval=10000ms rc=0 (ok)
   pgsql: migration-threshold=1
    + (41) promote: rc=0 (ok)
    + (87) monitor: interval=1000ms rc=8 (master)
   VirtualIP: migration-threshold=1
    + (49) start: rc=0 (ok)
    + (52) monitor: interval=10000ms rc=0 (ok)
* Node dev-cluster2-node3:
   pingCheck: migration-threshold=1
    + (20) start: rc=0 (ok)
    + (23) monitor: interval=10000ms rc=0 (ok)
   pgsql: migration-threshold=1
    + (26) start: rc=0 (ok)
    + (32) monitor: interval=10000ms rc=0 (ok)
* Node dev-cluster2-node4:
   pingCheck: migration-threshold=1
    + (20) start: rc=0 (ok)
    + (23) monitor: interval=10000ms rc=0 (ok)
   pgsql: migration-threshold=1
    + (26) start: rc=0 (ok)
    + (32) monitor: interval=10000ms rc=0 (ok)
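
To compare the recorded operations between nodes without reading the listing by eye, the "Operations:" section above can be turned into records. This is only a minimal sketch: parse_operations and SAMPLE are hypothetical names of mine, and the regular expressions assume the line layout printed by this crm_mon version (1.1.10); other versions may format it differently.

```python
import re

# Patterns for the three kinds of lines in the "Operations:" section.
NODE_LINE = re.compile(r"^\* Node (?P<node>\S+):")
RSC_LINE = re.compile(r"^\s+(?P<rsc>[\w-]+): migration-threshold=")
OP_LINE = re.compile(
    r"^\s*\+ \((?P<call_id>\d+)\) (?P<op>\w+):"
    r"(?: interval=(?P<interval_ms>\d+)ms)? rc=(?P<rc>\d+) \((?P<text>[^)]*)\)"
)


def parse_operations(text):
    """Return one dict per operation listed after the 'Operations:' header."""
    records = []
    node = rsc = None
    in_ops = False
    for line in text.splitlines():
        if line.strip() == "Operations:":
            in_ops = True
            continue
        if not in_ops:
            continue
        m = NODE_LINE.match(line)
        if m:
            node, rsc = m.group("node"), None
            continue
        m = RSC_LINE.match(line)
        if m:
            rsc = m.group("rsc")
            continue
        m = OP_LINE.match(line)
        if m and node and rsc:
            interval = m.group("interval_ms")
            records.append({
                "node": node,
                "resource": rsc,
                "call_id": int(m.group("call_id")),
                "op": m.group("op"),
                # One-shot operations (start, promote) have no interval.
                "interval_ms": int(interval) if interval else None,
                "rc": int(m.group("rc")),
                "rc_text": m.group("text"),
            })
    return records


# Excerpt copied from the crm_mon -o1 output above.
SAMPLE = """\
Operations:
* Node dev-cluster2-node2:
   pgsql: migration-threshold=1
    + (41) promote: rc=0 (ok)
    + (87) monitor: interval=1000ms rc=8 (master)
* Node dev-cluster2-node3:
   pgsql: migration-threshold=1
    + (26) start: rc=0 (ok)
    + (32) monitor: interval=10000ms rc=0 (ok)
"""

if __name__ == "__main__":
    for rec in parse_operations(SAMPLE):
        if rec["op"] == "monitor":
            print(rec["node"], rec["resource"], rec["interval_ms"], rec["rc"])
```

The number in parentheses is kept as call_id, so the last recorded monitor calls can be compared between nodes.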

In reality, the PG master and the penultimate PG slave have now been killed (signal 4 or 6).
IMHO, even if I have configured something incorrectly, the inability to monitor a resource should cause a fatal error.
Or is there a reason not to do so?
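
The mismatch I mean can be written down as a small check. It is a rough sketch under assumptions: stale_monitors is a hypothetical helper, the rc values are the last monitor results from the crm_mon -o1 output above, and the PROCESS_ALIVE map is filled in by hand (for example from ps on each node). I take "the penultimate slave" to be node3 here, which is only an assumption for the illustration. In the OCF return codes, 0 means running, 8 means running as master, and 7 means not running.

```python
# OCF monitor return codes that claim the resource is up.
RUNNING_CODES = {0, 8}  # 0 = running, 8 = running as master


def stale_monitors(last_monitor_rc, process_alive):
    """Return the nodes whose last recorded monitor result says 'running'
    although the process was observed to be dead on that node."""
    return sorted(
        node
        for node, rc in last_monitor_rc.items()
        if rc in RUNNING_CODES and not process_alive.get(node, False)
    )


# Last recorded pgsql monitor rc per node, from the crm_mon -o1 output.
LAST_MONITOR_RC = {
    "dev-cluster2-node2": 8,
    "dev-cluster2-node3": 0,
    "dev-cluster2-node4": 0,
}

# Observed by hand; hypothetical values for illustration only.
PROCESS_ALIVE = {
    "dev-cluster2-node2": False,  # master, killed
    "dev-cluster2-node3": False,  # assumed to be the killed slave
    "dev-cluster2-node4": True,
}

if __name__ == "__main__":
    for node in stale_monitors(LAST_MONITOR_RC, PROCESS_ALIVE):
        print(node, "is reported as running but postgres is dead")
```

Every node this returns is one where a monitor that actually ran should have reported something other than 0 or 8.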