[Pacemaker] why pacemaker does not control the resources

Andrew Beekhof andrew at beekhof.net
Sun Nov 10 18:41:36 EST 2013


On 8 Nov 2013, at 7:49 am, Andrey Groshev <greenx at yandex.ru> wrote:

> Hi, PPL!
> I need help. I do not understand why this has stopped working.
> This configuration works on another cluster, but that one is on corosync 1.
> 
> So... a PostgreSQL cluster with master/slave.
> Classic config, as in the wiki.
> I built the cluster and started it, and it was working.
> Then I killed postgres on the master with signal 6, as if there were no disk space left:
> 
> # pkill -6 postgres
> # ps axuww|grep postgres
> root      9032  0.0  0.1 103236   860 pts/0    S+   00:37   0:00 grep postgres 
> 
> PostgreSQL dies, but crm_mon still shows the master as running.
> 
> Last updated: Fri Nov  8 00:42:08 2013
> Last change: Fri Nov  8 00:37:05 2013 via crm_attribute on dev-cluster2-node4
> Stack: corosync
> Current DC: dev-cluster2-node4 (172793107) - partition with quorum
> Version: 1.1.10-1.el6-368c726
> 3 Nodes configured
> 7 Resources configured
> 
> 
> Node dev-cluster2-node2 (172793105): online
>        pingCheck       (ocf::pacemaker:ping):  Started
>        pgsql   (ocf::heartbeat:pgsql): Started
> Node dev-cluster2-node3 (172793106): online
>        pingCheck       (ocf::pacemaker:ping):  Started
>        pgsql   (ocf::heartbeat:pgsql): Started
> Node dev-cluster2-node4 (172793107): online
>        pgsql   (ocf::heartbeat:pgsql): Master
>        pingCheck       (ocf::pacemaker:ping):  Started
>        VirtualIP       (ocf::heartbeat:IPaddr2):       Started
> 
> Node Attributes:
> * Node dev-cluster2-node2:
>    + default_ping_set                  : 100
>    + master-pgsql                      : -INFINITY 
>    + pgsql-data-status                 : STREAMING|ASYNC
>    + pgsql-status                      : HS:async  
> * Node dev-cluster2-node3:
>    + default_ping_set                  : 100
>    + master-pgsql                      : -INFINITY 
>    + pgsql-data-status                 : STREAMING|ASYNC
>    + pgsql-status                      : HS:async  
> * Node dev-cluster2-node4:
>    + default_ping_set                  : 100
>    + master-pgsql                      : 1000
>    + pgsql-data-status                 : LATEST    
>    + pgsql-master-baseline             : 0000000002000078
>    + pgsql-status                      : PRI
> 
> Migration summary:
> * Node dev-cluster2-node4: 
> * Node dev-cluster2-node2: 
> * Node dev-cluster2-node3: 
> 
> Tickets:
> 
> CONFIG:
> node $id="172793105" dev-cluster2-node2. \
>        attributes pgsql-data-status="STREAMING|ASYNC" standby="false"
> node $id="172793106" dev-cluster2-node3. \
>        attributes pgsql-data-status="STREAMING|ASYNC" standby="false"
> node $id="172793107" dev-cluster2-node4. \
>        attributes pgsql-data-status="LATEST"
> primitive VirtualIP ocf:heartbeat:IPaddr2 \
>        params ip="10.76.157.194" \
>        op start interval="0" timeout="60s" on-fail="stop" \
>        op monitor interval="10s" timeout="60s" on-fail="restart" \
>        op stop interval="0" timeout="60s" on-fail="block"
> primitive pgsql ocf:heartbeat:pgsql \
>        params pgctl="/usr/pgsql-9.1/bin/pg_ctl" psql="/usr/pgsql-9.1/bin/psql" \
>        pgdata="/var/lib/pgsql/9.1/data" tmpdir="/tmp/pg" start_opt="-p 5432" \
>        logfile="/var/lib/pgsql/9.1/pgstartup.log" rep_mode="async" \
>        node_list=" dev-cluster2-node2. dev-cluster2-node3. dev-cluster2-node4. " \
>        restore_command="gzip -cd /var/backup/pitr/dev-cluster2-master#5432/xlog/%f.gz > %p" \
>        primary_conninfo_opt="keepalives_idle=60 keepalives_interval=5 keepalives_count=5" \
>        master_ip="10.76.157.194" \
>        op start interval="0" timeout="60s" on-fail="restart" \
>        op monitor interval="5s" timeout="61s" on-fail="restart" \
>        op monitor interval="1s" role="Master" timeout="62s" on-fail="restart" \
>        op promote interval="0" timeout="63s" on-fail="restart" \
>        op demote interval="0" timeout="64s" on-fail="stop" \
>        op stop interval="0" timeout="65s" on-fail="block" \
>        op notify interval="0" timeout="66s"
> primitive pingCheck ocf:pacemaker:ping \
>        params name="default_ping_set" host_list="10.76.156.1" multiplier="100" \
>        op start interval="0" timeout="60s" on-fail="restart" \
>        op monitor interval="10s" timeout="60s" on-fail="restart" \
>        op stop interval="0" timeout="60s" on-fail="ignore"
> ms msPostgresql pgsql \
>        meta master-max="1" master-node-max="1" clone-node-max="1" notify="true" target-role="Master" clone-max="3"
> clone clnPingCheck pingCheck \
>        meta clone-max="3"
> location l0_DontRunPgIfNotPingGW msPostgresql \
>        rule $id="l0_DontRunPgIfNotPingGW-rule" -inf: not_defined default_ping_set or default_ping_set lt 100
> colocation r0_StartPgIfPingGW inf: msPostgresql clnPingCheck
> colocation r1_MastersGroup inf: VirtualIP msPostgresql:Master
> order rsc_order-1 0: clnPingCheck msPostgresql
> order rsc_order-2 0: msPostgresql:promote VirtualIP:start symmetrical=false
> order rsc_order-3 0: msPostgresql:demote VirtualIP:stop symmetrical=false
> property $id="cib-bootstrap-options" \
>        dc-version="1.1.10-1.el6-368c726" \
>        cluster-infrastructure="corosync" \
>        stonith-enabled="false" \
>        no-quorum-policy="stop"
> rsc_defaults $id="rsc-options" \
>        resource-stickiness="INFINITY" \
>        migration-threshold="1"
> 
> 
> 
> 
> Tell me where to look: why is pacemaker not reacting?

You might want to follow some of the steps at:

   http://blog.clusterlabs.org/blog/2013/debugging-pacemaker/

under the heading "Resource-level failures".
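
The basic idea there is to take the cluster out of the loop and run the
resource agent by hand, checking its exit code. A rough sketch for the
pgsql agent in this config (parameter values copied from the config above;
the paths are the usual resource-agents locations, adjust to your install):

   # run the monitor action directly, as the cluster would
   export OCF_ROOT=/usr/lib/ocf
   export OCF_RESKEY_pgctl=/usr/pgsql-9.1/bin/pg_ctl
   export OCF_RESKEY_psql=/usr/pgsql-9.1/bin/psql
   export OCF_RESKEY_pgdata=/var/lib/pgsql/9.1/data
   /usr/lib/ocf/resource.d/heartbeat/pgsql monitor
   echo "monitor rc: $?"   # 0 = running, 7 = not running, 8 = running as master

If postgres is dead but the monitor still returns 0 (or 8), the agent is
not detecting the failure, and Pacemaker has nothing to react to.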


'crm_mon -o' might be a good source of information too.
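
For example, a one-shot view that also shows operation history and fail
counts:

   crm_mon -1 -o -f

The operation list includes the return code of the last monitor for each
resource on each node; if pgsql_monitor on dev-cluster2-node4 still shows
rc=8 (running as master) after the kill, the failure was never reported
to the cluster. The system log (/var/log/messages by default on EL6)
should also have matching entries from lrmd and the pgsql agent.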


