[ClusterLabs] Trouble starting up PAF cluster for first time

Fri Apr 6 11:23:01 EDT 2018

Please forgive me if this message is a duplicate. I sent it yesterday, but it is not showing up on the mailing list or in the archives, so I'm trying a second time.

I'm using this resource agent:  http://clusterlabs.github.io/PAF

I'm trying to set up a 3-node cluster.

I install PostgreSQL on all 3 nodes, manually bring up the VIP on the master node, and bring up the 2 standbys, which connect to the master through that VIP.  I can confirm that everything is working correctly at this point.

Then, in preparation for setting up Pacemaker, I stop PostgreSQL on all nodes, and remove the VIP from what was the master server.
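
For reference, the manual VIP handling described above can be sketched roughly as follows. This is only an illustration under assumptions: the address and prefix are taken from the IPaddr2 resource definition in this message, the interface name `eth0` is a guess that is not stated anywhere here, and the helper names `run`, `vip_up`, and `vip_down` are hypothetical. By default the script only prints the commands it would run.

```shell
#!/bin/sh
# Sketch of adding/removing the master VIP by hand.
VIP="10.124.167.176"     # ip= value from the IPaddr2 resource definition
PREFIX="22"              # cidr_netmask= value from the same definition
IFACE="eth0"             # ASSUMPTION: the interface name is not given in this message
DRY_RUN="${DRY_RUN:-1}"  # 1 = only print commands, 0 = execute them (needs root)

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Bring the VIP up on this node (done on the master before starting the standbys).
vip_up() { run ip addr add "$VIP/$PREFIX" dev "$IFACE"; }

# Remove the VIP again (done before handing control to Pacemaker).
vip_down() { run ip addr del "$VIP/$PREFIX" dev "$IFACE"; }
```

Sourcing the file and calling `vip_up` or `vip_down` prints the corresponding `ip addr` command; setting `DRY_RUN=0` would actually run it.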

At this point I attempt to initialize the cluster using the following PCS steps:

------
pcs cluster auth node1 node2 node3 -u hacluster
pcs cluster setup --name test --force node1 node2 node3
pcs cluster start --all
pcs cluster cib cluster1.xml
pcs -f cluster1.xml property set stonith-enabled=false
pcs -f cluster1.xml resource defaults migration-threshold=5
pcs -f cluster1.xml resource defaults resource-stickiness=10
pcs -f cluster1.xml resource create postgresql-master-vip ocf:heartbeat:IPaddr2 ip=10.124.167.176 cidr_netmask=22 op monitor interval=10s
pcs -f cluster1.xml resource create postgresql-10-main ocf:heartbeat:pgsqlms \
    bindir="/usr/lib/postgresql/10/bin" \
    pgdata="/var/lib/postgresql/10/main" \
    pgport=5432 \
    recovery_template="/etc/postgresql/10/main/recovery.conf.pcmk" \
    start_opts="-c config_file=/etc/postgresql/10/main/postgresql.conf" \
    op start timeout=60s \
    op stop timeout=60s \
    op promote timeout=30s \
    op demote timeout=120s \
    op monitor interval=15s timeout=10s role="Master" \
    op monitor interval=16s timeout=10s role="Slave" \
    op notify timeout=60s
pcs -f cluster1.xml resource master postgresql-ha postgresql-10-main notify=true target-role="Master"
pcs -f cluster1.xml constraint colocation add postgresql-master-vip with master postgresql-ha INFINITY id=postgresql-master-with-vip
pcs -f cluster1.xml constraint order promote postgresql-ha then start postgresql-master-vip symmetrical=false kind=Mandatory id=promote-postgresql-then-add-vip
pcs -f cluster1.xml constraint order demote postgresql-ha then stop postgresql-master-vip symmetrical=false kind=Mandatory id=demote-postgresql-then-remove-vip
pcs cluster cib-push cluster1.xml
------

Then, when I check `pcs status` a few seconds later, I see errors starting PostgreSQL on all nodes, because they cannot connect to the master VIP.  It seems that Pacemaker is copying a recovery.conf into the PostgreSQL data directory on the master server as well, so it tries to start up as a slave, and the VIP is not brought up anywhere.
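
One way to check the recovery.conf observation on each node is a small script along these lines. It is a minimal sketch, not part of the original setup: the data directory is the `pgdata` value from the resource definition, and the function name `startup_role` is hypothetical. It relies on the fact that PostgreSQL 10 starts in recovery (as a standby when `standby_mode` is on) whenever a recovery.conf file is present in the data directory.

```shell
#!/bin/sh
# pgdata= value from the pgsqlms resource definition; override via PGDATA_DIR if needed.
PGDATA_DIR="${PGDATA_DIR:-/var/lib/postgresql/10/main}"

# Print "recovery" if the given data directory contains a recovery.conf,
# otherwise print "primary".
startup_role() {
    if [ -f "$1/recovery.conf" ]; then
        echo "recovery"
    else
        echo "primary"
    fi
}

# Report how PostgreSQL would start on this node.
startup_role "$PGDATA_DIR"
```

Running it on the intended master after the failed start would show whether a recovery.conf is present there too.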

How do I get Pacemaker to leave the master server in master status and bring the VIP up there first?  I'm really not sure what I'm doing wrong above...

Worst of all, at this point `pcs cluster stop` hangs forever, and I have to forcibly kill pacemakerd and the other pacemaker daemons along with the PostgreSQL master process on each node.

Thanks in advance for any help that you might be able to offer,
-- 
Casey