[Pacemaker] [SOLVED] RE: Slave does not start after failover: Mysql circular replication and master-slave resources
Attila Megyeri
amegyeri at minerva-soft.com
Tue Dec 20 08:29:21 UTC 2011
Hi Andreas,
-----Original Message-----
From: Andreas Kurz [mailto:andreas at hastexo.com]
Sent: 19 December 2011 15:19
To: pacemaker at oss.clusterlabs.org
Subject: Re: [Pacemaker] [SOLVED] RE: Slave does not start after failover: Mysql circular replication and master-slave resources
On 12/17/2011 10:51 AM, Attila Megyeri wrote:
> Hi all,
>
> For anyone interested.
> I finally made the mysql replication work. For some strange reason there were no [mysql] log entries at all, neither in corosync.log nor in syslog. Only after a couple of corosync restarts (?!) did [mysql] RA debug/error entries start to show up.
> The issue was that the slave could not apply the binary logs because of duplicate-key errors (MySQL error 1062). I am not sure how this could happen, but the solution was to ignore these errors on the slaves by adding the following line to my.cnf:
>
> slave-skip-errors = 1062
although you use different "auto-increment-offset" values?
Yes... I am actually quite surprised that this can happen. The slave had already applied the binlog, but for some reason it wanted to execute it again.
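(For anyone hitting the same duplicate-key problem: instead of a blanket skip one can also step over a single offending event by hand on the slave, roughly like this - the credentials are just placeholders:

# on the slave: skip exactly one event from the relay log and resume replication
mysql -u root -p -e "STOP SLAVE; SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1; START SLAVE;"
# verify that the SQL thread is running again and no new error is reported
mysql -u root -p -e "SHOW SLAVE STATUS\G" | egrep 'Slave_SQL_Running|Last_Errno|Last_Error'

slave-skip-errors = 1062 simply makes that skipping permanent for all duplicate-key errors.)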
>
> I hope this helps some of you guys as well.
>
> P.S. Did anyone else notice missing mysql debug/info/error entries in corosync log as well?
There is no RA output/log in any of your syslogs? ... in the absence of a connected tty and with no logd configured, logger should feed all RA logs to syslog ... what is your distribution, any "fancy" syslog configuration?
My system is running Debian squeeze with pacemaker 1.1.5 from the squeeze backports. The syslog configuration is standard, no extras. I have noticed this strange behavior (RAs not logging anything) many times - not only for the mysql resource but also for postgres. E.g. I added an ocf_log call at the entry point of the RA, just to log when the script is executed and with what parameters - but I did not see any "monitor" invocations either.
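For illustration, this is roughly the kind of line I mean (in the RA version I looked at the action is passed as the first argument, but that may differ), plus a quick check that logger-based messages reach syslog at all:

# added right after the RA sources the OCF shell functions (illustrative):
ocf_log info "mysql RA invoked with action=$1"

# sanity check that logger output ends up in syslog on this box:
logger -t ocf-test "test message"
tail -n 5 /var/log/syslog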
Now it works fine, but this is not an absolutely stable setup.
One other very disturbing issue is that sometimes corosync and some of the heartbeat processes get stuck at 100% CPU, and only a restart or kill -9 helps. :(
Cheers,
Attila
Regards,
Andreas
--
Need help with Pacemaker?
http://www.hastexo.com/now
>
> Cheers,
> Attila
>
>
> -----Original Message-----
> From: Attila Megyeri [mailto:amegyeri at minerva-soft.com]
> Sent: 16 December 2011 12:39
> To: The Pacemaker cluster resource manager
> Subject: Re: [Pacemaker] Slave does not start after failover: Mysql
> circular replication and master-slave resources
>
> Hi Andreas,
>
> The slave lag cannot be high, as the slave was restarted within 1-2 mins and there are no active users on the system yet.
> I did not find anything at all in the logs.
>
> I will double-check whether the RA is the latest version.
>
> Thanks,
>
> Attila
>
>
> -----Original Message-----
> From: Andreas Kurz [mailto:andreas at hastexo.com]
> Sent: 16 December 2011 1:50
> To: pacemaker at oss.clusterlabs.org
> Subject: Re: [Pacemaker] Slave does not start after failover: Mysql
> circular replication and master-slave resources
>
> Hello Attila,
>
> ... see below ...
>
> On 12/15/2011 02:42 PM, Attila Megyeri wrote:
>> Hi All,
>>
>>
>>
>> Some time ago I exchanged a couple of posts with you here regarding
>> Mysql active-active HA.
>>
>> The best solution I found so far was the Mysql multi-master
>> replication, also referred to as circular replication.
>>
>>
>>
>> Basically I set up two nodes, both were capable of the master role,
>> and the changes were immediately propagated to the other node.
>>
>>
>>
>> But I still wanted an M/S approach, with a RW master and a RO
>> slave - mainly because I prefer to have a single master VIP that my
>> apps can connect to.
>>
>>
>>
>> (In the first approach I configured a two-node clone, and the master
>> IP was always bound to one of the nodes.)
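>>
>> For context, the circular replication underneath is plain MySQL
>> master-master: each node gets a distinct server-id and a distinct
>> auto-increment offset. The my.cnf fragment on db1 looks roughly like
>> the following (db2 uses server-id = 2 and auto-increment-offset = 2;
>> the exact values and paths here are only illustrative):
>>
>> server-id                = 1
>> log_bin                  = /var/log/mysql/mysql-bin.log
>> auto-increment-increment = 2
>> auto-increment-offset    = 1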
>>
>>
>>
>> I applied the following configuration:
>>
>>
>>
>> node db1 \
>>     attributes IP="10.100.1.31" \
>>     attributes standby="off" db2-log-file-db-mysql="mysql-bin.000021" db2-log-pos-db-mysql="40730"
>> node db2 \
>>     attributes IP="10.100.1.32" \
>>     attributes standby="off"
>> primitive db-ip-master ocf:heartbeat:IPaddr2 \
>>     params lvs_support="true" ip="10.100.1.30" cidr_netmask="8" broadcast="10.255.255.255" \
>>     op monitor interval="20s" timeout="20s" \
>>     meta target-role="Started"
>> primitive db-mysql ocf:heartbeat:mysql \
>>     params binary="/usr/bin/mysqld_safe" config="/etc/mysql/my.cnf" \
>>         datadir="/var/lib/mysql" user="mysql" pid="/var/run/mysqld/mysqld.pid" \
>>         socket="/var/run/mysqld/mysqld.sock" test_passwd="XXXXX" \
>>         test_table="replicatest.connectioncheck" test_user="slave_user" \
>>         replication_user="slave_user" replication_passwd="XXXXX" \
>>         additional_parameters="--skip-slave-start" \
>>     op start interval="0" timeout="120s" \
>>     op stop interval="0" timeout="120s" \
>>     op monitor interval="30" timeout="30s" OCF_CHECK_LEVEL="1" \
>>     op promote interval="0" timeout="120" \
>>     op demote interval="0" timeout="120"
>> ms db-ms-mysql db-mysql \
>>     meta notify="true" master-max="1" clone-max="2" target-role="Started"
>> colocation db-ip-with-master inf: db-ip-master db-ms-mysql:Master
>> property $id="cib-bootstrap-options" \
>>     dc-version="1.1.5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f" \
>>     cluster-infrastructure="openais" \
>>     expected-quorum-votes="2" \
>>     stonith-enabled="false" \
>>     no-quorum-policy="ignore"
>> rsc_defaults $id="rsc-options" \
>>     resource-stickiness="0"
>>
>>
>>
>>
>>
>> The setup works in the basic conditions:
>>
>> * After the "first" startup, nodes start up as slaves, and
>> shortly after, one of them is promoted to master.
>>
>> * Updates to the master are replicated properly to the slave.
>>
>> * Slave accepts updates, which is wrong, but I can live with
>> this - I will only allow connections to the Master VIP.
>>
>> * If I stop the slave for some time, and re-start it, it will
>> catch up with the master shortly and get into sync.
>>
>>
>>
>> I have, however a serious issue:
>>
>> * If I stop the current master, the slave is promoted, accepts
>> RW queries, the Master IP is bound to it - ALL fine.
>>
>> * BUT - when I want to bring the other node online, it simply
>> shows: Stopped (not installed)
>>
>>
>>
>> Online: [ db1 db2 ]
>>
>> db-ip-master (ocf::heartbeat:IPaddr2): Started db1
>> Master/Slave Set: db-ms-mysql [db-mysql]
>>     Masters: [ db1 ]
>>     Stopped: [ db-mysql:1 ]
>>
>> Node Attributes:
>> * Node db1:
>>     + IP                    : 10.100.1.31
>>     + db2-log-file-db-mysql : mysql-bin.000021
>>     + db2-log-pos-db-mysql  : 40730
>>     + master-db-mysql:0     : 3601
>> * Node db2:
>>     + IP                    : 10.100.1.32
>>
>> Failed actions:
>>     db-mysql:0_monitor_30000 (node=db2, call=58, rc=5, status=complete): not installed
>>
>
> Looking at the RA (latest from git) I'd say the problem is somewhere in the check_slave() function: either the check for replication errors or the check for a too-high slave lag ... though in both cases you should see log entries.
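>
> A quick way to narrow that down is to look at the slave status directly
> on db2 and see which field check_slave() would trip over, e.g.:
>
> mysql -u slave_user -p -e "SHOW SLAVE STATUS\G" | egrep 'Running|Last_Err|Seconds_Behind'
>
> And as far as I know rc=5 ("not installed") is treated as a hard error,
> so once the cause is fixed the failed monitor has to be cleaned up before
> Pacemaker will start the slave on db2 again, e.g. with crmsh:
>
> crm resource cleanup db-ms-mysql db2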
>
> Regards,
> Andreas
>
> --
> Need help with Pacemaker?
> http://www.hastexo.com/now
>
>
>>
>>
>>
>>
>> I checked the logs, and could not find a reason why the slave at db2
>> is not started.
>>
>> Any idea, anyone?
>>
>>
>>
>>
>>
>> Thanks,
>>
>> Attila
>>
>>
>>
>> _______________________________________________
>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org Getting started:
>> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org