[Pacemaker] Issues with HA cluster for mysqld
David Parker
dparker at utica.edu
Thu Aug 23 16:56:33 UTC 2012
On 08/23/2012 10:17 AM, David Parker wrote:
> On 08/23/2012 09:01 AM, Jake Smith wrote:
>> ----- Original Message -----
>>> From: "David Parker"<dparker at utica.edu>
>>> To: pacemaker at oss.clusterlabs.org
>>> Sent: Wednesday, August 22, 2012 2:49:32 PM
>>> Subject: [Pacemaker] Issues with HA cluster for mysqld
>>>
>>> Hello,
>>>
>>> I'm trying to set up a 2-node, active-passive HA cluster for MySQL
>>> using heartbeat and Pacemaker. The operating system is Debian Linux
>>> 6.0.5 64-bit, and I am using the heartbeat packages installed via
>>> apt-get. The servers involved are the SQL nodes of a running MySQL
>>> cluster, so the only service I need HA for is the MySQL daemon
>>> (mysqld).
>>>
>>> What I would like to do is have a single virtual IP address which
>>> clients use to query MySQL, and have the IP and mysqld fail over to
>>> the passive node in the event of a failure on the active node. I have
>>> read through a lot of the heartbeat and Pacemaker documentation, and
>>> here are the resources I have configured for the cluster:
>>>
>>> * A custom LSB script for mysqld (compliant with Pacemaker's
>>>   requirements as outlined in the documentation)
>>> * An iLO2-based STONITH device using riloe (both servers are HP
>>>   Proliant DL380 G5)
>>> * A virtual IP address for mysqld using IPaddr2
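>>>
>>> For reference, the primitive definitions look roughly like this (the
>>> IP address is a placeholder, and I've left out the iLO address and
>>> login parameters):
>>>
>>> <primitive id="mysqld" class="lsb" type="mysqld"/>
>>> <primitive id="MysqlIP" class="ocf" provider="heartbeat" type="IPaddr2">
>>>   <instance_attributes id="MysqlIP-attrs">
>>>     <nvpair id="MysqlIP-ip" name="ip" value="192.168.0.50"/>
>>>   </instance_attributes>
>>> </primitive>
>>> <primitive id="stonith" class="stonith" type="external/riloe">
>>>   <instance_attributes id="stonith-attrs">
>>>     <nvpair id="stonith-hostlist" name="hostlist" value="ha1 ha2"/>
>>>     <!-- iLO address and login parameters omitted -->
>>>   </instance_attributes>
>>> </primitive>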
>>>
>>> I believe I have configured everything correctly, but I'm not
>>> positive. Anyway, when I start heartbeat and Pacemaker
>>> (/etc/init.d/heartbeat start), everything seems to be ok. However,
>>> the virtual IP never comes up, and the output of "crm_resource -LV"
>>> indicates that something is wrong:
>>>
>>> root at ha1:~# crm_resource -LV
>>> crm_resource[28988]: 2012/08/22_14:41:23 WARN: unpack_rsc_op: Processing failed op stonith_start_0 on ha1: unknown error (1)
>>> stonith (stonith:external/riloe) Started
>>> MysqlIP (ocf::heartbeat:IPaddr2) Stopped
>>> mysqld (lsb:mysqld) Started
>> It looks like you only have one STONITH resource defined... you need
>> one per server (or you could clone a single one, but that usually
>> applies to blades, not standalone servers). Then you would add
>> location constraints so that ha1's stonith can't run on ha1 and ha2's
>> stonith can't run on ha2 (a node can't shoot itself). That way each
>> server has the ability to STONITH the other. Nothing *should* run if
>> your stonith fails and you have stonith enabled.
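>>
>> For example, with two riloe primitives named stonith-ha1 and
>> stonith-ha2 (the names here are just an example), the "can't shoot
>> yourself" part would look something like:
>>
>> <rsc_location id="stonith-ha1-not-on-ha1" rsc="stonith-ha1" node="ha1" score="-INFINITY"/>
>> <rsc_location id="stonith-ha2-not-on-ha2" rsc="stonith-ha2" node="ha2" score="-INFINITY"/>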
>>
>> HTH
>>
>> Jake
>
> Thanks! Can you clarify how I would go about putting those
> constraints in place? I've been following Andrew's "Configuration
> Explained" document, and I think I have a grasp on most of these
> things, but it's not clear to me how I can constrain a STONITH device
> to only one node. Also, following the example in the documentation, I
> added these location constraints to the other resources:
>
> <constraints>
> <rsc_location id="loc-1" rsc="MysqlIP" node="ha1" score="200"/>
> <rsc_location id="loc-2" rsc="MysqlIP" node="ha2" score="0"/>
> <rsc_location id="loc-3" rsc="mysqld" node="ha1" score="200"/>
> <rsc_location id="loc-4" rsc="mysqld" node="ha2" score="0"/>
> </constraints>
>
> I'm trying to make ha1 the preferred node for both mysqld and the
> virtual IP. Do these look correct for that?
>
>>> When I attempt to stop heartbeat and Pacemaker (/etc/init.d/heartbeat
>>> stop) it says "Stopping High-Availability services:" and then hangs
>>> for about 5 minutes before finally stopping the services.
>>>
>>> So, I'm left with a couple of questions. Is there something wrong
>>> with my configuration? Is there a reason why the HA services can't
>>> shut down in a timely manner? Is there something else I need to do to
>>> get the virtual IP working? Thanks in advance for any help!
>
> Would the misconfigured STONITH resources be causing the long shutdown
> delays?
>
Okay, I think I've almost got this. I updated my Pacemaker config and
made a few changes. I put the MysqlIP and mysqld primitives into a
resource group called "mysql-resources", ordered them such that mysqld
will always wait for MysqlIP to be ready first, and added constraints
to make ha1 the preferred host for the mysql-resources group and ha2
the failover host. I also created STONITH devices for both ha1 and ha2,
and added constraints to fix the STONITH location issues. My new
constraints section looks like this:
<constraints>
<rsc_location id="loc-1" rsc="stonith-ha1" node="ha2" score="INFINITY"/>
<rsc_location id="loc-2" rsc="stonith-ha2" node="ha1" score="INFINITY"/>
<rsc_location id="loc-3" rsc="stonith-ha1" node="ha1" score="-INFINITY"/>
<rsc_location id="loc-4" rsc="stonith-ha2" node="ha2" score="-INFINITY"/>
<rsc_location id="loc-5" rsc="mysql-resources" node="ha1" score="200"/>
<rsc_location id="loc-6" rsc="mysql-resources" node="ha2" score="0"/>
</constraints>
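
For reference, the group itself (in the resources section) looks roughly
like this, with the instance attributes trimmed; since the members of a
group start in the order they are listed, putting MysqlIP first is what
makes mysqld wait for the IP:

<group id="mysql-resources">
  <primitive id="MysqlIP" class="ocf" provider="heartbeat" type="IPaddr2"/>
  <primitive id="mysqld" class="lsb" type="mysqld"/>
</group>
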
Everything seems to work. I had the virtual IP and mysqld running on
ha1, and not on ha2. I shut down ha1 using "poweroff -n", and both the
virtual IP and mysqld came up on ha2 almost instantly. When I powered
ha1 on again, ha2 shut down the virtual IP and mysqld. The virtual IP
moved over instantly; a continuous ping of the IP produced one "Time to
live exceeded" message and lost one packet, but that's to be expected.
However, mysqld took almost 30 seconds to start up on ha1 after being
stopped on ha2, and I'm not exactly sure why.
Here's the relevant log output from ha2:
Aug 23 11:42:48 ha2 crmd: [1166]: info: te_rsc_command: Initiating action 16: stop mysqld_stop_0 on ha2 (local)
Aug 23 11:42:48 ha2 crmd: [1166]: info: do_lrm_rsc_op: Performing key=16:1:0:ec1989a8-ff84-4fc5-9f48-88e9b285797c op=mysqld_stop_0 )
Aug 23 11:42:48 ha2 lrmd: [1163]: info: rsc:mysqld:10: stop
Aug 23 11:42:50 ha2 lrmd: [1163]: info: RA output: (mysqld:stop:stdout) Stopping MySQL daemon: mysqld_safe.
Aug 23 11:42:50 ha2 crmd: [1166]: info: process_lrm_event: LRM operation mysqld_stop_0 (call=10, rc=0, cib-update=57, confirmed=true) ok
Aug 23 11:42:50 ha2 crmd: [1166]: info: match_graph_event: Action mysqld_stop_0 (16) confirmed on ha2 (rc=0)
And here's the relevant log output from ha1:
Aug 23 11:42:47 ha1 crmd: [1243]: info: do_lrm_rsc_op: Performing key=8:1:7:ec1989a8-ff84-4fc5-9f48-88e9b285797c op=mysqld_monitor_0 )
Aug 23 11:42:47 ha1 lrmd: [1240]: info: rsc:mysqld:5: probe
Aug 23 11:42:47 ha1 crmd: [1243]: info: process_lrm_event: LRM operation mysqld_monitor_0 (call=5, rc=7, cib-update=10, confirmed=true) not running
Aug 23 11:43:36 ha1 crmd: [1243]: info: do_lrm_rsc_op: Performing key=11:3:0:ec1989a8-ff84-4fc5-9f48-88e9b285797c op=mysqld_start_0 )
Aug 23 11:43:36 ha1 lrmd: [1240]: info: rsc:mysqld:11: start
Aug 23 11:43:36 ha1 lrmd: [1240]: info: RA output: (mysqld:start:stdout) Starting MySQL daemon: mysqld_safe.#012(See /usr/local/mysql/data/mysql.messages for messages).
Aug 23 11:43:36 ha1 crmd: [1243]: info: process_lrm_event: LRM operation mysqld_start_0 (call=11, rc=0, cib-update=18, confirmed=true) ok
So, ha2 stopped mysqld at 11:42:50, but ha1 didn't start mysqld until
11:43:36, a full 46 seconds after it was stopped on ha2. Any ideas why
the delay for mysqld was so long, when the MysqlIP resource moved almost
instantly?
Thanks!
Dave
--
Dave Parker
Systems Administrator
Utica College
Integrated Information Technology Services
(315) 792-3229
Registered Linux User #408177