[Pacemaker] Master won't get promoted

Dejan Muhamedagic dejanmm at fastmail.fm
Thu Sep 29 09:25:01 EDT 2011


Hi,

On Thu, Sep 29, 2011 at 09:30:55AM -0300, Charles Richard wrote:
> Here it is attached.
> 
> I also see the following two errors in the node 2 logs, which I assume mean
> the real problem is that node1 is not getting demoted, though I'm not sure why:
> 
> Error 1:
> Sep 28 19:53:20 staging2 drbd[8587]: ERROR: mysqld: Called drbdadm -c /etc/drbd.conf primary mysqld
> Sep 28 19:53:20 staging2 drbd[8587]: ERROR: mysqld: Exit code 11
> Sep 28 19:53:20 staging2 drbd[8587]: ERROR: mysqld: Command output:
> Sep 28 19:53:20 staging2 lrmd: [1442]: info: RA output: (drbd_mysql:1:promote:stdout)
> Sep 28 19:53:22 staging2 lrmd: [1442]: info: RA output: (drbd_mysql:1:promote:stderr) 0: State change failed: (-1) Multiple primaries not allowed by config
> 
> Error 2:
> Sep 28 19:53:27 staging2 kernel: d-con mysqld: Requested state change failed by peer: Refusing to be Primary while peer is not outdated (-7)
> Sep 28 19:53:27 staging2 kernel: d-con mysqld: peer( Primary -> Unknown ) conn( Connected -> Disconnecting ) disk( UpToDate -> Outdated ) pdsk( UpToDate -> DUnknown )
> Sep 28 19:53:27 staging2 kernel: d-con mysqld: meta connection shut down by peer.
> 
> Also, failover works fine if I reboot either machine.  The outdated machine
> comes back up as secondary.  The scenario where I get the errors above is
> when I pull the network cable from the primary.  Is it a stonith device
> that should be protecting against this scenario and potentially rebooting
> the primary?

Yes. That's the only way for the cluster to stay sane in the case of
a split-brain caused by pulling the network cable.
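
For example, a minimal stonith setup for your two nodes could look
roughly like this (a sketch only: external/ipmi is just one possible
agent, and the IPMI addresses and credentials below are placeholders
you would replace with values from your environment):

```
primitive stonith-staging1 stonith:external/ipmi \
        params hostname="staging1.dev.applepeak.com" \
               ipaddr="10.10.10.101" userid="admin" passwd="secret" \
               interface="lan" \
        op monitor interval="60s"
primitive stonith-staging2 stonith:external/ipmi \
        params hostname="staging2.dev.applepeak.com" \
               ipaddr="10.10.10.102" userid="admin" passwd="secret" \
               interface="lan" \
        op monitor interval="60s"
location l-stonith-1 stonith-staging1 -inf: staging1.dev.applepeak.com
location l-stonith-2 stonith-staging2 -inf: staging2.dev.applepeak.com
property stonith-enabled="true"
```

The location constraints keep each stonith resource off the node it is
meant to shoot, and stonith-enabled has to be flipped from "false" to
"true" in your cluster options. On the DRBD side you would normally
also set a fencing policy (e.g. fencing resource-and-stonith;) together
with the crm-fence-peer.sh / crm-unfence-peer.sh handlers shipped with
DRBD, so that a primary can outdate a disconnected peer instead of both
sides refusing promotion, as in the errors you quoted.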

Thanks,

Dejan

> Feels like I'm getting so close to getting this working!
> 
> Thanks!
> Charles
> 
> On Thu, Sep 29, 2011 at 4:15 AM, Andrew Beekhof <andrew at beekhof.net> wrote:
> 
> > Could you attach /var/lib/pengine/pe-input-3802.bz2 from staging1?
> > That would tell us why.
> >
> > On Mon, Sep 26, 2011 at 10:28 PM, Charles Richard
> > <chachi.richard at gmail.com> wrote:
> > > Hi,
> > >
> > > I'm making some headway finally with my pacemaker install, but now that
> > > crm_mon doesn't return errors any more and crm_verify is clear, I'm
> > > having a problem where my master won't get promoted.  Not sure what to
> > > do with this one, any suggestions?  Here's the log snippet and config files:
> > >
> > > Sep 26 04:06:12 staging1 crmd: [1686]: info: crm_timer_popped: PEngine Recheck Timer (I_PE_CALC) just popped!
> > > Sep 26 04:06:12 staging1 crmd: [1686]: info: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
> > > Sep 26 04:06:12 staging1 crmd: [1686]: info: do_state_transition: Progressed to state S_POLICY_ENGINE after C_TIMER_POPPED
> > > Sep 26 04:06:12 staging1 crmd: [1686]: info: do_state_transition: All 2 cluster nodes are eligible to run resources.
> > > Sep 26 04:06:12 staging1 crmd: [1686]: info: do_pe_invoke: Query 106: Requesting the current CIB: S_POLICY_ENGINE
> > > Sep 26 04:06:12 staging1 crmd: [1686]: info: do_pe_invoke_callback: Invoking the PE: query=106, ref=pe_calc-dc-1317020772-95, seq=2564, quorate=1
> > > Sep 26 04:06:12 staging1 pengine: [1685]: info: unpack_config: Startup probes: enabled
> > > Sep 26 04:06:12 staging1 pengine: [1685]: notice: unpack_config: On loss of CCM Quorum: Ignore
> > > Sep 26 04:06:12 staging1 pengine: [1685]: info: unpack_config: Node scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
> > > Sep 26 04:06:12 staging1 pengine: [1685]: info: unpack_domains: Unpacking domains
> > > Sep 26 04:06:12 staging1 pengine: [1685]: info: determine_online_status: Node staging1.dev.applepeak.com is online
> > > Sep 26 04:06:12 staging1 pengine: [1685]: info: determine_online_status: Node staging2.dev.applepeak.com is online
> > > Sep 26 04:06:12 staging1 pengine: [1685]: notice: group_print:  Resource Group: mysql
> > > Sep 26 04:06:12 staging1 pengine: [1685]: notice: native_print: fs_mysql#011(ocf::heartbeat:Filesystem):#011Stopped
> > > Sep 26 04:06:12 staging1 pengine: [1685]: notice: native_print: ip_mysql#011(ocf::heartbeat:IPaddr2):#011Stopped
> > > Sep 26 04:06:12 staging1 pengine: [1685]: notice: native_print: mysqld#011(lsb:mysqld):#011Stopped
> > > Sep 26 04:06:12 staging1 pengine: [1685]: notice: clone_print:  Master/Slave Set: ms_drbd_mysql
> > > Sep 26 04:06:12 staging1 pengine: [1685]: notice: short_print:      Stopped: [ drbd_mysql:0 drbd_mysql:1 ]
> > > Sep 26 04:06:12 staging1 pengine: [1685]: info: master_color: ms_drbd_mysql: Promoted 0 instances of a possible 1 to master
> > > Sep 26 04:06:12 staging1 pengine: [1685]: info: native_merge_weights: fs_mysql: Rolling back scores from ip_mysql
> > > Sep 26 04:06:12 staging1 pengine: [1685]: info: native_merge_weights: ip_mysql: Rolling back scores from mysqld
> > > Sep 26 04:06:12 staging1 pengine: [1685]: info: master_color: ms_drbd_mysql: Promoted 0 instances of a possible 1 to master
> > > Sep 26 04:06:12 staging1 pengine: [1685]: notice: LogActions: Leave resource fs_mysql#011(Stopped)
> > > Sep 26 04:06:12 staging1 pengine: [1685]: notice: LogActions: Leave resource ip_mysql#011(Stopped)
> > > Sep 26 04:06:12 staging1 pengine: [1685]: notice: LogActions: Leave resource mysqld#011(Stopped)
> > > Sep 26 04:06:12 staging1 pengine: [1685]: notice: LogActions: Leave resource drbd_mysql:0#011(Stopped)
> > > Sep 26 04:06:12 staging1 pengine: [1685]: notice: LogActions: Leave resource drbd_mysql:1#011(Stopped)
> > > Sep 26 04:06:12 staging1 crmd: [1686]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
> > > Sep 26 04:06:12 staging1 crmd: [1686]: info: unpack_graph: Unpacked transition 72: 0 actions in 0 synapses
> > > Sep 26 04:06:12 staging1 crmd: [1686]: info: do_te_invoke: Processing graph 72 (ref=pe_calc-dc-1317020772-95) derived from /var/lib/pengine/pe-input-3802.bz2
> > > Sep 26 04:06:12 staging1 crmd: [1686]: info: run_graph: ====================================================
> > > Sep 26 04:06:12 staging1 crmd: [1686]: notice: run_graph: Transition 72 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-3802.bz2): Complete
> > > Sep 26 04:06:12 staging1 crmd: [1686]: info: te_graph_trigger: Transition 72 is now complete
> > > Sep 26 04:06:12 staging1 crmd: [1686]: info: notify_crmd: Transition 72 status: done - <null>
> > > Sep 26 04:06:12 staging1 crmd: [1686]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
> > > Sep 26 04:06:12 staging1 crmd: [1686]: info: do_state_transition: Starting PEngine Recheck Timer
> > > Sep 26 04:06:12 staging1 pengine: [1685]: info: process_pe_message: Transition 72: PEngine Input stored in: /var/lib/pengine/pe-input-3802.bz2
> > > Sep 26 04:15:09 staging1 cib: [1682]: info: cib_stats: Processed 1 operations (0.00us average, 0% utilization) in the last 10min
> > >
> > > My drbd config file:
> > >
> > > resource mysqld {
> > >         protocol C;
> > >         startup { wfc-timeout 0; degr-wfc-timeout 120; }
> > >         disk { on-io-error detach; }
> > >
> > >         on staging1 {
> > >                 device    /dev/drbd0;
> > >                 disk      /dev/vg_staging1/lv_data;
> > >                 meta-disk internal;
> > >                 address   10.10.20.1:7788;
> > >         }
> > >
> > >         on staging2 {
> > >                 device    /dev/drbd0;
> > >                 disk      /dev/vg_staging2/lv_data;
> > >                 meta-disk internal;
> > >                 address   10.10.20.2:7788;
> > >         }
> > > }
> > >
> > > corosync.conf:
> > >
> > > compatibility: whitetank
> > >
> > > aisexec {
> > >   user: root
> > >   group: root
> > > }
> > >
> > > totem {
> > >         version: 2
> > >         secauth: off
> > >         threads: 0
> > >         interface {
> > >                 ringnumber: 0
> > >                 bindnetaddr: 10.10.10.0
> > >                 mcastaddr: 226.94.1.1
> > >                 mcastport: 5405
> > >         }
> > > }
> > >
> > > logging {
> > >         fileline: off
> > >         to_stderr: no
> > >         to_logfile: no
> > >         to_syslog: yes
> > >         logfile: /var/log/cluster/corosync.log
> > >         debug: off
> > >         timestamp: on
> > >         logger_subsys {
> > >                 subsys: AMF
> > >                 debug: off
> > >         }
> > > }
> > >
> > > amf {
> > >         mode: disabled
> > > }
> > >
> > > service {
> > >         # Load Pacemaker
> > >         name: pacemaker
> > >         ver: 0
> > >         use_mgmtd: yes
> > > }
> > >
> > > And my crm config:
> > >
> > > node staging1.dev.applepeak.com
> > > node staging2.dev.applepeak.com
> > > primitive drbd_mysql ocf:linbit:drbd \
> > >         params drbd_resource="mysqld" \
> > >         op monitor interval="15s" \
> > >         op start interval="0" timeout="240s" \
> > >         op stop interval="0" timeout="100s"
> > > primitive fs_mysql ocf:heartbeat:Filesystem \
> > >         params device="/dev/drbd0" directory="/opt/data/mysql/data/mysql" fstype="ext4" \
> > >         op start interval="0" timeout="60s" \
> > >         op stop interval="0" timeout="60s"
> > > primitive ip_mysql ocf:heartbeat:IPaddr2 \
> > >         params ip="10.10.10.31" nic="eth0"
> > > primitive mysqld lsb:mysqld
> > > group mysql fs_mysql ip_mysql mysqld
> > > ms ms_drbd_mysql drbd_mysql \
> > >         meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
> > > colocation mysql_on_drbd inf: mysql ms_drbd_mysql:Master
> > > order mysql_after_drbd inf: ms_drbd_mysql:promote mysql:start
> > > property $id="cib-bootstrap-options" \
> > >         dc-version="1.1.2-f059ec7ced7a86f18e5490b67ebf4a0b963bccfe" \
> > >         cluster-infrastructure="openais" \
> > >         expected-quorum-votes="2" \
> > >         stonith-enabled="false" \
> > >         last-lrm-refresh="1316961847" \
> > >         stop-all-resources="true" \
> > >         no-quorum-policy="ignore"
> > > rsc_defaults $id="rsc-options" \
> > >         resource-stickiness="100"
> > >
> > > Thanks,
> > > Charles
> > >
> > > _______________________________________________
> > > Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> > > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> > >
> > > Project Home: http://www.clusterlabs.org
> > > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > > Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
> > >
> > >
> >






