[Pacemaker] Pacemaker and drbd problem
Andrew Beekhof
andrew at beekhof.net
Thu Feb 4 12:04:48 UTC 2010
On Tue, Jan 26, 2010 at 9:47 PM, <subscriptions at ludmata.info> wrote:
> Hi,
>
> I have a two-node cluster running Master/Slave DRBD and fs resources (there
> will be more resources later). Here are the details of the software I'm
> using: Debian 5.0.3 stable, DRBD 8.3.7 compiled from source, and corosync
> 1.1.2 and Pacemaker 1.0.6 installed from the madkiss repository.
>
> I have a problem that I haven't been able to solve after two days and a lot
> of digging on the internet. The cluster works correctly in all cases but
> one: node1 runs the Primary DRBD and has the fs resource mounted, then I
> simulate power loss (pull the power cord of node1) and node2 takes over all
> resources, promotes DRBD and mounts the fs (so far so good). Then I
> simulate power loss again by unplugging the power cord of node2. Then I
> power on node1; it boots, loads its stuff, starts corosync, and then the
> cluster resource manager promotes DRBD to Primary on node1 (it should not!).
But you told it to:
no-quorum-policy="ignore"
And you prevented the cluster from being sure that the other side
didn't already have drbd promoted:
stonith-enabled="false"
Basically you created a split-brain condition and turned off the
options that might have prevented data corruption :-)
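
A minimal sketch of turning that safeguard back on, assuming the crm
shell that ships with pacemaker 1.0.x (an actual stonith resource for
your hardware is still needed; one possibility is sketched further down):

  # re-enable fencing; a stonith resource must also be defined before
  # the cluster can act on node failures again
  crm configure property stonith-enabled="true"

With only two nodes there is never quorum after a failure, so
no-quorum-policy="ignore" usually has to stay, which makes fencing the
only thing standing between you and this scenario.
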
> That is a disaster,
> because I intend to run an SQL database on that cluster and that way I
> might lose a huge amount of data. I also have an ancient two-node cluster
> running heartbeat 1 and DRBD 6.x with the drbddisk resource, and its
> behavior in that case is to stop and ask "My data may be outdated, are you
> sure you want to continue?". I tried the same scenario without the cluster
> engine (that is the old way, isn't it?) - enabled the DRBD init scripts and
> repeated the same steps. In that particular case DRBD stopped, waited for
> the other node and asked if I wanted to continue (good boy, that is exactly
> what I want!). So my problem must be somewhere in the configuration of the
> resources, but I can't understand what I'm doing wrong. So, let me ask
> straight: how do I do this in Pacemaker? I just want node1 to stop and wait
> for my confirmation of what to do, or something of that sort, but never
> ever promote DRBD to Master!
>
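
The "stop and wait for a human" behaviour you describe is roughly what
the meatware stonith plugin from cluster-glue provides: the surviving
node blocks recovery until an operator confirms that the other node is
really powered off. A rough, untested sketch in crm syntax, assuming
that plugin is installed and db1/db2 are your node names:

  primitive st-manual stonith:meatware \
          params hostlist="db1 db2" \
          op monitor interval="60s"
  clone cl-st-manual st-manual

After a failure you confirm on the survivor with "meatclient -c
<nodename>" before the cluster carries on, which is about as close to
"ask me first" as pacemaker gets.
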
> If somebody wonders why I test this scenario, let me explain: my company
> owns an APC Smart UPS which, in case of power loss, shuts down one node of
> each cluster (we have two pairs of clusters in separate VLANs, so I can't
> create a 4-node cluster, which would solve this problem, at least
> partially) after the battery falls below a certain level. If the battery
> runs below the critical level, the UPS kills all servers but two: our
> logging server and one of the DB nodes. If the power doesn't come back, the
> UPS then kills that last node too. The only machine that waits for its own
> death is our logging server. When the power comes back, the UPS starts all
> servers that are down. If that happens when all nodes are down, we end up
> in the following situation: the first node that comes up becomes
> SyncSource, and that node may not be the last one that survived the UPS
> rage.
>
> One of the possible solutions is to use the old heartbeat resource agent
> drbddisk, which relies on the DRBD init script. But I don't like it :)
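
For reference, a drbddisk-based setup would look roughly like this in
crm syntax. A sketch only: heartbeat-class agents take positional
parameters, so the resource name is assumed to be passed as parameter
"1", and DRBD itself would still be started by its init script, which is
what keeps the wait-for-peer prompt you like:

  primitive drbddisk-db heartbeat:drbddisk \
          params 1="drbd0"
  colocation fs-on-drbddisk inf: fs-db drbddisk-db
  order or-drbddisk-bf-fs inf: drbddisk-db:start fs-db:start
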
>
> Here are my configs:
>
> corosync.conf
> totem {
> version: 2
> token: 3000
> token_retransmits_before_loss_const: 10
> join: 60
> consensus: 1500
> vsftype: none
> max_messages: 20
> clear_node_high_bit: yes
> secauth: off
> threads: 0
> rrp_mode: passive
> #external interface
> interface {
> ringnumber: 0
> bindnetaddr: 10.0.30.0
> mcastaddr: 226.94.1.1
> mcastport: 5405
> }
> #internal interface
> interface {
> ringnumber: 1
> bindnetaddr: 10.2.2.0
> mcastaddr: 226.94.2.1
> mcastport: 5405
> }
> }
> amf {
> mode: disabled
> }
> service {
> ver: 0
> name: pacemaker
> }
> aisexec {
> user: root
> group: root
> }
> logging {
> fileline: off
> to_stderr: yes
> to_logfile: no
> to_syslog: yes
> syslog_facility: daemon
> debug: off
> timestamp: on
> logger_subsys {
> subsys: AMF
> debug: off
> tags: enter|leave|trace1|trace2|trace3|trace4|trace6
> }
> }
> -------------
> drbd.conf
>
> common {
> syncer { rate 100M; }
> }
> resource drbd0 {
> protocol C;
> handlers {
> fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
> after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
> pri-on-incon-degr "echo o > /proc/sysrq-trigger ; halt -f";
> pri-lost-after-sb "echo o > /proc/sysrq-trigger ; halt -f";
> local-io-error "echo o > /proc/sysrq-trigger ; halt -f";
> outdate-peer "/usr/lib/heartbeat/drbd-peer-outdater -t 5";
> pri-lost "echo pri-lost. Have a look at the log files. | mail -s 'DRBD
> Alert' root";
> }
> startup { wfc-timeout 0; degr-wfc-timeout 0; }
>
> disk { on-io-error detach; fencing resource-only;}
>
> net {
> sndbuf-size 1024k;
> timeout 20; # 2 seconds (unit = 0.1 seconds)
> connect-int 10; # 10 seconds (unit = 1 second)
> ping-int 3; # 3 seconds (unit = 1 second)
> ping-timeout 5; # 500 ms (unit = 0.1 seconds)
> ko-count 4;
> cram-hmac-alg "sha1";
> shared-secret "password";
> after-sb-0pri disconnect;
> after-sb-1pri disconnect;
> after-sb-2pri disconnect;
> rr-conflict disconnect;
> }
>
> syncer { rate 100M; }
>
> on db1 {
> device /dev/drbd0;
> disk /dev/db/db;
> address 10.2.2.1:7788;
> flexible-meta-disk internal;
> }
>
> on db2 {
> device /dev/drbd0;
> disk /dev/db/db;
> address 10.2.2.2:7788;
> meta-disk internal;
> }
> }
> -----------------------
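
For what it is worth: once real fencing exists on the pacemaker side,
DRBD 8.3 also offers a stricter disk fencing policy than the
resource-only used above. A hedged sketch of that change to drbd.conf:

  disk {
          on-io-error detach;
          # freeze I/O and call the fence-peer handler (crm-fence-peer.sh
          # above) when the peer is lost; I/O resumes only after the
          # handler reports the peer as fenced or outdated
          fencing resource-and-stonith;
  }

It is meant to be combined with working stonith on the cluster side,
hence the name.
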
> crm:
> crm(live)# configure show
> node db1 \
> attributes standby="off"
> node db2 \
> attributes standby="off"
> primitive drbd-db ocf:linbit:drbd \
> params drbd_resource="drbd0" \
> op monitor interval="15s" role="Slave" timeout="30" \
> op monitor interval="16s" role="Master" timeout="30"
> primitive fs-db ocf:heartbeat:Filesystem \
> params fstype="ext3" directory="/db" device="/dev/drbd0"
> primitive ip-dbclust.v52 ocf:heartbeat:IPaddr2 \
> params ip="10.0.30.211" broadcast="10.0.30.255" nic="eth1"
> cidr_netmask="24" \
> op monitor interval="21s" timeout="5s"
> ms ms-db drbd-db \
> meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1"
> notify="true" target-role="Started"
> location drbd-fence-by-handler-ms-db ms-db \
> rule $id="drbd-fence-by-handler-rule-ms-db" $role="Master" -inf: #uname
> ne db1
> location lo-ms-db ms-db \
> rule $id="ms-db-loc-rule" -inf: #uname ne db1 and #uname ne db2
> colocation fs-on-drbd0 inf: fs-db ms-db:Master
> colocation ip-on-drbd0 inf: ip-dbclust.v52 ms-db:Master
> order or-drbd-bf-fs inf: ms-db:promote fs-db:start
> order or-drbd-bf-ip inf: ms-db:promote ip-dbclust.v52:start
> property $id="cib-bootstrap-options" \
> no-quorum-policy="ignore" \
> stonith-enabled="false" \
> expected-quorum-votes="2" \
> last-lrm-refresh="1264523323" \
> dc-version="1.0.6-cebe2b6ff49b36b29a3bd7ada1c4701c7470febe" \
> cluster-infrastructure="openais"
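
Side note: the drbd-fence-by-handler-ms-db constraint visible above is
the one crm-fence-peer.sh adds while the peer is unreachable, and
crm-unfence-peer.sh removes again after a successful resync. If it ever
sticks around and blocks promotion, it can be inspected and, once you
are certain the data is in sync, removed by hand; a sketch (crm shell,
use with care):

  crm configure show drbd-fence-by-handler-ms-db
  crm configure delete drbd-fence-by-handler-ms-db
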
>
> I hope somebody can help me, I am completely lost :(
>