[Pacemaker] Problems shutting down server/services
Erich Weiler
weiler at soe.ucsc.edu
Wed Jun 16 18:09:13 UTC 2010
Hi All,
I'm having a weird issue that I was hoping someone could shed some light
on...
I've got Pacemaker 1.0.8 and Corosync 1.2.2 on a two-node cluster. I'm
managing some VMs via VirtualDomain, with DRBD underneath. It mostly
works, but I'm having a really hard time shutting down services, or even
shutting down one of the nodes. As I understand it, if I issue the
shutdown command on one node, Pacemaker should try to stop the services
on that node and move them to the other node (if so configured), then
power down.
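(For what it's worth, the manual equivalent of what I expect a reboot to
do would be something like this; just a sketch using the crm shell that
ships with Pacemaker 1.0.x:
# crm node standby hgvmcore1
# crm_mon -1          <- wait until everything is running on hgvmcore2
# shutdown -r now
My understanding, though, is that a plain reboot should handle the
stop/migrate part on its own.)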
Normally crm_mon shows this:
============
Last updated: Wed Jun 16 10:23:14 2010
Stack: openais
Current DC: hgvmcore2 - partition with quorum
Version: 1.0.8-9881a7350d6182bae9e8e557cf20a3cc5dac3ee7
2 Nodes configured, 2 expected votes
4 Resources configured.
============
Online: [ hgvmcore1 hgvmcore2 ]
Master/Slave Set: ms-drbd-caweb
Masters: [ hgvmcore1 ]
Slaves: [ hgvmcore2 ]
caweb-vd (ocf::heartbeat:VirtualDomain): Started hgvmcore1
Master/Slave Set: ms-drbd-genome-nt
Masters: [ hgvmcore2 ]
Slaves: [ hgvmcore1 ]
genome-nt-vd (ocf::heartbeat:VirtualDomain): Started hgvmcore2
Looks good. But when I try to reboot hgvmcore1, it does not go down, and
I see this in the logs:
Jun 16 10:30:16 hgvmcore1 shutdown[20339]: shutting down for system reboot
Jun 16 10:30:16 hgvmcore1 init: Switching to runlevel: 6
Jun 16 10:30:17 hgvmcore1 crm_attribute: [20398]: info: Invoked:
crm_attribute -N hgvmcore1 -n master-drbd-genome-nt:1 -l reboot -v 100
Jun 16 10:30:17 hgvmcore1 smartd[16139]: smartd received signal 15:
Terminated
Jun 16 10:30:17 hgvmcore1 smartd[16139]: smartd is exiting (exit status 0)
Jun 16 10:30:18 hgvmcore1 avahi-daemon[16031]: Got SIGTERM, quitting.
Jun 16 10:30:18 hgvmcore1 avahi-daemon[16031]: Leaving mDNS multicast
group on interface vnet1.IPv6 with address fe80::7817:6eff:fe1c:506f.
Jun 16 10:30:18 hgvmcore1 avahi-daemon[16031]: Leaving mDNS multicast
group on interface vnet0.IPv6 with address fe80::d465:2ff:fefa:73ff.
Jun 16 10:30:18 hgvmcore1 avahi-daemon[16031]: Leaving mDNS multicast
group on interface virbr0.IPv6 with address fe80::200:ff:fe00:0.
Jun 16 10:30:18 hgvmcore1 avahi-daemon[16031]: Leaving mDNS multicast
group on interface virbr0.IPv4 with address 192.168.122.1.
Jun 16 10:30:18 hgvmcore1 avahi-daemon[16031]: Leaving mDNS multicast
group on interface br3.IPv6 with address fe80::225:64ff:fefc:1e3b.
Jun 16 10:30:18 hgvmcore1 avahi-daemon[16031]: Leaving mDNS multicast
group on interface br3.IPv4 with address 128.114.48.135.
Jun 16 10:30:18 hgvmcore1 avahi-daemon[16031]: Leaving mDNS multicast
group on interface br2.IPv6 with address fe80::225:64ff:fefc:1e39.
Jun 16 10:30:18 hgvmcore1 avahi-daemon[16031]: Leaving mDNS multicast
group on interface br2.IPv4 with address 10.1.1.89.
Jun 16 10:30:18 hgvmcore1 avahi-daemon[16031]: Leaving mDNS multicast
group on interface eth3.IPv6 with address fe80::225:64ff:fefc:1e3b.
Jun 16 10:30:18 hgvmcore1 avahi-daemon[16031]: Leaving mDNS multicast
group on interface eth2.IPv6 with address fe80::225:64ff:fefc:1e39.
Jun 16 10:30:18 hgvmcore1 avahi-daemon[16031]: Leaving mDNS multicast
group on interface eth1.IPv6 with address fe80::210:18ff:fe67:db0a.
Jun 16 10:30:18 hgvmcore1 avahi-daemon[16031]: Leaving mDNS multicast
group on interface eth1.IPv4 with address 172.16.0.1.
Jun 16 10:30:18 hgvmcore1 libvirtd: 10:30:18.192: warning : Shutting
down on signal 15
Jun 16 10:30:18 hgvmcore1 corosync[15541]: [SERV ] Unloading all
Corosync service engines.
Jun 16 10:30:18 hgvmcore1 crmd: [15563]: info: crm_signal_dispatch:
Invoking handler for signal 15: Terminated
Jun 16 10:30:18 hgvmcore1 crmd: [15563]: info: crm_shutdown: Requesting
shutdown
Jun 16 10:30:18 hgvmcore1 corosync[15541]: [pcmk ] notice:
pcmk_shutdown: Shuting down Pacemaker
Jun 16 10:30:18 hgvmcore1 corosync[15541]: [pcmk ] notice:
stop_child: Sent -15 to crmd: [15563]
Jun 16 10:30:18 hgvmcore1 crmd: [15563]: info: do_shutdown_req: Sending
shutdown request to DC: hgvmcore2
Jun 16 10:30:18 hgvmcore1 attrd: [15561]: info: attrd_ais_dispatch:
Update relayed from hgvmcore2
Jun 16 10:30:18 hgvmcore1 attrd: [15561]: info: attrd_trigger_update:
Sending flush op to all hosts for: shutdown (1276709418)
Jun 16 10:30:18 hgvmcore1 attrd: [15561]: info: attrd_perform_update:
Sent update 48: shutdown=1276709418
Jun 16 10:30:18 hgvmcore1 lrmd: [15560]: info: cancel_op: operation
monitor[21] on ocf::VirtualDomain::caweb-vd for client 15563, its
parameters: CRM_meta_depth=[0] CRM_meta_interval=[10000] depth=[0]
hypervisor=[qemu:///system] CRM_meta_timeout=[30000]
crm_feature_set=[3.0.1] CRM_meta_name=[monitor]
config=[/etc/libvirt/qemu/caweb.xml] cancelled
Jun 16 10:30:18 hgvmcore1 crmd: [15563]: info: do_lrm_rsc_op: Performing
key=39:448:0:13ca092e-78e7-4345-8728-7f8e8c2ddb29 op=caweb-vd_stop_0 )
Jun 16 10:30:18 hgvmcore1 lrmd: [15560]: info: rsc:caweb-vd:22: stop
Jun 16 10:30:18 hgvmcore1 crmd: [15563]: info: do_lrm_rsc_op: Performing
key=88:448:0:13ca092e-78e7-4345-8728-7f8e8c2ddb29 op=drbd-caweb:0_notify_0 )
Jun 16 10:30:18 hgvmcore1 lrmd: [15560]: info: rsc:drbd-caweb:0:23: notify
Jun 16 10:30:18 hgvmcore1 crmd: [15563]: info: do_lrm_rsc_op: Performing
key=96:448:0:13ca092e-78e7-4345-8728-7f8e8c2ddb29
op=drbd-genome-nt:1_notify_0 )
Jun 16 10:30:18 hgvmcore1 lrmd: [15560]: info: rsc:drbd-genome-nt:1:24:
notify
Jun 16 10:30:18 hgvmcore1 crmd: [15563]: info: process_lrm_event: LRM
operation caweb-vd_monitor_10000 (call=21, status=1, cib-update=0,
confirmed=true) Cancelled
Jun 16 10:30:18 hgvmcore1 lrmd: [15560]: info: RA output:
(caweb-vd:stop:stderr) error: unable to connect to
'/var/run/libvirt/libvirt-sock': Connection refused error: failed to
connect to the hypervisor
Jun 16 10:30:18 hgvmcore1 lrmd: [15560]: info: RA output:
(caweb-vd:stop:stderr) error:
Jun 16 10:30:18 hgvmcore1 lrmd: [15560]: info: RA output:
(caweb-vd:stop:stderr) unable to connect to
'/var/run/libvirt/libvirt-sock': Connection refused error: failed to
connect to the hypervisor
Jun 16 10:30:18 hgvmcore1 crmd: [15563]: info: process_lrm_event: LRM
operation drbd-caweb:0_notify_0 (call=23, rc=0, cib-update=32,
confirmed=true) ok
Jun 16 10:30:18 hgvmcore1 crmd: [15563]: info: process_lrm_event: LRM
operation drbd-genome-nt:1_notify_0 (call=24, rc=0, cib-update=33,
confirmed=true) ok
Jun 16 10:30:18 hgvmcore1 VirtualDomain[20485]: ERROR: Virtual domain
caweb has no state during stop operation, bailing out.
Jun 16 10:30:18 hgvmcore1 VirtualDomain[20485]: INFO: Issuing forced
shutdown (destroy) request for domain caweb.
Jun 16 10:30:18 hgvmcore1 lrmd: [15560]: info: RA output:
(caweb-vd:stop:stderr) error: unable to connect to
'/var/run/libvirt/libvirt-sock': Connection refused error:
Jun 16 10:30:18 hgvmcore1 lrmd: [15560]: info: RA output:
(caweb-vd:stop:stderr) failed to connect to the hypervisor
Jun 16 10:30:18 hgvmcore1 crmd: [15563]: info: process_lrm_event: LRM
operation caweb-vd_stop_0 (call=22, rc=1, cib-update=34, confirmed=true)
unknown error
Jun 16 10:30:18 hgvmcore1 lrmd: [15560]: info: cancel_op: operation
monitor[11] on ocf::drbd::drbd-genome-nt:1 for client 15563, its
parameters: CRM_meta_interval=[15000] CRM_meta_notify_stop_resource=[ ]
CRM_meta_notify_active_resource=[ ] CRM_meta_notify_slave_resource=[ ]
CRM_meta_notify_start_resource=[drbd-genome-nt:1 ]
CRM_meta_notify_active_uname=[ ] CRM_meta_notify_promote_resource=[ ]
CRM_meta_notify_stop_uname=[ ] drbd_resource=[genome-nt]
CRM_meta_notify_master_uname=[hgvmcore2 ] CRM_meta_master_max=[1]
CRM_meta_notify_demote_uname=[ ] CRM_meta_notify_ cancelled
Jun 16 10:30:18 hgvmcore1 crmd: [15563]: info: do_lrm_rsc_op: Performing
key=46:448:0:13ca092e-78e7-4345-8728-7f8e8c2ddb29
op=drbd-genome-nt:1_stop_0 )
Jun 16 10:30:18 hgvmcore1 lrmd: [15560]: info: rsc:drbd-genome-nt:1:25: stop
Jun 16 10:30:18 hgvmcore1 crmd: [15563]: info: process_lrm_event: LRM
operation drbd-genome-nt:1_monitor_15000 (call=11, status=1,
cib-update=0, confirmed=true) Cancelled
Jun 16 10:30:18 hgvmcore1 attrd: [15561]: info: attrd_ais_dispatch:
Update relayed from hgvmcore2
Jun 16 10:30:18 hgvmcore1 attrd: [15561]: info: attrd_trigger_update:
Sending flush op to all hosts for: fail-count-caweb-vd (INFINITY)
Jun 16 10:30:18 hgvmcore1 kernel: block drbd1: peer( Primary -> Unknown
) conn( Connected -> Disconnecting ) pdsk( UpToDate -> DUnknown )
Jun 16 10:30:18 hgvmcore1 kernel: block drbd1: meta connection shut down
by peer.
Jun 16 10:30:18 hgvmcore1 kernel: block drbd1: asender terminated
Jun 16 10:30:18 hgvmcore1 kernel: block drbd1: Terminating asender thread
Jun 16 10:30:18 hgvmcore1 kernel: block drbd1: Connection closed
Jun 16 10:30:18 hgvmcore1 kernel: block drbd1: conn( Disconnecting ->
StandAlone )
Jun 16 10:30:18 hgvmcore1 kernel: block drbd1: receiver terminated
Jun 16 10:30:18 hgvmcore1 kernel: block drbd1: Terminating receiver thread
Jun 16 10:30:18 hgvmcore1 kernel: block drbd1: disk( UpToDate -> Diskless )
Jun 16 10:30:18 hgvmcore1 kernel: block drbd1: drbd_bm_resize called
with capacity == 0
Jun 16 10:30:18 hgvmcore1 kernel: block drbd1: worker terminated
Jun 16 10:30:18 hgvmcore1 kernel: block drbd1: Terminating worker thread
Jun 16 10:30:18 hgvmcore1 lrmd: [15560]: info: RA output:
(drbd-genome-nt:1:stop:stdout)
Jun 16 10:30:18 hgvmcore1 kernel: block drbd1: State change failed: Disk
state is lower than outdated
Jun 16 10:30:18 hgvmcore1 kernel: block drbd1: state = { cs:StandAlone
ro:Secondary/Unknown ds:Diskless/DUnknown r--- }
Jun 16 10:30:18 hgvmcore1 kernel: block drbd1: wanted = { cs:StandAlone
ro:Secondary/Unknown ds:Outdated/DUnknown r--- }
Jun 16 10:30:18 hgvmcore1 attrd: [15561]: info: attrd_perform_update:
Sent update 51: fail-count-caweb-vd=INFINITY
Jun 16 10:30:18 hgvmcore1 attrd: [15561]: info: attrd_ais_dispatch:
Update relayed from hgvmcore2
Jun 16 10:30:18 hgvmcore1 attrd: [15561]: info: attrd_trigger_update:
Sending flush op to all hosts for: last-failure-caweb-vd (1276709418)
Jun 16 10:30:18 hgvmcore1 attrd: [15561]: info: attrd_perform_update:
Sent update 54: last-failure-caweb-vd=1276709418
Jun 16 10:30:18 hgvmcore1 lrmd: [15560]: info: RA output:
(drbd-genome-nt:1:stop:stdout)
Jun 16 10:30:18 hgvmcore1 crm_attribute: [20593]: info: Invoked:
crm_attribute -N hgvmcore1 -n master-drbd-genome-nt:1 -l reboot -v -INFINITY
Jun 16 10:30:18 hgvmcore1 attrd: [15561]: info: attrd_trigger_update:
Sending flush op to all hosts for: master-drbd-genome-nt:1 (-INFINITY)
Jun 16 10:30:18 hgvmcore1 attrd: [15561]: info: attrd_perform_update:
Sent update 56: master-drbd-genome-nt:1=-INFINITY
Jun 16 10:30:18 hgvmcore1 lrmd: [15560]: info: RA output:
(drbd-genome-nt:1:stop:stdout)
Jun 16 10:30:18 hgvmcore1 crm_attribute: [20597]: info: Invoked:
crm_attribute -N hgvmcore1 -n master-drbd-genome-nt:1 -l reboot -D
Jun 16 10:30:18 hgvmcore1 attrd: [15561]: info: attrd_trigger_update:
Sending flush op to all hosts for: master-drbd-genome-nt:1 (<null>)
Jun 16 10:30:18 hgvmcore1 attrd: [15561]: info: attrd_perform_update:
Sent delete 58: node=hgvmcore1, attr=master-drbd-genome-nt:1, id=<n/a>,
set=(null), section=status
Jun 16 10:30:18 hgvmcore1 lrmd: [15560]: info: RA output:
(drbd-genome-nt:1:stop:stdout)
Jun 16 10:30:18 hgvmcore1 attrd: [15561]: info: attrd_perform_update:
Sent delete 60: node=hgvmcore1, attr=master-drbd-genome-nt:1, id=<n/a>,
set=(null), section=status
Jun 16 10:30:18 hgvmcore1 crmd: [15563]: info: process_lrm_event: LRM
operation drbd-genome-nt:1_stop_0 (call=25, rc=0, cib-update=35,
confirmed=true) ok
Jun 16 10:30:19 hgvmcore1 crmd: [15563]: info: do_lrm_rsc_op: Performing
key=85:449:0:13ca092e-78e7-4345-8728-7f8e8c2ddb29 op=drbd-caweb:0_notify_0 )
Jun 16 10:30:19 hgvmcore1 lrmd: [15560]: info: rsc:drbd-caweb:0:26: notify
Jun 16 10:30:19 hgvmcore1 crmd: [15563]: info: process_lrm_event: LRM
operation drbd-caweb:0_notify_0 (call=26, rc=0, cib-update=36,
confirmed=true) ok
Jun 16 10:30:48 hgvmcore1 corosync[15541]: [pcmk ] notice:
pcmk_shutdown: Still waiting for crmd (pid=15563, seq=6) to terminate...
and crm_mon shows this:
============
Last updated: Wed Jun 16 10:30:18 2010
Stack: openais
Current DC: hgvmcore2 - partition with quorum
Version: 1.0.8-9881a7350d6182bae9e8e557cf20a3cc5dac3ee7
2 Nodes configured, 2 expected votes
4 Resources configured.
============
Online: [ hgvmcore1 hgvmcore2 ]
Master/Slave Set: ms-drbd-caweb
Masters: [ hgvmcore1 ]
Slaves: [ hgvmcore2 ]
caweb-vd (ocf::heartbeat:VirtualDomain): Started hgvmcore1
(unmanaged) FAILED
Master/Slave Set: ms-drbd-genome-nt
Masters: [ hgvmcore2 ]
Stopped: [ drbd-genome-nt:1 ]
genome-nt-vd (ocf::heartbeat:VirtualDomain): Started hgvmcore2
Failed actions:
caweb-vd_stop_0 (node=hgvmcore1, call=22, rc=1, status=complete):
unknown error
So it seems to have trouble shutting down. After 15-20 minutes the
machine does eventually shut down/reboot; the VMs themselves shut down
quickly, though I'm not sure about DRBD. I don't know if it will help,
but here is my crm config:
node hgvmcore1
node hgvmcore2
primitive caweb-vd ocf:heartbeat:VirtualDomain \
params config="/etc/libvirt/qemu/caweb.xml"
hypervisor="qemu:///system" \
meta allow-migrate="false" is-managed="true"
target-role="Started" \
op start interval="0" timeout="120s" \
op stop interval="0" timeout="120s" \
op monitor interval="10" timeout="30" depth="0"
primitive drbd-caweb ocf:linbit:drbd \
params drbd_resource="caweb" \
op monitor interval="15s" \
op start interval="0" timeout="240s" \
op stop interval="0" timeout="100s"
primitive drbd-genome-nt ocf:linbit:drbd \
params drbd_resource="genome-nt" \
op monitor interval="15s" \
op start interval="0" timeout="240s" \
op stop interval="0" timeout="100s"
primitive genome-nt-vd ocf:heartbeat:VirtualDomain \
params config="/etc/libvirt/qemu/genome-nt.xml"
hypervisor="qemu:///system" \
meta allow-migrate="false" is-managed="true"
target-role="Started" \
op start interval="0" timeout="120s" \
op stop interval="0" timeout="120s" \
op monitor interval="10" timeout="30" depth="0"
ms ms-drbd-caweb drbd-caweb \
meta master-max="1" master-node-max="1" clone-max="2"
clone-node-max="1" notify="true" target-role="Started" is-managed="true"
ms ms-drbd-genome-nt drbd-genome-nt \
meta master-max="1" master-node-max="1" clone-max="2"
clone-node-max="1" notify="true" target-role="Started"
location caweb-prefers-hgvmcore1 caweb-vd 50: hgvmcore1
location genome-nt-prefers-hgvmcore2 genome-nt-vd 50: hgvmcore2
colocation caweb-vd-on-drbd inf: caweb-vd ms-drbd-caweb:Master
colocation genome-nt-vd-on-drbd inf: genome-nt-vd ms-drbd-genome-nt:Master
order caweb-after-drbd inf: ms-drbd-caweb:promote caweb-vd:start
order genome-nt-after-drbd inf: ms-drbd-genome-nt:promote genome-nt-vd:start
property $id="cib-bootstrap-options" \
dc-version="1.0.8-9881a7350d6182bae9e8e557cf20a3cc5dac3ee7" \
cluster-infrastructure="openais" \
expected-quorum-votes="2" \
stonith-enabled="false" \
no-quorum-policy="ignore" \
last-lrm-refresh="1276708976"
rsc_defaults $id="rsc-options" \
resource-stickiness="100"
And here is the corosync config:
# cat /etc/corosync/corosync.conf
# Please read the corosync.conf.5 manual page
compatibility: whitetank
aisexec {
user: root
group: root
}
totem {
version: 2
# How long before declaring a token lost (ms)
token: 5000
# How many token retransmits before forming a new configuration
token_retransmits_before_loss_const: 20
# How long to wait for join messages in the membership protocol (ms)
join: 1000
# How long to wait for consensus to be achieved before starting a new round of membership configuration (ms)
consensus: 7500
# Turn off the virtual synchrony filter
vsftype: none
# Number of messages that may be sent by one processor on receipt of the token
max_messages: 20
# Disable encryption
secauth: off
# How many threads to use for encryption/decryption
threads: 0
# Limit generated nodeids to 31-bits (positive signed integers)
clear_node_high_bit: yes
# Optionally assign a fixed node id (integer)
# nodeid: 1234
interface {
ringnumber: 0
bindnetaddr: 10.1.0.255
mcastaddr: 226.94.1.55
mcastport: 4001
}
}
logging {
fileline: off
to_stderr: yes
to_logfile: yes
to_syslog: yes
logfile: /var/log/corosync.log
debug: off
timestamp: on
logger_subsys {
subsys: AMF
debug: off
}
}
amf {
mode: disabled
}
service {
# Load the Pacemaker Cluster Resource Manager
name: pacemaker
ver: 0
}
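If it would help to gather more info while the shutdown hangs, these are
the kinds of checks I could run from another shell (a sketch, not
output):
# ps -ef | egrep 'corosync|crmd|lrmd|pengine|cib|attrd'
# crm_mon -1 -f          <- -f also shows the fail counts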
Is this normal, or maybe a known issue? Any help much appreciated!!!
BTW - if I try to shut down a VM via:
# crm resource stop my_vm_resource
crm_mon shows the same thing, i.e. "caweb-vd
(ocf::heartbeat:VirtualDomain): Started hgvmcore1 (unmanaged) FAILED",
and the resource never stops cleanly as far as Pacemaker is concerned,
even though the VM itself shuts down cleanly. I never put it in
'unmanaged' mode; it seems to go into that state by itself.
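I assume the failed stop then has to be cleaned up with something like
this before Pacemaker will manage the resource again (just a sketch; I
haven't confirmed it's the right fix):
# crm resource cleanup caweb-vd
That should clear the fail-count and the failed stop record, but
obviously I'd rather it not fail in the first place.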
Cheers,
-erich