[Pacemaker] Issues with fence and corosync crash
Simone Felici
s.felici at mclink.eu
Fri Dec 24 11:05:27 UTC 2010
Hi to all!

I have an issue with my cluster environment. First of all, my setup: a two-node CentOS 5.5 active/standby cluster with one DRBD partition, managing a Nagios service, an IP address, and the shared storage. The config files are at the bottom of this mail.

I'm testing the fence option to prevent split brain and concurrent access to the DRBD partition.
Starting from a sane situation, manually switching the resources, simulating a kernel panic, killing a process and so on all work fine. But if I shut down eth1 (the 192.168.100.0 network, i.e. the cross cable used for DRBD mirroring), the active node stays as it is and calls the fence handler, which adds this entry to the crm config:
location drbd-fence-by-handler-ServerData ServerData \
    rule $id="drbd-fence-by-handler-rule-ServerData" $role="Master" -inf: #uname ne opsview-core01-tn
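For clarity, the failure injection itself is nothing more than taking the DRBD/ring-0 interface down and then checking that the handler added the constraint, roughly:

ifdown eth1                              # drop the 192.168.100.0 link
crm configure show | grep drbd-fence     # the constraint above appears in the config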
But on the standby node the corosync process dies:
*** STANDBY NODE LOG ***
Dec 24 11:00:04 corosync [TOTEM ] Incrementing problem counter for seqid 14158 iface 192.168.100.12 to [1 of 10]
Dec 24 11:00:04 corosync [TOTEM ] Incrementing problem counter for seqid 14160 iface 192.168.100.12 to [2 of 10]
Dec 24 11:00:05 corosync [TOTEM ] Incrementing problem counter for seqid 14162 iface 192.168.100.12 to [3 of 10]
Dec 24 11:00:05 corosync [TOTEM ] Incrementing problem counter for seqid 14164 iface 192.168.100.12 to [4 of 10]
Dec 24 11:00:06 corosync [TOTEM ] Decrementing problem counter for iface 192.168.100.12 to [3 of 10]
Dec 24 11:00:06 corosync [TOTEM ] Incrementing problem counter for seqid 14166 iface 192.168.100.12 to [4 of 10]
Dec 24 11:00:06 corosync [TOTEM ] Incrementing problem counter for seqid 14168 iface 192.168.100.12 to [5 of 10]
Dec 24 11:00:07 corosync [TOTEM ] Incrementing problem counter for seqid 14170 iface 192.168.100.12 to [6 of 10]
Dec 24 11:00:08 corosync [TOTEM ] Incrementing problem counter for seqid 14172 iface 192.168.100.12 to [7 of 10]
Dec 24 11:00:08 corosync [TOTEM ] Decrementing problem counter for iface 192.168.100.12 to [6 of 10]
Dec 24 11:00:08 corosync [TOTEM ] Incrementing problem counter for seqid 14174 iface 192.168.100.12 to [7 of 10]
Dec 24 11:00:09 corosync [TOTEM ] Incrementing problem counter for seqid 14176 iface 192.168.100.12 to [8 of 10]
Dec 24 11:00:09 corosync [TOTEM ] Incrementing problem counter for seqid 14178 iface 192.168.100.12 to [9 of 10]
Dec 24 11:00:10 corosync [TOTEM ] Decrementing problem counter for iface 192.168.100.12 to [8 of 10]
Dec 24 11:00:10 corosync [TOTEM ] Incrementing problem counter for seqid 14180 iface 192.168.100.12 to [9 of 10]
Dec 24 11:00:10 corosync [TOTEM ] Incrementing problem counter for seqid 14182 iface 192.168.100.12 to [10 of 10]
Dec 24 11:00:10 corosync [TOTEM ] Marking seqid 14182 ringid 0 interface 192.168.100.12 FAULTY - adminisrtative intervention required.
Dec 24 11:00:11 corosync [TOTEM ] FAILED TO RECEIVE
Dec 24 11:00:12 corosync [TOTEM ] FAILED TO RECEIVE
Dec 24 11:00:12 corosync [TOTEM ] FAILED TO RECEIVE
Dec 24 11:00:12 corosync [TOTEM ] FAILED TO RECEIVE
Dec 24 11:00:12 corosync [TOTEM ] FAILED TO RECEIVE
Dec 24 11:00:12 corosync [TOTEM ] FAILED TO RECEIVE
Dec 24 11:00:13 corosync [TOTEM ] FAILED TO RECEIVE
Dec 24 11:00:13 corosync [TOTEM ] FAILED TO RECEIVE
Dec 24 11:00:13 corosync [TOTEM ] FAILED TO RECEIVE
Dec 24 11:00:13 corosync [TOTEM ] FAILED TO RECEIVE
Dec 24 11:00:14 opsview-core02-tn stonithd: [5151]: ERROR: ais_dispatch: Receiving message body failed: (2) Library error: No such
file or directory (2)
Dec 24 11:00:14 opsview-core02-tn stonithd: [5151]: ERROR: ais_dispatch: AIS connection failed
Dec 24 11:00:14 opsview-core02-tn crmd: [5156]: ERROR: ais_dispatch: Receiving message body failed: (2) Library error: Resource
temporarily unavailable (11)
Dec 24 11:00:14 opsview-core02-tn stonithd: [5151]: ERROR: AIS connection terminated
Dec 24 11:00:14 opsview-core02-tn crmd: [5156]: ERROR: ais_dispatch: AIS connection failed
Dec 24 11:00:14 opsview-core02-tn crmd: [5156]: ERROR: crm_ais_destroy: AIS connection terminated
Dec 24 11:00:14 opsview-core02-tn cib: [5152]: ERROR: ais_dispatch: Receiving message body failed: (2) Library error: Resource
temporarily unavailable (11)
Dec 24 11:00:14 opsview-core02-tn attrd: [5154]: ERROR: ais_dispatch: Receiving message body failed: (2) Library error: Resource
temporarily unavailable (11)
Dec 24 11:00:14 opsview-core02-tn cib: [5152]: ERROR: ais_dispatch: AIS connection failed
Dec 24 11:00:14 opsview-core02-tn attrd: [5154]: ERROR: ais_dispatch: AIS connection failed
Dec 24 11:00:14 opsview-core02-tn cib: [5152]: ERROR: cib_ais_destroy: AIS connection terminated
Dec 24 11:00:14 opsview-core02-tn attrd: [5154]: CRIT: attrd_ais_destroy: Lost connection to OpenAIS service!
Dec 24 11:00:14 opsview-core02-tn attrd: [5154]: info: main: Exiting...
Dec 24 11:00:14 opsview-core02-tn attrd: [5154]: ERROR: attrd_cib_connection_destroy: Connection to the CIB terminated...
*** STANDBY NODE LOG ***
And the issues do not end there.
If I bring the eth1 interface back up, start corosync again, and re-enable the rings so that both are online (corosync-cfgtool -r), the standby node tries to take over the services even though resource-stickiness is set. It then goes into an error state, maybe because of the fence script.
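Concretely, the recovery steps I run on the standby node are roughly these (from memory, so take them as a sketch):

ifup eth1                  # bring the DRBD / ring-0 link back up
service corosync start     # restart the cluster stack
corosync-cfgtool -r        # re-enable the ring that was marked FAULTY
corosync-cfgtool -s        # check that both rings report "no faults"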
crm status:
============
Last updated: Fri Dec 24 11:06:40 2010
Stack: openais
Current DC: opsview-core01-tn - partition with quorum
Version: 1.0.10-da7075976b5ff0bee71074385f8fd02f296ec8a3
2 Nodes configured, 2 expected votes
2 Resources configured.
============
Online: [ opsview-core01-tn opsview-core02-tn ]
Master/Slave Set: ServerData
drbd_data:0 (ocf::linbit:drbd): Slave opsview-core02-tn (unmanaged) FAILED
Stopped: [ drbd_data:1 ]
Failed actions:
drbd_data:0_stop_0 (node=opsview-core02-tn, call=9, rc=6, status=complete): not configured
LOGS on slave:
****************************************
Dec 24 11:06:13 corosync [MAIN ] Corosync Cluster Engine ('1.2.7'): started and ready to provide service.
Dec 24 11:06:13 corosync [MAIN ] Corosync built-in features: nss rdma
Dec 24 11:06:13 corosync [MAIN ] Successfully read main configuration file '/etc/corosync/corosync.conf'.
Dec 24 11:06:13 corosync [TOTEM ] Initializing transport (UDP/IP).
Dec 24 11:06:13 corosync [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
Dec 24 11:06:13 corosync [TOTEM ] Initializing transport (UDP/IP).
Dec 24 11:06:13 corosync [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
Dec 24 11:06:13 corosync [TOTEM ] The network interface [192.168.100.12] is now up.
Dec 24 11:06:13 corosync [pcmk ] info: process_ais_conf: Reading configure
Set r/w permissions for uid=0, gid=0 on /var/log/cluster/corosync.log
Dec 24 11:06:13 corosync [pcmk ] info: config_find_init: Local handle: 4730966301143465986 for logging
Dec 24 11:06:13 corosync [pcmk ] info: config_find_next: Processing additional logging options...
Dec 24 11:06:13 corosync [pcmk ] info: get_config_opt: Found 'off' for option: debug
Dec 24 11:06:13 corosync [pcmk ] info: get_config_opt: Found 'yes' for option: to_logfile
Dec 24 11:06:13 corosync [pcmk ] info: get_config_opt: Found '/var/log/cluster/corosync.log' for option: logfile
Dec 24 11:06:13 corosync [pcmk ] info: get_config_opt: Found 'yes' for option: to_syslog
Dec 24 11:06:13 corosync [pcmk ] info: get_config_opt: Defaulting to 'daemon' for option: syslog_facility
Dec 24 11:06:13 corosync [pcmk ] info: config_find_init: Local handle: 7739444317642555395 for service
Dec 24 11:06:13 corosync [pcmk ] info: config_find_next: Processing additional service options...
Dec 24 11:06:13 corosync [pcmk ] info: get_config_opt: Defaulting to 'pcmk' for option: clustername
Dec 24 11:06:13 corosync [pcmk ] info: get_config_opt: Defaulting to 'no' for option: use_logd
Dec 24 11:06:13 corosync [pcmk ] info: get_config_opt: Defaulting to 'no' for option: use_mgmtd
Dec 24 11:06:13 corosync [pcmk ] info: pcmk_startup: CRM: Initialized
Dec 24 11:06:13 corosync [pcmk ] Logging: Initialized pcmk_startup
Dec 24 11:06:13 corosync [pcmk ] info: pcmk_startup: Maximum core file size is: 18446744073709551615
Dec 24 11:06:13 corosync [pcmk ] info: pcmk_startup: Service: 9
Dec 24 11:06:13 corosync [pcmk ] info: pcmk_startup: Local hostname: opsview-core02-tn
Dec 24 11:06:13 corosync [pcmk ] info: pcmk_update_nodeid: Local node id: 207923392
Dec 24 11:06:13 corosync [pcmk ] info: update_member: Creating entry for node 207923392 born on 0
Dec 24 11:06:13 corosync [pcmk ] info: update_member: 0x2aaaac000920 Node 207923392 now known as opsview-core02-tn (was: (null))
Dec 24 11:06:13 opsview-core02-tn lrmd: [5153]: info: lrmd is shutting down
Dec 24 11:06:13 opsview-core02-tn stonithd: [6762]: info: G_main_add_SignalHandler: Added signal handler for signal 10
Dec 24 11:06:13 opsview-core02-tn attrd: [6765]: info: Invoked: /usr/lib64/heartbeat/attrd
Dec 24 11:06:13 opsview-core02-tn lrmd: [6764]: info: Signal sent to pid=5153, waiting for process to exit
Dec 24 11:06:13 corosync [pcmk ] info: update_member: Node opsview-core02-tn now has 1 quorum votes (was 0)
Dec 24 11:06:13 opsview-core02-tn stonithd: [6762]: info: G_main_add_SignalHandler: Added signal handler for signal 12
Dec 24 11:06:13 opsview-core02-tn attrd: [6765]: info: main: Starting up
Dec 24 11:06:13 opsview-core02-tn lrmd: [6764]: info: G_main_add_SignalHandler: Added signal handler for signal 15
Dec 24 11:06:13 opsview-core02-tn pengine: [6766]: info: Invoked: /usr/lib64/heartbeat/pengine
Dec 24 11:06:13 corosync [pcmk ] info: update_member: Node 207923392/opsview-core02-tn is now: member
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: Invoked: /usr/lib64/heartbeat/cib
Dec 24 11:06:13 opsview-core02-tn attrd: [6765]: info: crm_cluster_connect: Connecting to OpenAIS
Dec 24 11:06:13 opsview-core02-tn stonithd: [6762]: info: crm_cluster_connect: Connecting to OpenAIS
Dec 24 11:06:13 opsview-core02-tn crmd: [6767]: info: Invoked: /usr/lib64/heartbeat/crmd
Dec 24 11:06:13 corosync [pcmk ] info: spawn_child: Forked child 6762 for process stonithd
Dec 24 11:06:13 opsview-core02-tn pengine: [6766]: WARN: main: Terminating previous PE instance
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: G_main_add_TriggerHandler: Added signal manual handler
Dec 24 11:06:13 opsview-core02-tn attrd: [6765]: info: init_ais_connection_once: Creating connection to our AIS plugin
Dec 24 11:06:13 opsview-core02-tn lrmd: [6764]: info: G_main_add_SignalHandler: Added signal handler for signal 17
Dec 24 11:06:13 opsview-core02-tn stonithd: [6762]: info: init_ais_connection_once: Creating connection to our AIS plugin
Dec 24 11:06:13 opsview-core02-tn crmd: [6767]: info: main: CRM Hg Version: da7075976b5ff0bee71074385f8fd02f296ec8a3
Dec 24 11:06:13 corosync [pcmk ] info: spawn_child: Forked child 6763 for process cib
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: G_main_add_SignalHandler: Added signal handler for signal 17
Dec 24 11:06:13 opsview-core02-tn pengine: [5155]: WARN: process_pe_message: Received quit message, terminating
Dec 24 11:06:13 opsview-core02-tn lrmd: [6764]: info: enabling coredumps
Dec 24 11:06:13 corosync [pcmk ] info: spawn_child: Forked child 6764 for process lrmd
Dec 24 11:06:13 opsview-core02-tn crmd: [6767]: info: crmd_init: Starting crmd
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: retrieveCib: Reading cluster configuration from:
/var/lib/heartbeat/crm/cib.xml (digest: /var/lib/heartbeat/crm/cib.xml.sig)
Dec 24 11:06:13 opsview-core02-tn lrmd: [6764]: info: G_main_add_SignalHandler: Added signal handler for signal 10
Dec 24 11:06:13 corosync [pcmk ] info: spawn_child: Forked child 6765 for process attrd
Dec 24 11:06:13 opsview-core02-tn crmd: [6767]: info: G_main_add_SignalHandler: Added signal handler for signal 17
Dec 24 11:06:13 opsview-core02-tn lrmd: [6764]: info: G_main_add_SignalHandler: Added signal handler for signal 12
Dec 24 11:06:13 corosync [pcmk ] info: spawn_child: Forked child 6766 for process pengine
Dec 24 11:06:13 opsview-core02-tn lrmd: [6764]: info: Started.
Dec 24 11:06:13 corosync [pcmk ] info: spawn_child: Forked child 6767 for process crmd
Dec 24 11:06:13 corosync [SERV ] Service engine loaded: Pacemaker Cluster Manager 1.0.9
Dec 24 11:06:13 corosync [SERV ] Service engine loaded: corosync extended virtual synchrony service
Dec 24 11:06:13 corosync [SERV ] Service engine loaded: corosync configuration service
Dec 24 11:06:13 corosync [SERV ] Service engine loaded: corosync cluster closed process group service v1.01
Dec 24 11:06:13 corosync [SERV ] Service engine loaded: corosync cluster config database access v1.01
Dec 24 11:06:13 corosync [SERV ] Service engine loaded: corosync profile loading service
Dec 24 11:06:13 corosync [SERV ] Service engine loaded: corosync cluster quorum service v0.1
Dec 24 11:06:13 corosync [MAIN ] Compatibility mode set to whitetank. Using V1 and V2 of the synchronization engine.
Dec 24 11:06:13 corosync [TOTEM ] The network interface [172.18.17.12] is now up.
Dec 24 11:06:13 opsview-core02-tn attrd: [6765]: info: init_ais_connection_once: AIS connection established
Dec 24 11:06:13 opsview-core02-tn stonithd: [6762]: info: init_ais_connection_once: AIS connection established
Dec 24 11:06:13 corosync [pcmk ] info: pcmk_ipc: Recorded connection 0x868c90 for attrd/6765
Dec 24 11:06:13 opsview-core02-tn attrd: [6765]: info: get_ais_nodeid: Server details: id=207923392 uname=opsview-core02-tn cname=pcmk
Dec 24 11:06:13 corosync [pcmk ] info: pcmk_ipc: Recorded connection 0x86d0a0 for stonithd/6762
Dec 24 11:06:13 opsview-core02-tn attrd: [6765]: info: crm_new_peer: Node opsview-core02-tn now has id: 207923392
Dec 24 11:06:13 opsview-core02-tn attrd: [6765]: info: crm_new_peer: Node 207923392 is now known as opsview-core02-tn
Dec 24 11:06:13 opsview-core02-tn attrd: [6765]: info: main: Cluster connection active
Dec 24 11:06:13 opsview-core02-tn stonithd: [6762]: info: get_ais_nodeid: Server details: id=207923392 uname=opsview-core02-tn
cname=pcmk
Dec 24 11:06:13 opsview-core02-tn stonithd: [6762]: info: crm_new_peer: Node opsview-core02-tn now has id: 207923392
Dec 24 11:06:13 opsview-core02-tn attrd: [6765]: info: main: Accepting attribute updates
Dec 24 11:06:13 opsview-core02-tn stonithd: [6762]: info: crm_new_peer: Node 207923392 is now known as opsview-core02-tn
Dec 24 11:06:13 opsview-core02-tn attrd: [6765]: info: main: Starting mainloop...
Dec 24 11:06:13 opsview-core02-tn stonithd: [6762]: notice: /usr/lib64/heartbeat/stonithd start up successfully.
Dec 24 11:06:13 opsview-core02-tn stonithd: [6762]: info: G_main_add_SignalHandler: Added signal handler for signal 17
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: startCib: CIB Initialization completed successfully
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: crm_cluster_connect: Connecting to OpenAIS
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: init_ais_connection_once: Creating connection to our AIS plugin
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: init_ais_connection_once: AIS connection established
Dec 24 11:06:13 corosync [pcmk ] info: pcmk_ipc: Recorded connection 0x872fa0 for cib/6763
Dec 24 11:06:13 corosync [pcmk ] info: update_member: Node opsview-core02-tn now has process list:
00000000000000000000000000013312 (78610)
Dec 24 11:06:13 corosync [pcmk ] info: pcmk_ipc: Sending membership update 0 to cib
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: get_ais_nodeid: Server details: id=207923392 uname=opsview-core02-tn cname=pcmk
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: crm_new_peer: Node opsview-core02-tn now has id: 207923392
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: crm_new_peer: Node 207923392 is now known as opsview-core02-tn
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: cib_init: Starting cib mainloop
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: ais_dispatch: Membership 0: quorum still lost
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: crm_update_peer: Node opsview-core02-tn: id=207923392 state=member (new)
addr=(null) votes=1 (new) born=0 seen=0 proc=00000000000000000000000000013312 (new)
Dec 24 11:06:13 opsview-core02-tn cib: [6771]: info: write_cib_contents: Archived previous version as
/var/lib/heartbeat/crm/cib-26.raw
Dec 24 11:06:13 opsview-core02-tn cib: [6771]: info: write_cib_contents: Wrote version 0.473.0 of the CIB to disk (digest:
3c7be90920e86222ad6102a0f01d9efd)
Dec 24 11:06:13 opsview-core02-tn cib: [6771]: info: retrieveCib: Reading cluster configuration from:
/var/lib/heartbeat/crm/cib.UxVZY6 (digest: /var/lib/heartbeat/crm/cib.76RIND)
Dec 24 11:06:13 corosync [TOTEM ] Incrementing problem counter for seqid 1 iface 172.18.17.12 to [1 of 10]
Dec 24 11:06:13 corosync [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 13032: memb=0, new=0, lost=0
Dec 24 11:06:13 corosync [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 13032: memb=1, new=1, lost=0
Dec 24 11:06:13 corosync [pcmk ] info: pcmk_peer_update: NEW: opsview-core02-tn 207923392
Dec 24 11:06:13 corosync [pcmk ] info: pcmk_peer_update: MEMB: opsview-core02-tn 207923392
Dec 24 11:06:13 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Dec 24 11:06:13 corosync [MAIN ] Completed service synchronization, ready to provide service.
Dec 24 11:06:13 corosync [TOTEM ] Incrementing problem counter for seqid 2 iface 192.168.100.12 to [1 of 10]
Dec 24 11:06:13 corosync [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 13036: memb=1, new=0, lost=0
Dec 24 11:06:13 corosync [pcmk ] info: pcmk_peer_update: memb: opsview-core02-tn 207923392
Dec 24 11:06:13 corosync [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 13036: memb=2, new=1, lost=0
Dec 24 11:06:13 corosync [pcmk ] info: update_member: Creating entry for node 191146176 born on 13036
Dec 24 11:06:13 corosync [pcmk ] info: update_member: Node 191146176/unknown is now: member
Dec 24 11:06:13 corosync [pcmk ] info: pcmk_peer_update: NEW: .pending. 191146176
Dec 24 11:06:13 corosync [pcmk ] info: pcmk_peer_update: MEMB: .pending. 191146176
Dec 24 11:06:13 corosync [pcmk ] info: pcmk_peer_update: MEMB: opsview-core02-tn 207923392
Dec 24 11:06:13 corosync [pcmk ] info: send_member_notification: Sending membership update 13036 to 1 children
Dec 24 11:06:13 corosync [pcmk ] info: update_member: 0x2aaaac000920 Node 207923392 ((null)) born on: 13036
Dec 24 11:06:13 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: ais_dispatch: Membership 13036: quorum still lost
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: crm_new_peer: Node <null> now has id: 191146176
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: crm_update_peer: Node (null): id=191146176 state=member (new) addr=r(0)
ip(192.168.100.11) r(1) ip(172.18.17.11) votes=0 born=0 seen=13036 proc=00000000000000000000000000000000
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: crm_update_peer: Node opsview-core02-tn: id=207923392 state=member addr=r(0)
ip(192.168.100.12) r(1) ip(172.18.17.12) (new) votes=1 born=0 seen=13036 proc=00000000000000000000000000013312
Dec 24 11:06:13 corosync [pcmk ] info: update_member: 0x825ef0 Node 191146176 (opsview-core01-tn) born on: 13028
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: notice: ais_dispatch: Membership 13036: quorum acquired
Dec 24 11:06:13 corosync [pcmk ] info: update_member: 0x825ef0 Node 191146176 now known as opsview-core01-tn (was: (null))
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: crm_get_peer: Node 191146176 is now known as opsview-core01-tn
Dec 24 11:06:13 corosync [pcmk ] info: update_member: Node opsview-core01-tn now has process list:
00000000000000000000000000013312 (78610)
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: crm_update_peer: Node opsview-core01-tn: id=191146176 state=member addr=r(0)
ip(192.168.100.11) r(1) ip(172.18.17.11) votes=1 (new) born=13028 seen=13036 proc=00000000000000000000000000013312 (new)
Dec 24 11:06:13 corosync [pcmk ] info: update_member: Node opsview-core01-tn now has 1 quorum votes (was 0)
Dec 24 11:06:13 corosync [pcmk ] info: send_member_notification: Sending membership update 13036 to 1 children
Dec 24 11:06:13 corosync [pcmk ] WARN: route_ais_message: Sending message to local.crmd failed: unknown (rc=-2)
Dec 24 11:06:13 corosync [MAIN ] Completed service synchronization, ready to provide service.
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: cib_process_diff: Diff 0.475.1 -> 0.475.2 not applied to 0.473.0: current
"epoch" is less than required
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: cib_server_process_diff: Requesting re-sync from peer
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: WARN: cib_diff_notify: Local-only Change (client:crmd, call: 105): -1.-1.-1
(Application of an update diff failed, requesting a full refresh)
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: WARN: cib_server_process_diff: Not applying diff 0.475.2 -> 0.475.3 (sync in progress)
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: WARN: cib_server_process_diff: Not applying diff 0.475.3 -> 0.475.4 (sync in progress)
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: WARN: cib_server_process_diff: Not applying diff 0.475.4 -> 0.476.1 (sync in progress)
Dec 24 11:06:13 corosync [pcmk ] WARN: route_ais_message: Sending message to local.crmd failed: unknown (rc=-2)
Dec 24 11:06:13 opsview-core02-tn cib: [6763]: info: cib_replace_notify: Local-only Replace: -1.-1.-1 from opsview-core01-tn
Dec 24 11:06:13 opsview-core02-tn cib: [6772]: info: write_cib_contents: Archived previous version as
/var/lib/heartbeat/crm/cib-27.raw
Dec 24 11:06:13 opsview-core02-tn cib: [6772]: info: write_cib_contents: Wrote version 0.476.0 of the CIB to disk (digest:
c348ac643cfe3b370e5eca03ff7f180c)
Dec 24 11:06:13 opsview-core02-tn cib: [6772]: info: retrieveCib: Reading cluster configuration from:
/var/lib/heartbeat/crm/cib.FYgzJ8 (digest: /var/lib/heartbeat/crm/cib.VrDRiH)
Dec 24 11:06:13 corosync [pcmk ] WARN: route_ais_message: Sending message to local.crmd failed: unknown (rc=-2)
Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: do_cib_control: CIB connection established
Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: crm_cluster_connect: Connecting to OpenAIS
Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: init_ais_connection_once: Creating connection to our AIS plugin
Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: init_ais_connection_once: AIS connection established
Dec 24 11:06:14 corosync [pcmk ] info: pcmk_ipc: Recorded connection 0x878020 for crmd/6767
Dec 24 11:06:14 corosync [pcmk ] info: pcmk_ipc: Sending membership update 13036 to crmd
Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: get_ais_nodeid: Server details: id=207923392 uname=opsview-core02-tn cname=pcmk
Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: crm_new_peer: Node opsview-core02-tn now has id: 207923392
Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: crm_new_peer: Node 207923392 is now known as opsview-core02-tn
Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: do_ha_control: Connected to the cluster
Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: do_started: Delaying start, CCM (0000000000100000) not connected
Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: crmd_init: Starting crmd's mainloop
Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: config_query_callback: Checking for expired actions every 900000ms
Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: config_query_callback: Sending expected-votes=2 to corosync
Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: notice: ais_dispatch: Membership 13036: quorum acquired
Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: crm_new_peer: Node opsview-core01-tn now has id: 191146176
Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: crm_new_peer: Node 191146176 is now known as opsview-core01-tn
Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: crm_update_peer: Node opsview-core01-tn: id=191146176 state=member (new)
addr=r(0) ip(192.168.100.11) r(1) ip(172.18.17.11) votes=1 born=13028 seen=13036 proc=00000000000000000000000000013312
Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: crm_update_peer: Node opsview-core02-tn: id=207923392 state=member (new)
addr=r(0) ip(192.168.100.12) r(1) ip(172.18.17.12) (new) votes=1 (new) born=13036 seen=13036
proc=00000000000000000000000000013312 (new)
Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: do_started: The local CRM is operational
Dec 24 11:06:14 opsview-core02-tn crmd: [6767]: info: do_state_transition: State transition S_STARTING -> S_PENDING [
input=I_PENDING cause=C_FSA_INTERNAL origin=do_started ]
Dec 24 11:06:15 opsview-core02-tn pengine: [6766]: info: main: Starting pengine
Dec 24 11:06:15 opsview-core02-tn crmd: [6767]: info: ais_dispatch: Membership 13036: quorum retained
Dec 24 11:06:15 opsview-core02-tn crmd: [6767]: info: update_dc: Set DC to opsview-core01-tn (3.0.1)
Dec 24 11:06:15 opsview-core02-tn crmd: [6767]: info: update_attrd: Connecting to attrd...
Dec 24 11:06:15 opsview-core02-tn crmd: [6767]: info: do_state_transition: State transition S_PENDING -> S_NOT_DC [ input=I_NOT_DC
cause=C_HA_MESSAGE origin=do_cl_join_finalize_respond ]
Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: find_hash_entry: Creating hash entry for terminate
Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: find_hash_entry: Creating hash entry for shutdown
Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: attrd_local_callback: Sending full refresh (origin=crmd)
Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: attrd_trigger_update: Sending flush op to all hosts for: terminate (<null>)
Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: attrd_perform_update: Delaying operation terminate=<null>: cib not connected
Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: attrd_trigger_update: Sending flush op to all hosts for: shutdown (<null>)
Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: attrd_perform_update: Delaying operation shutdown=<null>: cib not connected
Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: attrd_perform_update: Delaying operation terminate=<null>: cib not connected
Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: attrd_perform_update: Delaying operation shutdown=<null>: cib not connected
Dec 24 11:06:15 opsview-core02-tn crmd: [6767]: info: erase_xpath_callback: Deletion of
"//node_state[@uname='opsview-core02-tn']/transient_attributes": ok (rc=0)
Dec 24 11:06:15 corosync [TOTEM ] ring 0 active with no faults
Dec 24 11:06:15 corosync [TOTEM ] ring 1 active with no faults
Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: crm_new_peer: Node opsview-core01-tn now has id: 191146176
Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: crm_new_peer: Node 191146176 is now known as opsview-core01-tn
Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: find_hash_entry: Creating hash entry for master-drbd_data:0
Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: attrd_perform_update: Delaying operation master-drbd_data:0=<null>: cib not
connected
Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: find_hash_entry: Creating hash entry for probe_complete
Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: attrd_perform_update: Delaying operation probe_complete=<null>: cib not
connected
Dec 24 11:06:15 opsview-core02-tn crmd: [6767]: info: do_lrm_rsc_op: Performing key=9:8:7:72e2a81d-2f69-4752-b8f9-3294ed06f6a0
op=drbd_data:0_monitor_0 )
Dec 24 11:06:15 opsview-core02-tn lrmd: [6764]: info: rsc:drbd_data:0:2: probe
Dec 24 11:06:15 opsview-core02-tn crmd: [6767]: info: do_lrm_rsc_op: Performing key=10:8:7:72e2a81d-2f69-4752-b8f9-3294ed06f6a0
op=ServerFS_monitor_0 )
Dec 24 11:06:15 opsview-core02-tn lrmd: [6764]: info: rsc:ServerFS:3: probe
Dec 24 11:06:15 opsview-core02-tn crmd: [6767]: info: do_lrm_rsc_op: Performing key=11:8:7:72e2a81d-2f69-4752-b8f9-3294ed06f6a0
op=ClusterIP01_monitor_0 )
Dec 24 11:06:15 opsview-core02-tn lrmd: [6764]: info: rsc:ClusterIP01:4: probe
Dec 24 11:06:15 opsview-core02-tn lrmd: [6764]: notice: lrmd_rsc_new(): No lrm_rprovider field in message
Dec 24 11:06:15 opsview-core02-tn crmd: [6767]: info: do_lrm_rsc_op: Performing key=12:8:7:72e2a81d-2f69-4752-b8f9-3294ed06f6a0
op=opsview-core_lsb_monitor_0 )
Dec 24 11:06:15 opsview-core02-tn lrmd: [6764]: info: rsc:opsview-core_lsb:5: probe
Dec 24 11:06:15 opsview-core02-tn lrmd: [6764]: notice: lrmd_rsc_new(): No lrm_rprovider field in message
Dec 24 11:06:15 opsview-core02-tn crmd: [6767]: info: do_lrm_rsc_op: Performing key=13:8:7:72e2a81d-2f69-4752-b8f9-3294ed06f6a0
op=opsview-web_lsb_monitor_0 )
Dec 24 11:06:15 opsview-core02-tn crmd: [6767]: info: do_lrm_rsc_op: Performing key=14:8:7:72e2a81d-2f69-4752-b8f9-3294ed06f6a0
op=WebSite_monitor_0 )
Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: find_hash_entry: Creating hash entry for master-drbd_data:1
Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: attrd_perform_update: Delaying operation master-drbd_data:1=<null>: cib not
connected
Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: attrd_perform_update: Delaying operation terminate=<null>: cib not connected
Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: attrd_perform_update: Delaying operation shutdown=<null>: cib not connected
Dec 24 11:06:15 opsview-core02-tn crmd: [6767]: info: process_lrm_event: LRM operation ClusterIP01_monitor_0 (call=4, rc=7,
cib-update=7, confirmed=true) not running
Dec 24 11:06:15 opsview-core02-tn crmd: [6767]: info: process_lrm_event: LRM operation ServerFS_monitor_0 (call=3, rc=7,
cib-update=8, confirmed=true) not running
Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: attrd_trigger_update: Sending flush op to all hosts for: master-drbd_data:0
(1000)
Dec 24 11:06:15 opsview-core02-tn attrd: [6765]: info: attrd_perform_update: Delaying operation master-drbd_data:0=1000: cib not
connected
Dec 24 11:06:15 opsview-core02-tn crmd: [6767]: info: process_lrm_event: LRM operation drbd_data:0_monitor_0 (call=2, rc=0,
cib-update=9, confirmed=true) ok
Dec 24 11:06:16 opsview-core02-tn lrmd: [6764]: info: rsc:opsview-web_lsb:6: probe
Dec 24 11:06:16 opsview-core02-tn lrmd: [6764]: info: rsc:WebSite:7: probe
Dec 24 11:06:16 opsview-core02-tn crmd: [6767]: info: process_lrm_event: LRM operation WebSite_monitor_0 (call=7, rc=7,
cib-update=10, confirmed=true) not running
Dec 24 11:06:18 opsview-core02-tn attrd: [6765]: info: cib_connect: Connected to the CIB after 1 signon attempts
Dec 24 11:06:18 opsview-core02-tn attrd: [6765]: info: cib_connect: Sending full refresh
Dec 24 11:06:18 opsview-core02-tn attrd: [6765]: info: attrd_trigger_update: Sending flush op to all hosts for: master-drbd_data:0
(1000)
Dec 24 11:06:18 opsview-core02-tn attrd: [6765]: info: attrd_perform_update: Sent update 4: master-drbd_data:0=1000
Dec 24 11:06:18 opsview-core02-tn attrd: [6765]: info: attrd_trigger_update: Sending flush op to all hosts for: probe_complete
(<null>)
Dec 24 11:06:18 opsview-core02-tn attrd: [6765]: info: attrd_trigger_update: Sending flush op to all hosts for: master-drbd_data:1
(<null>)
Dec 24 11:06:18 opsview-core02-tn attrd: [6765]: info: attrd_trigger_update: Sending flush op to all hosts for: terminate (<null>)
Dec 24 11:06:18 opsview-core02-tn attrd: [6765]: info: attrd_trigger_update: Sending flush op to all hosts for: shutdown (<null>)
Dec 24 11:06:21 opsview-core02-tn lrmd: [6764]: info: RA output: (opsview-core_lsb:probe:stderr) su: warning: cannot change
directory to /var/log/nagios: No such file or directory
Dec 24 11:06:21 opsview-core02-tn lrmd: [6764]: info: RA output: (opsview-core_lsb:probe:stderr) /etc/init.d/opsview: line 262:
/usr/local/nagios/bin/profile: No such file or directory
Dec 24 11:06:22 opsview-core02-tn lrmd: [6764]: info: RA output: (opsview-web_lsb:probe:stderr) su: warning: cannot change
directory to /var/log/nagios: No such file or directory
Dec 24 11:06:22 opsview-core02-tn lrmd: [6764]: info: RA output: (opsview-web_lsb:probe:stderr) /etc/init.d/opsview-web: line 171:
/usr/local/nagios/bin/opsview.sh: No such file or directory
Dec 24 11:06:27 opsview-core02-tn crmd: [6767]: info: process_lrm_event: LRM operation opsview-core_lsb_monitor_0 (call=5, rc=7,
cib-update=11, confirmed=true) not running
Dec 24 11:06:28 opsview-core02-tn crmd: [6767]: info: process_lrm_event: LRM operation opsview-web_lsb_monitor_0 (call=6, rc=7,
cib-update=12, confirmed=true) not running
Dec 24 11:06:28 opsview-core02-tn attrd: [6765]: info: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
Dec 24 11:06:28 opsview-core02-tn attrd: [6765]: info: attrd_perform_update: Sent update 15: probe_complete=true
Dec 24 11:06:28 opsview-core02-tn crmd: [6767]: info: do_lrm_rsc_op: Performing key=61:10:0:72e2a81d-2f69-4752-b8f9-3294ed06f6a0
op=drbd_data:0_notify_0 )
Dec 24 11:06:28 opsview-core02-tn lrmd: [6764]: info: rsc:drbd_data:0:8: notify
Dec 24 11:06:28 opsview-core02-tn crmd: [6767]: info: process_lrm_event: LRM operation drbd_data:0_notify_0 (call=8, rc=0,
cib-update=13, confirmed=true) ok
Dec 24 11:06:28 opsview-core02-tn crmd: [6767]: info: do_lrm_rsc_op: Performing key=13:10:0:72e2a81d-2f69-4752-b8f9-3294ed06f6a0
op=drbd_data:0_stop_0 )
Dec 24 11:06:28 opsview-core02-tn lrmd: [6764]: info: rsc:drbd_data:0:9: stop
Dec 24 11:06:28 opsview-core02-tn crmd: [6767]: info: process_lrm_event: LRM operation drbd_data:0_stop_0 (call=9, rc=6,
cib-update=14, confirmed=true) not configured
Dec 24 11:06:28 opsview-core02-tn attrd: [6765]: info: attrd_ais_dispatch: Update relayed from opsview-core01-tn
Dec 24 11:06:28 opsview-core02-tn attrd: [6765]: info: find_hash_entry: Creating hash entry for fail-count-drbd_data:0
Dec 24 11:06:28 opsview-core02-tn attrd: [6765]: info: attrd_trigger_update: Sending flush op to all hosts for:
fail-count-drbd_data:0 (INFINITY)
Dec 24 11:06:28 opsview-core02-tn attrd: [6765]: info: attrd_perform_update: Sent update 18: fail-count-drbd_data:0=INFINITY
Dec 24 11:06:28 opsview-core02-tn attrd: [6765]: info: attrd_ais_dispatch: Update relayed from opsview-core01-tn
Dec 24 11:06:28 opsview-core02-tn attrd: [6765]: info: find_hash_entry: Creating hash entry for last-failure-drbd_data:0
Dec 24 11:06:28 opsview-core02-tn attrd: [6765]: info: attrd_trigger_update: Sending flush op to all hosts for:
last-failure-drbd_data:0 (1293185188)
Dec 24 11:06:28 opsview-core02-tn attrd: [6765]: info: attrd_perform_update: Sent update 21: last-failure-drbd_data:0=1293185188
****************************************
Now the services are all DOWN.
At this point the only way I can recover is to reboot cluster02; after the reboot, when corosync starts it does NOT try to take over the services again.
The fence constraint is still there, though!
DRBD is now in this state:
Master/Slave Set: ServerData
Masters: [ opsview-core01-tn ]
Stopped: [ drbd_data:1 ]
because of the fence constraint.
If I run 'drbdadm -- --discard-my-data connect all' on cluster02 I get:
[root@core02-tn ~]# drbdadm -- --discard-my-data connect all
Could not stat("/proc/drbd"): No such file or directory
do you need to load the module?
try: modprobe drbd
Command 'drbdsetup 1 net 192.168.100.12:7789 192.168.100.11:7789 C --set-defaults --create-device --rr-conflict=disconnect
--after-sb-2pri=disconnect --after-sb-1pri=disconnect --after-sb-0pri=disconnect --discard-my-data' terminated with exit code 20
drbdadm connect cluster_data: exited with code 20
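The "/proc/drbd" error just means the drbd kernel module is not loaded (Pacemaker has not started the resource on this node); if I wanted to run the discard-my-data reconnect by hand, my understanding is it would have to look roughly like this (resource name from my drbd.conf, untested):

modprobe drbd                                         # load the kernel module
drbdadm up cluster_data                               # attach the disk and bring the resource up
drbdadm -- --discard-my-data connect cluster_data     # reconnect, discarding this node's data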
I have to remove this entry manually:
location drbd-fence-by-handler-ServerData ServerData \
    rule $id="drbd-fence-by-handler-rule-ServerData" $role="Master" -inf: #uname ne opsview-core01-tn
I have to do this by hand because I have no idea HOW to unfence the cluster so that the line above is removed automatically.
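In practice what I run to get rid of it is simply a crm shell delete, using the constraint id from the entry above:

crm configure delete drbd-fence-by-handler-ServerData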
Once the line is removed, cluster02 reconnects to DRBD:
Master/Slave Set: ServerData
Masters: [ opsview-core01-tn ]
Slaves: [ opsview-core02-tn ]
While writing this mail I also tested the inverse situation, and it only works halfway. That is, if cluster02 is the master and I disconnect eth1, the fence entry is added to the crm config, but cluster01 does *NOT* crash. To get back to a normal situation I again have to remove the "location
drbd-fence-by-handler-ServerData..." entry. However, when I remove that entry, cluster01 hits the same error and
corosync dies:
********* cluster01 logs **********
Dec 24 12:01:31 opsview-core01-tn crmd: [22674]: info: update_dc: Unset DC opsview-core01-tn
Dec 24 12:01:31 corosync [TOTEM ] FAILED TO RECEIVE
Dec 24 12:01:31 opsview-core01-tn cib: [22670]: info: cib_process_request: Operation complete: op cib_modify for section nodes
(origin=local/crmd/165, version=0.491.1): ok (rc=0)
Dec 24 12:01:31 opsview-core01-tn cib: [22670]: ERROR: ais_dispatch: Receiving message body failed: (2) Library error: Resource
temporarily unavailable (11)
Dec 24 12:01:31 opsview-core01-tn crmd: [22674]: ERROR: ais_dispatch: Receiving message body failed: (2) Library error: Resource
temporarily unavailable (11)
Dec 24 12:01:31 opsview-core01-tn cib: [22670]: ERROR: ais_dispatch: AIS connection failed
Dec 24 12:01:31 opsview-core01-tn crmd: [22674]: ERROR: ais_dispatch: AIS connection failed
Dec 24 12:01:31 opsview-core01-tn cib: [22670]: ERROR: cib_ais_destroy: AIS connection terminated
Dec 24 12:01:31 opsview-core01-tn crmd: [22674]: ERROR: crm_ais_destroy: AIS connection terminated
Dec 24 12:01:31 opsview-core01-tn stonithd: [22669]: ERROR: ais_dispatch: Receiving message body failed: (2) Library error:
Resource temporarily unavailable (11)
Dec 24 12:01:31 opsview-core01-tn stonithd: [22669]: ERROR: ais_dispatch: AIS connection failed
Dec 24 12:01:31 opsview-core01-tn stonithd: [22669]: ERROR: AIS connection terminated
Dec 24 12:01:31 opsview-core01-tn cib: [32447]: info: write_cib_contents: Archived previous version as
/var/lib/heartbeat/crm/cib-23.raw
Dec 24 12:01:31 opsview-core01-tn cib: [32447]: info: write_cib_contents: Wrote version 0.491.0 of the CIB to disk (digest:
ad222fed7ff40dc7093ffc6411079df4)
Dec 24 12:01:31 opsview-core01-tn cib: [32447]: info: retrieveCib: Reading cluster configuration from:
/var/lib/heartbeat/crm/cib.R3dVbk (digest: /var/lib/heartbeat/crm/cib.EllYEu)
Dec 24 12:01:33 opsview-core01-tn attrd: [22672]: ERROR: send_ais_text: Sending message 44: FAILED (rc=2): Library error:
Connection timed out (110)
Dec 24 12:01:33 opsview-core01-tn attrd: [22672]: info: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
Dec 24 12:01:33 opsview-core01-tn attrd: [22672]: ERROR: send_ipc_message: IPC Channel to 22670 is not connected
Dec 24 12:01:33 opsview-core01-tn attrd: [22672]: ERROR: cib_native_perform_op: Sending message to CIB service FAILED
Dec 24 12:01:33 opsview-core01-tn attrd: [22672]: info: attrd_perform_update: Sent update -5: probe_complete=true
Dec 24 12:01:33 opsview-core01-tn attrd: [22672]: ERROR: attrd_cib_callback: Update -5 for probe_complete=true failed: send failed
Dec 24 12:01:33 opsview-core01-tn attrd: [22672]: ERROR: send_ais_message: Not connected to AIS
Dec 24 12:01:33 opsview-core01-tn attrd: [22672]: info: attrd_trigger_update: Sending flush op to all hosts for:
master-drbd_data:1 (<null>)
Dec 24 12:01:33 opsview-core01-tn attrd: [22672]: ERROR: send_ipc_message: IPC Channel to 22670 is not connected
Dec 24 12:01:33 opsview-core01-tn attrd: [22672]: ERROR: cib_native_perform_op: Sending message to CIB service FAILED
Dec 24 12:01:33 opsview-core01-tn attrd: [22672]: info: attrd_perform_update: Delete operation failed: node=opsview-core01-tn,
attr=master-drbd_data:1, id=<n/a>, set=(null), section=status: send failed (-5)
***********************
So, my questions:
What is wrong here? It seems everything starts when corosync on the secondary node crashes (or stops) as soon as I disconnect the cable (because of the "Library error"?).
Once the crashes are solved, how (and when) should the unfence step be executed? Shouldn't it happen automatically?
Do I always have to remove the location entry manually from the crm config?
Sorry for the long mail and thanks for the support!
Simon
Config files:
*************************************
cat /etc/corosync/corosync.conf
compatibility: whitetank
totem {
    version: 2
    # How long before declaring a token lost (ms)
    token: 2000
    # How many token retransmits before forming a new configuration
    token_retransmits_before_loss_const: 10
    # How long to wait for join messages in the membership protocol (ms)
    join: 200
    # How long to wait for consensus to be achieved before starting a new round of membership configuration (ms)
    consensus: 1000
    vsftype: none
    # Number of messages that may be sent by one processor on receipt of the token
    max_messages: 20
    send_join: 0
    # Limit generated nodeids to 31-bits (positive signed integers)
    clear_node_high_bit: yes
    secauth: off
    threads: 0
    rrp_mode: active

    interface {
        ringnumber: 0
        bindnetaddr: 192.168.100.0
        mcastaddr: 226.100.1.1
        mcastport: 4000
    }
    interface {
        ringnumber: 1
        bindnetaddr: 172.18.17.0
        #broadcast: yes
        mcastaddr: 227.100.1.2
        mcastport: 4001
    }
}

logging {
    fileline: off
    to_stderr: no
    to_logfile: yes
    to_syslog: yes
    logfile: /var/log/cluster/corosync.log
    debug: off
    timestamp: on
    logger_subsys {
        subsys: AMF
        debug: off
    }
}

amf {
    mode: disabled
}

aisexec {
    user: root
    group: root
}

service {
    # Load the Pacemaker Cluster Resource Manager
    name: pacemaker
    ver: 0
}
*************************************
cat /etc/drbd.conf
global {
    usage-count no;
}

common {
    protocol C;

    syncer {
        rate 70M;
        verify-alg sha1;
    }

    net {
        after-sb-0pri disconnect;
        after-sb-1pri disconnect;
        after-sb-2pri disconnect;
        rr-conflict disconnect;
    }

    handlers {
        pri-on-incon-degr "echo o > /proc/sysrq-trigger ; halt -f";
        pri-lost-after-sb "echo b > /proc/sysrq-trigger ; reboot -f";
        local-io-error "echo o > /proc/sysrq-trigger ; halt -f";
        fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
        after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
    }

    startup {
        degr-wfc-timeout 120; # 2 minutes.
    }

    disk {
        fencing resource-only;
        on-io-error call-local-io-error;
    }
}

resource cluster_data {
    device /dev/drbd1;
    disk /dev/sda4;
    meta-disk internal;

    on opsview-core01-tn {
        address 192.168.100.11:7789;
    }
    on opsview-core02-tn {
        address 192.168.100.12:7789;
    }
}
*************************************
crm configure show
node opsview-core01-tn \
    attributes standby="off"
node opsview-core02-tn \
    attributes standby="off"
primitive ClusterIP01 ocf:heartbeat:IPaddr2 \
    params ip="172.18.17.10" cidr_netmask="32" \
    op monitor interval="30"
primitive ServerFS ocf:heartbeat:Filesystem \
    params device="/dev/drbd1" directory="/data" fstype="ext3"
primitive WebSite ocf:heartbeat:apache \
    params configfile="/etc/httpd/conf/httpd.conf" \
    op monitor interval="1min" \
    meta target-role="Started"
primitive drbd_data ocf:linbit:drbd \
    params drbd_resource="cluster_data" \
    op monitor interval="60s"
primitive opsview-core_lsb lsb:opsview \
    op start interval="0" timeout="350s" \
    op stop interval="0" timeout="350s" \
    op monitor interval="60s" timeout="350s"
primitive opsview-web_lsb lsb:opsview-web \
    op start interval="0" timeout="350s" start-delay="15s" \
    op stop interval="0" timeout="350s" \
    op monitor interval="60s" timeout="350s" \
    meta target-role="Started"
group OPSView-Apps ServerFS ClusterIP01 opsview-core_lsb opsview-web_lsb WebSite \
    meta target-role="Started"
ms ServerData drbd_data \
    meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" target-role="Master"
colocation fs_on_drbd inf: OPSView-Apps ServerData:Master
order ServerFS-after-ServerData inf: ServerData:promote OPSView-Apps:start
property $id="cib-bootstrap-options" \
    dc-version="1.0.10-da7075976b5ff0bee71074385f8fd02f296ec8a3" \
    cluster-infrastructure="openais" \
    expected-quorum-votes="2" \
    stonith-enabled="false" \
    no-quorum-policy="ignore"
rsc_defaults $id="rsc-options" \
    resource-stickiness="100"