[Pacemaker] rejoin failure

Lazy lazy404 at gmail.com
Fri Dec 14 20:15:18 EST 2012


Hi,

We have a two-node corosync 1.4.2 and Pacemaker 1.1.7 cluster running
DRBD, NFS, Solr and Redis in master/slave configurations.

Currently node 2 (www2) is unable to rejoin the cluster after being fenced by STONITH.
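
For context, the DRBD master/slave piece of the configuration is
roughly of this shape (a simplified crm shell sketch, not our exact
CIB; the drbd_resource name "home" is only a placeholder):

primitive drbd ocf:linbit:drbd \
        params drbd_resource="home" \
        op monitor interval="29s" role="Master" \
        op monitor interval="31s" role="Slave"
ms ms_drbd drbd \
        meta master-max="1" master-node-max="1" \
        clone-max="2" clone-node-max="1" notify="true"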

The logs on node 2 (www2):
Dec 15 01:52:38 www2 cib: [6705]: info: ais_dispatch_message:
Membership 0: quorum still lost
Dec 15 01:52:38 www2 cib: [6705]: info: crm_update_peer: Node www2:
id=33610762 state=member (new) addr=(null) votes=1 (new) born=0 seen=0
proc=00000000000000000000000000111312 (new)
Dec 15 01:52:38 www2 stonith-ng: [6706]: info: get_ais_nodeid: Server
details: id=33610762 uname=www2 cname=pcmk
Dec 15 01:52:38 www2 stonith-ng: [6706]: info:
init_ais_connection_once: Connection to 'classic openais (with
plugin)': established
Dec 15 01:52:38 www2 stonith-ng: [6706]: info: crm_new_peer: Node www2
now has id: 33610762
Dec 15 01:52:38 www2 stonith-ng: [6706]: info: crm_new_peer: Node
33610762 is now known as www2
Dec 15 01:52:38 www2 attrd: [6708]: notice: main: Starting mainloop...
Dec 15 01:52:38 www2 stonith-ng: [6706]: notice: setup_cib: Watching
for stonith topology changes
Dec 15 01:52:38 www2 stonith-ng: [6706]: info: main: Starting
stonith-ng mainloop
Dec 15 01:52:38 www2 corosync[6682]:   [TOTEM ] Incrementing problem
counter for seqid 1 iface 46.248.167.141 to [1 of 10]
Dec 15 01:52:38 www2 corosync[6682]:   [pcmk  ] notice:
pcmk_peer_update: Transitional membership event on ring 11800: memb=0,
new=0, lost=0
Dec 15 01:52:38 www2 corosync[6682]:   [pcmk  ] notice:
pcmk_peer_update: Stable membership event on ring 11800: memb=1,
new=1, lost=0
Dec 15 01:52:38 www2 corosync[6682]:   [pcmk  ] info:
pcmk_peer_update: NEW:  www2 33610762
Dec 15 01:52:38 www2 corosync[6682]:   [pcmk  ] info:
pcmk_peer_update: MEMB: www2 33610762
Dec 15 01:52:38 www2 corosync[6682]:   [TOTEM ] A processor joined or
left the membership and a new membership was formed.
Dec 15 01:52:38 www2 corosync[6682]:   [CPG   ] chosen downlist:
sender r(0) ip(10.220.0.2) r(1) ip(46.248.167.141) ; members(old:0
left:0)
Dec 15 01:52:38 www2 corosync[6682]:   [MAIN  ] Completed service
synchronization, ready to provide service.
Dec 15 01:52:38 www2 corosync[6682]:   [pcmk  ] notice:
pcmk_peer_update: Transitional membership event on ring 11804: memb=1,
new=0, lost=0
Dec 15 01:52:38 www2 corosync[6682]:   [pcmk  ] info:
pcmk_peer_update: memb: www2 33610762
Dec 15 01:52:38 www2 corosync[6682]:   [pcmk  ] notice:
pcmk_peer_update: Stable membership event on ring 11804: memb=2,
new=1, lost=0
Dec 15 01:52:38 www2 corosync[6682]:   [pcmk  ] info: update_member:
Creating entry for node 16833546 born on 11804
Dec 15 01:52:38 www2 corosync[6682]:   [pcmk  ] info: update_member:
Node 16833546/unknown is now: member
Dec 15 01:52:38 www2 corosync[6682]:   [pcmk  ] info:
pcmk_peer_update: NEW:  .pending. 16833546
Dec 15 01:52:38 www2 corosync[6682]:   [pcmk  ] info:
pcmk_peer_update: MEMB: .pending. 16833546
Dec 15 01:52:38 www2 corosync[6682]:   [pcmk  ] info:
pcmk_peer_update: MEMB: www2 33610762
Dec 15 01:52:38 www2 corosync[6682]:   [pcmk  ] info:
send_member_notification: Sending membership update 11804 to 1
children
Dec 15 01:52:38 www2 corosync[6682]:   [pcmk  ] info: update_member:
0x200cdc0 Node 33610762 ((null)) born on: 11804
Dec 15 01:52:38 www2 corosync[6682]:   [TOTEM ] A processor joined or
left the membership and a new membership was formed.
Dec 15 01:52:38 www2 cib: [6705]: info: ais_dispatch_message:
Membership 11804: quorum still lost
Dec 15 01:52:38 www2 cib: [6705]: info: crm_new_peer: Node <null> now
has id: 16833546
Dec 15 01:52:38 www2 cib: [6705]: info: crm_update_peer: Node (null):
id=16833546 state=member (new) addr=r(0) ip(10.220.0.1) r(1)
ip(46.248.167.140)  votes=0 born=0 seen=11804
proc=00000000000000000000000000000000
Dec 15 01:52:38 www2 cib: [6705]: info: crm_update_peer: Node www2:
id=33610762 state=member addr=r(0) ip(10.220.0.2) r(1)
ip(46.248.167.141)  (new) votes=1 born=0 seen=11804
proc=00000000000000000000000000111312
Dec 15 01:52:38 www2 corosync[6682]:   [pcmk  ] info: update_member:
0x20157a0 Node 16833546 (www1) born on: 11708
Dec 15 01:52:38 www2 corosync[6682]:   [pcmk  ] info: update_member:
0x20157a0 Node 16833546 now known as www1 (was: (null))
Dec 15 01:52:38 www2 corosync[6682]:   [pcmk  ] info: update_member:
Node www1 now has process list: 00000000000000000000000000111312
(1118994)
Dec 15 01:52:38 www2 corosync[6682]:   [pcmk  ] info: update_member:
Node www1 now has 1 quorum votes (was 0)
Dec 15 01:52:38 www2 corosync[6682]:   [pcmk  ] info:
send_member_notification: Sending membership update 11804 to 1
children
Dec 15 01:52:38 www2 cib: [6705]: notice: ais_dispatch_message:
Membership 11804: quorum acquired
Dec 15 01:52:38 www2 cib: [6705]: info: crm_get_peer: Node 16833546 is
now known as www1
Dec 15 01:52:38 www2 cib: [6705]: info: crm_update_peer: Node www1:
id=16833546 state=member addr=r(0) ip(10.220.0.1) r(1)
ip(46.248.167.140)  votes=1 (new) born=11708 seen=11804
proc=00000000000000000000000000111312 (new)
Dec 15 01:52:38 www2 corosync[6682]:   [pcmk  ] WARN:
route_ais_message: Sending message to local.crmd failed: ipc delivery
failed (rc=-2)
Dec 15 01:52:38 www2 corosync[6682]:   [CPG   ] chosen downlist:
sender r(0) ip(10.220.0.1) r(1) ip(46.248.167.140) ; members(old:1
left:0)
Dec 15 01:52:38 www2 corosync[6682]:   [MAIN  ] Completed service
synchronization, ready to provide service.
Dec 15 01:52:38 www2 cib: [6705]: info: cib_process_diff: Diff
0.407.26 -> 0.407.27 not applied to 0.406.0: current "epoch" is less
than required
Dec 15 01:52:38 www2 cib: [6705]: info: cib_server_process_diff:
Requesting re-sync from peer
Dec 15 01:52:38 www2 cib: [6705]: notice: cib_server_process_diff: Not
applying diff 0.407.27 -> 0.407.28 (sync in progress)
Dec 15 01:52:38 www2 cib: [6705]: notice: cib_server_process_diff: Not
applying diff 0.407.28 -> 0.407.29 (sync in progress)
Dec 15 01:52:38 www2 cib: [6705]: notice: cib_server_process_diff: Not
applying diff 0.407.29 -> 0.407.30 (sync in progress)
Dec 15 01:52:38 www2 cib: [6705]: notice: cib_server_process_diff: Not
applying diff 0.407.30 -> 0.407.31 (sync in progress)
Dec 15 01:52:38 www2 corosync[6682]:   [pcmk  ] WARN:
route_ais_message: Sending message to local.crmd failed: ipc delivery
failed (rc=-2)
Dec 15 01:52:38 www2 cib: [6705]: notice: cib_server_process_diff: Not
applying diff 0.407.31 -> 0.407.32 (sync in progress)
Dec 15 01:52:38 www2 cib: [6705]: info: cib_process_diff: Diff
0.407.32 -> 0.407.33 not applied to 0.406.0: current "epoch" is less
than required
Dec 15 01:52:38 www2 cib: [6705]: info: cib_server_process_diff:
Requesting re-sync from peer
Dec 15 01:52:38 www2 cib: [6705]: notice: cib_server_process_diff: Not
applying diff 0.407.33 -> 0.407.34 (sync in progress)
Dec 15 01:52:38 www2 cib: [6705]: notice: cib_server_process_diff: Not
applying diff 0.407.34 -> 0.407.35 (sync in progress)
Dec 15 01:52:38 www2 corosync[6682]:   [pcmk  ] WARN:
route_ais_message: Sending message to local.crmd failed: ipc delivery
failed (rc=-2)
Dec 15 01:52:38 www2 cib: [6705]: notice: cib_server_process_diff: Not
applying diff 0.407.35 -> 0.407.36 (sync in progress)
Dec 15 01:52:39 www2 crmd: [6710]: info: do_cib_control: CIB
connection established
Dec 15 01:52:39 www2 crmd: [6710]: info: get_cluster_type: Cluster
type is: 'openais'
Dec 15 01:52:39 www2 crmd: [6710]: notice: crm_cluster_connect:
Connecting to cluster infrastructure: classic openais (with plugin)
Dec 15 01:52:39 www2 crmd: [6710]: info: init_ais_connection_classic:
Creating connection to our Corosync plugin
Dec 15 01:52:39 www2 crmd: [6710]: info: init_ais_connection_classic:
AIS connection established
Dec 15 01:52:39 www2 corosync[6682]:   [pcmk  ] info: pcmk_ipc:
Recorded connection 0x2024af0 for crmd/6710
Dec 15 01:52:39 www2 corosync[6682]:   [pcmk  ] info: pcmk_ipc:
Sending membership update 11804 to crmd
Dec 15 01:52:39 www2 crmd: [6710]: info: get_ais_nodeid: Server
details: id=33610762 uname=www2 cname=pcmk
Dec 15 01:52:39 www2 crmd: [6710]: info: init_ais_connection_once:
Connection to 'classic openais (with plugin)': established
Dec 15 01:52:39 www2 crmd: [6710]: info: crm_new_peer: Node www2 now
has id: 33610762
Dec 15 01:52:39 www2 crmd: [6710]: info: crm_new_peer: Node 33610762
is now known as www2
Dec 15 01:52:39 www2 crmd: [6710]: info: ais_status_callback: status:
www2 is now unknown
Dec 15 01:52:39 www2 crmd: [6710]: info: do_ha_control: Connected to the cluster
Dec 15 01:52:39 www2 crmd: [6710]: info: do_started: Delaying start,
no membership data (0000000000100000)
Dec 15 01:52:39 www2 crmd: [6710]: notice: ais_dispatch_message:
Membership 11804: quorum acquired
Dec 15 01:52:39 www2 crmd: [6710]: info: crm_new_peer: Node www1 now
has id: 16833546
Dec 15 01:52:39 www2 crmd: [6710]: info: crm_new_peer: Node 16833546
is now known as www1
Dec 15 01:52:39 www2 crmd: [6710]: info: ais_status_callback: status:
www1 is now unknown
Dec 15 01:52:39 www2 crmd: [6710]: info: ais_status_callback: status:
www1 is now member (was unknown)
Dec 15 01:52:39 www2 crmd: [6710]: info: crm_update_peer: Node www1:
id=16833546 state=member (new) addr=r(0) ip(10.220.0.1) r(1)
ip(46.248.167.140)  votes=1
Dec 15 01:52:39 www2 crmd: [6710]: notice: crmd_peer_update: Status
update: Client www2/crmd now has status [online] (DC=<null>)
Dec 15 01:52:39 www2 crmd: [6710]: info: ais_status_callback: status:
www2 is now member (was unknown)
Dec 15 01:52:39 www2 crmd: [6710]: info: crm_update_peer: Node www2:
id=33610762 state=member (new) addr=r(0) ip(10.220.0.2) r(1)
ip(46.248.167.141)  (new) votes=1 (new) born=11804 seen=11804
proc=00000000000000000000000000111312 (new)
Dec 15 01:52:39 www2 crmd: [6710]: info: do_started: Delaying start,
Config not read (0000000000000040)
Dec 15 01:52:39 www2 crmd: [6710]: notice: do_started: The local CRM
is operational
Dec 15 01:52:39 www2 crmd: [6710]: notice: do_state_transition: State
transition S_STARTING -> S_PENDING [ input=I_PENDING
cause=C_FSA_INTERNAL origin=do_started ]
Dec 15 01:52:39 www2 corosync[6682]:   [pcmk  ] info:
update_expected_votes: Expected quorum votes 2 -> 3
Dec 15 01:52:40 www2 crmd: [6710]: info: ais_dispatch_message:
Membership 11804: quorum retained
Dec 15 01:52:40 www2 corosync[6682]:   [TOTEM ] ring 1 active with no faults
Dec 15 01:52:41 www2 crmd: [6710]: info: update_dc: Set DC to www1 (3.0.6)
Dec 15 01:52:41 www2 cib: [6705]: info: cib_replace_notify: Replaced:
-1.-1.-1 -> 0.407.36 from www1
Dec 15 01:52:41 www2 crmd: [6710]: info: erase_status_tag: Deleting
xpath: //node_state[@uname='www2']/transient_attributes
Dec 15 01:52:41 www2 crmd: [6710]: info: update_attrd: Connecting to attrd...
Dec 15 01:52:41 www2 crmd: [6710]: notice: do_state_transition: State
transition S_PENDING -> S_NOT_DC [ input=I_NOT_DC cause=C_HA_MESSAGE
origin=do_cl_join_finalize_respond ]
Dec 15 01:52:41 www2 attrd: [6708]: notice: attrd_local_callback:
Sending full refresh (origin=crmd)
Dec 15 01:52:41 www2 crmd: [6710]: info: ais_dispatch_message:
Membership 11804: quorum retained
Dec 15 01:52:41 www2 lrmd: [6707]: info: rsc:drbd:1 probe[2] (pid 6718)
Dec 15 01:52:41 www2 lrmd: [6707]: info: rsc:nfs_ip probe[3] (pid 6719)
Dec 15 01:52:41 www2 lrmd: [6707]: info: rsc:fs_home probe[4] (pid 6720)
Dec 15 01:52:41 www2 lrmd: [6707]: notice: lrmd_rsc_new(): No
lrm_rprovider field in message
Dec 15 01:52:41 www2 lrmd: [6707]: info: rsc:nfs-kernel-server
probe[5] (pid 6722)
Dec 15 01:52:41 www2 lrmd: [6707]: notice: lrmd_rsc_new(): No
lrm_rprovider field in message
Dec 15 01:52:41 www2 lrmd: [6707]: notice: lrmd_rsc_new(): No
lrm_rprovider field in message
Dec 15 01:52:41 www2 lrmd: [6707]: info: operation monitor[3] on
nfs_ip for client 6710: pid 6719 exited with return code 7
Dec 15 01:52:41 www2 lrmd: [6707]: info: operation monitor[4] on
fs_home for client 6710: pid 6720 exited with return code 7
Dec 15 01:52:41 www2 crmd: [6710]: info: process_lrm_event: LRM
operation nfs_ip_monitor_0 (call=3, rc=7, cib-update=8,
confirmed=true) not running
Dec 15 01:52:41 www2 crmd: [6710]: info: process_lrm_event: LRM
operation fs_home_monitor_0 (call=4, rc=7, cib-update=9,
confirmed=true) not running
Dec 15 01:52:41 www2 lrmd: [6707]: info: operation monitor[5] on
nfs-kernel-server for client 6710: pid 6722 exited with return code 7
(mapped from 3)
Dec 15 01:52:41 www2 crmd: [6710]: info: process_lrm_event: LRM
operation nfs-kernel-server_monitor_0 (call=5, rc=7, cib-update=10,
confirmed=true) not running
Dec 15 01:52:41 www2 lrmd: [6707]: info: operation monitor[2] on
drbd:1 for client 6710: pid 6718 exited with return code 7
Dec 15 01:52:41 www2 crmd: [6710]: info: process_lrm_event: LRM
operation drbd:1_monitor_0 (call=2, rc=7, cib-update=11,
confirmed=true) not running
Dec 15 01:52:42 www2 lrmd: [6707]: info: rsc:nfs-common probe[6] (pid 6829)
Dec 15 01:52:42 www2 lrmd: [6707]: info: rsc:fs_nfs_home probe[7] (pid 6830)
Dec 15 01:52:42 www2 lrmd: [6707]: info: rsc:haproxy_ip probe[8] (pid 6831)
Dec 15 01:52:42 www2 lrmd: [6707]: info: rsc:redis_ip probe[9] (pid 6832)
Dec 15 01:52:42 www2 lrmd: [6707]: info: operation monitor[9] on
redis_ip for client 6710: pid 6832 exited with return code 7
Dec 15 01:52:42 www2 crmd: [6710]: info: process_lrm_event: LRM
operation redis_ip_monitor_0 (call=9, rc=7, cib-update=12,
confirmed=true) not running
Dec 15 01:52:42 www2 lrmd: [6707]: info: operation monitor[8] on
haproxy_ip for client 6710: pid 6831 exited with return code 7
Dec 15 01:52:42 www2 crmd: [6710]: info: process_lrm_event: LRM
operation haproxy_ip_monitor_0 (call=8, rc=7, cib-update=13,
confirmed=true) not running
Dec 15 01:52:42 www2 lrmd: [6707]: info: operation monitor[6] on
nfs-common for client 6710: pid 6829 exited with return code 7 (mapped
from 3)
Dec 15 01:52:42 www2 crmd: [6710]: info: process_lrm_event: LRM
operation nfs-common_monitor_0 (call=6, rc=7, cib-update=14,
confirmed=true) not running
Dec 15 01:52:42 www2 lrmd: [6707]: info: operation monitor[7] on
fs_nfs_home for client 6710: pid 6830 exited with return code 7
Dec 15 01:52:42 www2 crmd: [6710]: info: process_lrm_event: LRM
operation fs_nfs_home_monitor_0 (call=7, rc=7, cib-update=15,
confirmed=true) not running
Dec 15 01:52:43 www2 lrmd: [6707]: info: rsc:redis_server probe[10] (pid 6922)
Dec 15 01:52:43 www2 lrmd: [6707]: info: rsc:res_haproxy_1 probe[11] (pid 6923)
Dec 15 01:52:43 www2 lrmd: [6707]: info: operation monitor[11] on
res_haproxy_1 for client 6710: pid 6923 exited with return code 7
(mapped from 3)
Dec 15 01:52:43 www2 crmd: [6710]: info: process_lrm_event: LRM
operation res_haproxy_1_monitor_0 (call=11, rc=7, cib-update=16,
confirmed=true) not running
Dec 15 01:52:43 www2 redis[6922]: INFO: Could not read PID,
redis-server is not running
Dec 15 01:52:43 www2 lrmd: [6707]: info: operation monitor[10] on
redis_server for client 6710: pid 6922 exited with return code 7
Dec 15 01:52:43 www2 crmd: [6710]: info: process_lrm_event: LRM
operation redis_server_monitor_0 (call=10, rc=7, cib-update=17,
confirmed=true) not running
Dec 15 01:53:41 www2 crmd: [6710]: WARN: cib_rsc_callback: Resource
update 8 failed: (rc=-41) Remote node did not respond
Dec 15 01:53:41 www2 crmd: [6710]: WARN: cib_rsc_callback: Resource
update 9 failed: (rc=-41) Remote node did not respond
Dec 15 01:53:41 www2 crmd: [6710]: WARN: cib_rsc_callback: Resource
update 10 failed: (rc=-41) Remote node did not respond
Dec 15 01:53:41 www2 crmd: [6710]: WARN: cib_rsc_callback: Resource
update 11 failed: (rc=-41) Remote node did not respond
Dec 15 01:53:42 www2 crmd: [6710]: WARN: cib_rsc_callback: Resource
update 12 failed: (rc=-41) Remote node did not respond
Dec 15 01:53:42 www2 crmd: [6710]: WARN: cib_rsc_callback: Resource
update 13 failed: (rc=-41) Remote node did not respond
Dec 15 01:53:42 www2 crmd: [6710]: WARN: cib_rsc_callback: Resource
update 14 failed: (rc=-41) Remote node did not respond
Dec 15 01:53:42 www2 crmd: [6710]: WARN: cib_rsc_callback: Resource
update 15 failed: (rc=-41) Remote node did not respond
Dec 15 01:53:43 www2 crmd: [6710]: WARN: cib_rsc_callback: Resource
update 16 failed: (rc=-41) Remote node did not respond
Dec 15 01:53:43 www2 crmd: [6710]: WARN: cib_rsc_callback: Resource
update 17 failed: (rc=-41) Remote node did not respond
Dec 15 01:54:01 www2 lrmd: [6707]: notice: lrmd_rsc_new(): No
lrm_rprovider field in message
Dec 15 01:54:01 www2 lrmd: [6707]: info: rsc:res_php5-fpm_1 probe[12] (pid 6958)
Dec 15 01:54:01 www2 lrmd: [6707]: notice: lrmd_rsc_new(): No
lrm_rprovider field in message
Dec 15 01:54:01 www2 lrmd: [6707]: info: rsc:res_php5-fpm_2:0
probe[13] (pid 6959)
Dec 15 01:54:01 www2 lrmd: [6707]: info: rsc:solr_ip probe[14] (pid 6960)
Dec 15 01:54:01 www2 lrmd: [6707]: info: rsc:solr_server:0 probe[15] (pid 6962)
Dec 15 01:54:01 www2 lrmd: [6707]: notice: lrmd_rsc_new(): No
lrm_rprovider field in message
Dec 15 01:54:01 www2 lrmd: [6707]: info: rsc:www1-stonith probe[16] (pid 6963)
Dec 15 01:54:01 www2 lrmd: [6707]: notice: lrmd_rsc_new(): No
lrm_rprovider field in message
Dec 15 01:54:01 www2 lrmd: [6707]: info: rsc:www2-stonith probe[17] (pid 6964)
Dec 15 01:54:01 www2 stonith-ng: [6706]: notice:
stonith_device_action: Device www1-stonith not found
Dec 15 01:54:01 www2 stonith-ng: [6706]: info: stonith_command:
Processed st_execute from lrmd: rc=-12


The log on www1 (the DC):
Dec 15 01:52:41 www1 attrd: [2498]: notice: attrd_trigger_update:
Sending flush op to all hosts for: probe_complete (true)
Dec 15 01:52:41 www1 pengine: [2499]: notice: unpack_config: On loss
of CCM Quorum: Ignore
Dec 15 01:52:41 www1 attrd: [2498]: notice: attrd_trigger_update:
Sending flush op to all hosts for: last-failure-solr_server:1
(1355303185)
Dec 15 01:52:41 www1 pengine: [2499]: notice: unpack_rsc_op: Operation
monitor found resource solr_server:1 active in master mode on www1
Dec 15 01:52:41 www1 pengine: [2499]: notice: unpack_rsc_op: Operation
monitor found resource drbd:0 active in master mode on www1
Dec 15 01:52:41 www1 attrd: [2498]: notice: attrd_trigger_update:
Sending flush op to all hosts for: master-drbd:0 (10000)
Dec 15 01:52:41 www1 pengine: [2499]: notice: LogActions: Start
drbd:1#011(www2)
Dec 15 01:52:41 www1 pengine: [2499]: notice: LogActions: Start
fs_nfs_home#011(www2)
Dec 15 01:52:41 www1 pengine: [2499]: notice: LogActions: Move
haproxy_ip#011(Started www1 -> www2)
Dec 15 01:52:41 www1 pengine: [2499]: notice: LogActions: Move
redis_ip#011(Started www1 -> www2)
Dec 15 01:52:41 www1 pengine: [2499]: notice: LogActions: Restart
redis_server#011(Started www1)
Dec 15 01:52:41 www1 pengine: [2499]: notice: LogActions: Move
res_haproxy_1#011(Started www1 -> www2)
Dec 15 01:52:41 www1 pengine: [2499]: notice: LogActions: Start
res_php5-fpm_2:0#011(www2)
Dec 15 01:52:41 www1 pengine: [2499]: notice: LogActions: Start
solr_server:0#011(www2)
Dec 15 01:52:41 www1 pengine: [2499]: notice: LogActions: Start
www1-stonith#011(www2)
Dec 15 01:52:41 www1 crmd: [2500]: notice: do_state_transition: State
transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS
cause=C_IPC_MESSAGE origin=handle_response ]
Dec 15 01:52:41 www1 crmd: [2500]: info: do_te_invoke: Processing
graph 3158 (ref=pe_calc-dc-1355532761-6586) derived from
/var/lib/pengine/pe-input-880.bz2
Dec 15 01:52:41 www1 crmd: [2500]: info: te_rsc_command: Initiating
action 11: monitor drbd:1_monitor_0 on www2
Dec 15 01:52:41 www1 crmd: [2500]: info: te_rsc_command: Initiating
action 12: monitor nfs_ip_monitor_0 on www2
Dec 15 01:52:41 www1 crmd: [2500]: info: te_rsc_command: Initiating
action 13: monitor fs_home_monitor_0 on www2
Dec 15 01:52:41 www1 crmd: [2500]: info: te_rsc_command: Initiating
action 14: monitor nfs-kernel-server_monitor_0 on www2
Dec 15 01:52:41 www1 crmd: [2500]: info: te_rsc_command: Initiating
action 15: monitor nfs-common_monitor_0 on www2
Dec 15 01:52:41 www1 crmd: [2500]: info: te_rsc_command: Initiating
action 16: monitor fs_nfs_home_monitor_0 on www2
Dec 15 01:52:41 www1 crmd: [2500]: info: te_rsc_command: Initiating
action 17: monitor haproxy_ip_monitor_0 on www2
Dec 15 01:52:41 www1 crmd: [2500]: info: te_rsc_command: Initiating
action 18: monitor redis_ip_monitor_0 on www2
Dec 15 01:52:41 www1 crmd: [2500]: info: te_rsc_command: Initiating
action 19: monitor redis_server_monitor_0 on www2
Dec 15 01:52:41 www1 crmd: [2500]: info: te_rsc_command: Initiating
action 20: monitor res_haproxy_1_monitor_0 on www2
Dec 15 01:52:41 www1 pengine: [2499]: notice: process_pe_message:
Transition 3158: PEngine Input stored in:
/var/lib/pengine/pe-input-880.bz2
Dec 15 01:54:01 www1 crmd: [2500]: WARN: action_timer_callback: Timer
popped (timeout=20000, abort_level=0, complete=false)
Dec 15 01:54:01 www1 crmd: [2500]: ERROR: print_elem: Aborting
transition, action lost: [Action 11]: In-flight (id: drbd:1_monitor_0,
loc: www2, priority: 0)
Dec 15 01:54:01 www1 crmd: [2500]: info: abort_transition_graph:
action_timer_callback:535 - Triggered transition abort (complete=0) :
Action lost
Dec 15 01:54:01 www1 crmd: [2500]: WARN: cib_action_update: rsc_op 11:
drbd:1_monitor_0 on www2 timed out
Dec 15 01:54:01 www1 crmd: [2500]: info: create_operation_update:
cib_action_update: Updating resouce drbd:1 after Timed Out monitor op
(interval=0)
Dec 15 01:54:01 www1 crmd: [2500]: WARN: action_timer_callback: Timer
popped (timeout=20000, abort_level=1000000, complete=false)
Dec 15 01:54:01 www1 crmd: [2500]: ERROR: print_elem: Aborting
transition, action lost: [Action 12]: In-flight (id: nfs_ip_monitor_0,
loc: www2, priority: 0)
Dec 15 01:54:01 www1 crmd: [2500]: info: abort_transition_graph:
action_timer_callback:535 - Triggered transition abort (complete=0) :
Action lost
Dec 15 01:54:01 www1 crmd: [2500]: WARN: cib_action_update: rsc_op 12:
nfs_ip_monitor_0 on www2 timed out
Dec 15 01:54:01 www1 crmd: [2500]: info: create_operation_update:
cib_action_update: Updating resouce nfs_ip after Timed Out monitor op
(interval=0)
Dec 15 01:54:01 www1 crmd: [2500]: WARN: action_timer_callback: Timer
popped (timeout=20000, abort_level=1000000, complete=false)

cibadmin -Q does not work on www2 while corosync is up.
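
The on-disk copy can at least still be dumped by pointing cibadmin at
the file instead of the live CIB -- a sketch using the standard
CIB_file mechanism and the path mentioned below:

CIB_file=/var/lib/heartbeat/crm/cib.xml cibadmin -Q | head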

On www1, www2 shows as online in crm_mon.

After the STONITH reboot of www2, /var/lib/heartbeat/crm/cib.xml was
owned by root. I fixed the ownership, and the cib.xml files are now
identical on both nodes.
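
The fix was along these lines (a sketch; hacluster:haclient with mode
0600 is what Pacemaker normally uses for its CIB files, and the last
command is just a quick way to compare the CIB version on the two
nodes):

# restore the ownership/permissions Pacemaker expects on the CIB files
chown -R hacluster:haclient /var/lib/heartbeat/crm
chmod 600 /var/lib/heartbeat/crm/cib.xml*

# the <cib> tag carries the epoch/num_updates version attributes
grep 'epoch=' /var/lib/heartbeat/crm/cib.xml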

corosync-cfgtool -s reports no faults.

The corosync-objctl output also looks OK:
runtime.totem.pg.mrp.srp.members.33610762.ip=r(0) ip(10.220.0.2) r(1)
ip(46.x.x.141)
runtime.totem.pg.mrp.srp.members.33610762.join_count=1
runtime.totem.pg.mrp.srp.members.33610762.status=joined
runtime.totem.pg.mrp.srp.members.16833546.ip=r(0) ip(10.220.0.1) r(1)
ip(46.x.x.140)
runtime.totem.pg.mrp.srp.members.16833546.join_count=1
runtime.totem.pg.mrp.srp.members.16833546.status=joined



Kind Regards,

Michal Grzedzicki



