[Pacemaker] Strange split-brain behavior
Raoul Bhatia [IPAX]
r.bhatia at ipax.at
Thu Apr 17 12:31:29 UTC 2008
hi,
i have a cluster consisting of two servers: wc01 and wc02.
no stonith is enabled.
i started wc02; wc02 is plugged into the switch.
i started wc01, which at this point has no connection to wc02.

after both servers (and heartbeat) are started, i plug wc01 into the
switch. the two nodes find each other, but they remain in split-brain mode.
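
i don't think the ha.cf is special, but for completeness it looks
roughly like the sketch below (the interface name and the timing values
are just examples, not necessarily what is on the boxes):

> # /etc/ha.d/ha.cf (sketch)
> logfile /var/log/ha-log
> debugfile /var/log/ha-debug
> keepalive 2
> deadtime 30
> initdead 120
> udpport 694
> bcast eth0
> node wc01
> node wc02
> crm yes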
ok, now i tried to configure stonith (via ssh, as it's only a test setup).
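
what i was trying to load is roughly the snippet below (only a sketch -
the ids, the external/ssh plugin and the hostlist value are just what
i'd use for a test):

> <primitive id="stonith-ssh" class="stonith" type="external/ssh">
>   <instance_attributes id="stonith-ssh-ia">
>     <attributes>
>       <nvpair id="stonith-ssh-hostlist" name="hostlist" value="wc01 wc02"/>
>     </attributes>
>   </instance_attributes>
> </primitive>

i.e. push it with cibadmin and then switch stonith on, along the lines of:

> # cibadmin -U -o resources -x stonith-ssh.xml
> # crm_attribute -t crm_config -n stonith-enabled -v true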
1) cibadmin -U does not work: it times out, and strace shows:
> sendto(5, "i\0\0\0\315\253\0\0>>>\ncib_op=register\ncib_"..., 113, MSG_DONTWAIT|MSG_NOSIGNAL, NULL, 0) = 113
> recvfrom(5, "K\0\0\0\315\253\0\0>>>\ncib_op=register\ncib_"..., 4048, MSG_DONTWAIT, NULL, NULL) = 83
> poll([{fd=5, events=0}], 1, 0) = 0
> recvfrom(5, 0x561deb, 3965, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
> poll([{fd=5, events=0}], 1, 0) = 0
> recvfrom(5, 0x561deb, 3965, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
> poll([{fd=5, events=0}], 1, 0) = 0
> recvfrom(5, 0x561deb, 3965, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
> poll([{fd=5, events=0}], 1, 0) = 0
> recvfrom(5, 0x561deb, 3965, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
> poll([{fd=5, events=0}], 1, 0) = 0
> brk(0x589000) = 0x589000
> brk(0x5aa000) = 0x5aa000
> brk(0x5cb000) = 0x5cb000
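
for reference, these are the kind of checks i ran to see whether the
local daemons are at least still there (i'll spare you the output):

> # ps -ef | egrep 'heartbeat|ccm|cib|crmd'
> # crmadmin -D
> # crmadmin -S wc01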
now i unplugged wc01 again, and wc02 noticed this. i tried cibadmin -U
once more, without success. i then tried to restart heartbeat - that did
not work either.
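
by "restart" i just mean the usual init script:

> # /etc/init.d/heartbeat restart

the shutdown request it triggers is what shows up in the ha-log excerpt
further down.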
the strange thing is:
> # crm_mon -1|grep DC
> Current DC: wc02 (f36760d8-d84a-46b2-b452-4c8cac8b3396)
and
> # tail -n 4 /var/log/ha-log
> heartbeat[2416]: 2008/04/17_14:16:40 info: killing /usr/lib/heartbeat/crmd process group 2483 with signal 15
> crmd[2483]: 2008/04/17_14:16:40 info: crm_shutdown: Requesting shutdown
> crmd[2483]: 2008/04/17_14:16:40 info: do_shutdown_req: Sending shutdown request to DC: <null>
> cib[2479]: 2008/04/17_14:20:15 info: cib_stats: Processed 963 operations (6573.00us average, 1% utilization) in the last 10min
what is the problem here? is it pacemaker or linux-ha (heartbeat) that
makes the cluster react this way?
cheers,
raoul
> # dpkg -l|egrep -i "(heartbeat|stonith|pacemaker)"
> ii heartbeat 2.1.3-18 Subsystem for High-Availability Linux
> ii heartbeat-2 2.1.3-18 Subsystem for High-Availability Linux
> ii libstonith0 2.1.3-18 Interface for remotely powering down a node
> ii pacemaker 0.6.2-1 High-Availability cluster resource manager f
> ii stonith 2.1.3-18 Interface for remotely powering down a node
--
____________________________________________________________________
DI (FH) Raoul Bhatia M.Sc. email. r.bhatia at ipax.at
Technical Manager
IPAX - Aloy Bhatia Hava OEG web. http://www.ipax.at
Barawitzkagasse 10/2/2/11 email. office at ipax.at
1190 Wien tel. +43 1 3670030
FN 277995t HG Wien fax. +43 1 3670030 15
____________________________________________________________________
-------------- attachments --------------
Name: wc02.ha-debug.gz   Type: application/x-gzip   Size: 133924 bytes
URL: <https://lists.clusterlabs.org/pipermail/pacemaker/attachments/20080417/b0e71f6d/attachment-0003.bin>
Name: wc02.ha-log.gz   Type: application/x-gzip   Size: 111869 bytes
URL: <https://lists.clusterlabs.org/pipermail/pacemaker/attachments/20080417/b0e71f6d/attachment-0004.bin>
Name: wc02.report.tar.gz   Type: application/x-gzip   Size: 131258 bytes
URL: <https://lists.clusterlabs.org/pipermail/pacemaker/attachments/20080417/b0e71f6d/attachment-0005.bin>