[Pacemaker] Strange split-brain behavior
Raoul Bhatia [IPAX]
r.bhatia at ipax.at
Thu Apr 17 12:31:29 UTC 2008
hi,
i have a cluster consisting of two servers: wc01 and wc02.
no stonith is enabled.
i started wc02; wc02 is plugged into the switch.
i started wc01, which at this point has no connection to wc02.

after both servers (and heartbeat) are started, i plug wc01 into the
switch. the two nodes find each other, but they remain in split-brain mode.
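
i don't think the ha.cf is special, but for completeness it looks
roughly like the sketch below (the interface name and the timing values
are just examples, not necessarily what is on the boxes):

> # /etc/ha.d/ha.cf (sketch)
> logfile /var/log/ha-log
> debugfile /var/log/ha-debug
> keepalive 2
> deadtime 30
> initdead 120
> udpport 694
> bcast eth0
> node wc01
> node wc02
> crm yes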
ok, now i tried to configure stonith (via ssh, as it's only a test setup).
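
what i was trying to load is roughly the snippet below (only a sketch -
the ids, the external/ssh plugin and the hostlist value are just what
i'd use for a test):

> <primitive id="stonith-ssh" class="stonith" type="external/ssh">
>   <instance_attributes id="stonith-ssh-ia">
>     <attributes>
>       <nvpair id="stonith-ssh-hostlist" name="hostlist" value="wc01 wc02"/>
>     </attributes>
>   </instance_attributes>
> </primitive>

i.e. push it with cibadmin and then switch stonith on, along the lines of:

> # cibadmin -U -o resources -x stonith-ssh.xml
> # crm_attribute -t crm_config -n stonith-enabled -v true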
1) cibadmin -U does not work: it times out, and strace shows:
> sendto(5, "i\0\0\0\315\253\0\0>>>\ncib_op=register\ncib_"..., 113, MSG_DONTWAIT|MSG_NOSIGNAL, NULL, 0) = 113
> recvfrom(5, "K\0\0\0\315\253\0\0>>>\ncib_op=register\ncib_"..., 4048, MSG_DONTWAIT, NULL, NULL) = 83
> poll([{fd=5, events=0}], 1, 0) = 0
> recvfrom(5, 0x561deb, 3965, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
> poll([{fd=5, events=0}], 1, 0) = 0
> recvfrom(5, 0x561deb, 3965, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
> poll([{fd=5, events=0}], 1, 0) = 0
> recvfrom(5, 0x561deb, 3965, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
> poll([{fd=5, events=0}], 1, 0) = 0
> recvfrom(5, 0x561deb, 3965, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
> poll([{fd=5, events=0}], 1, 0) = 0
> brk(0x589000) = 0x589000
> brk(0x5aa000) = 0x5aa000
> brk(0x5cb000) = 0x5cb000
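
for reference, these are the kind of checks i ran to see whether the
local daemons are at least still there (i'll spare you the output):

> # ps -ef | egrep 'heartbeat|ccm|cib|crmd'
> # crmadmin -D
> # crmadmin -S wc01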
now i unplugged wc01 again, and wc02 noticed this. i tried cibadmin -U
once more, without success. i then tried to restart heartbeat - that did
not work either.
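
by "restart" i just mean the usual init script:

> # /etc/init.d/heartbeat restart

the shutdown request it triggers is what shows up in the ha-log excerpt
further down.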
the strange thing is:
> # crm_mon -1|grep DC
> Current DC: wc02 (f36760d8-d84a-46b2-b452-4c8cac8b3396)
and
> # tail -n 4 /var/log/ha-log
> heartbeat[2416]: 2008/04/17_14:16:40 info: killing /usr/lib/heartbeat/crmd process group 2483 with signal 15
> crmd[2483]: 2008/04/17_14:16:40 info: crm_shutdown: Requesting shutdown
> crmd[2483]: 2008/04/17_14:16:40 info: do_shutdown_req: Sending shutdown request to DC: <null>
> cib[2479]: 2008/04/17_14:20:15 info: cib_stats: Processed 963 operations (6573.00us average, 1% utilization) in the last 10min
what is the problem here? is it pacemaker or linux-ha (heartbeat) that
makes the cluster react this way?
cheers,
raoul
> # dpkg -l|egrep -i "(heartbeat|stonith|pacemaker)"
> ii heartbeat 2.1.3-18 Subsystem for High-Availability Linux
> ii heartbeat-2 2.1.3-18 Subsystem for High-Availability Linux
> ii libstonith0 2.1.3-18 Interface for remotely powering down a node
> ii pacemaker 0.6.2-1 High-Availability cluster resource manager f
> ii stonith 2.1.3-18 Interface for remotely powering down a node
--
____________________________________________________________________
DI (FH) Raoul Bhatia M.Sc. email. r.bhatia at ipax.at
Technical Manager
IPAX - Aloy Bhatia Hava OEG web. http://www.ipax.at
Barawitzkagasse 10/2/2/11 email. office at ipax.at
1190 Wien tel. +43 1 3670030
FN 277995t HG Wien fax. +43 1 3670030 15
____________________________________________________________________
-------------- attachments --------------
Name: wc02.ha-debug.gz   Type: application/x-gzip   Size: 133924 bytes
URL: <https://lists.clusterlabs.org/pipermail/pacemaker/attachments/20080417/b0e71f6d/attachment-0003.bin>
Name: wc02.ha-log.gz   Type: application/x-gzip   Size: 111869 bytes
URL: <https://lists.clusterlabs.org/pipermail/pacemaker/attachments/20080417/b0e71f6d/attachment-0004.bin>
Name: wc02.report.tar.gz   Type: application/x-gzip   Size: 131258 bytes
URL: <https://lists.clusterlabs.org/pipermail/pacemaker/attachments/20080417/b0e71f6d/attachment-0005.bin>