[Pacemaker] communications problems in cluster

Саша Александров shurrman at gmail.com
Mon Oct 13 09:47:13 EDT 2014


Hi!

I have been building a cluster with pacemaker + pacemaker-remote (CentOS
6.5, everything from the official repo).
While I had only a few resources, everything was fine. However, after I
added more VMs (currently 2 nodes and 10 VMs), I started to run into
problems (see below).
The strange thing is that when I start cman/pacemaker again some time
later, they seem to work fine for a while.

Oct 13 17:03:54 wings1 pacemakerd[26440]:   notice: pcmk_child_exit: Child
process crmd terminated with signal 13 (pid=30010, core=0)
Oct 13 17:03:54 wings1 lrmd[26448]:  warning: qb_ipcs_event_sendv:
new_event_notification (26448-30010-6): Bad file descriptor (9)
Oct 13 17:03:54 wings1 lrmd[26448]:  warning: send_client_notify:
Notification of client crmd/665bd130-2630-454b-9102-3f17d2bd71f3 failed
Oct 13 17:03:54 wings1 pacemakerd[26440]:   notice: pcmk_process_exit:
Respawning failed child process: crmd
Oct 13 17:03:54 wings1 lrmd[26448]:  warning: send_client_notify:
Notification of client crmd/665bd130-2630-454b-9102-3f17d2bd71f3 failed
Oct 13 17:03:54 wings1 lrmd[26448]:  warning: send_client_notify:
Notification of client crmd/665bd130-2630-454b-9102-3f17d2bd71f3 failed
Oct 13 17:03:54 wings1 lrmd[26448]:  warning: send_client_notify:
Notification of client crmd/665bd130-2630-454b-9102-3f17d2bd71f3 failed
Oct 13 17:03:54 wings1 lrmd[26448]:  warning: send_client_notify:
Notification of client crmd/665bd130-2630-454b-9102-3f17d2bd71f3 failed
Oct 13 17:03:54 wings1 lrmd[26448]:  warning: send_client_notify:
Notification of client crmd/665bd130-2630-454b-9102-3f17d2bd71f3 failed

Oct 13 17:03:57 wings1 pacemakerd[26440]:   notice: pcmk_child_exit: Child
process crmd terminated with signal 13 (pid=30603, core=0)
Oct 13 17:03:57 wings1 lrmd[26448]:  warning: qb_ipcs_event_sendv:
new_event_notification (26448-30603-6): Bad file descriptor (9)
Oct 13 17:03:57 wings1 lrmd[26448]:  warning: send_client_notify:
Notification of client crmd/820ac884-24ca-4fff-9dc8-0a09e82e0e0a failed
Oct 13 17:03:57 wings1 pacemakerd[26440]:   notice: pcmk_process_exit:
Respawning failed child process: crmd
Oct 13 17:03:57 wings1 lrmd[26448]:  warning: send_client_notify:
Notification of client crmd/820ac884-24ca-4fff-9dc8-0a09e82e0e0a failed
Oct 13 17:03:57 wings1 lrmd[26448]:  warning: send_client_notify:
Notification of client crmd/820ac884-24ca-4fff-9dc8-0a09e82e0e0a failed
Oct 13 17:03:57 wings1 lrmd[26448]:  warning: send_client_notify:
Notification of client crmd/820ac884-24ca-4fff-9dc8-0a09e82e0e0a failed
Oct 13 17:03:57 wings1 lrmd[26448]:  warning: send_client_notify:
Notification of client crmd/820ac884-24ca-4fff-9dc8-0a09e82e0e0a failed
Oct 13 17:03:57 wings1 lrmd[26448]:  warning: send_client_notify:
Notification of client crmd/820ac884-24ca-4fff-9dc8-0a09e82e0e0a failed
Oct 13 17:03:57 wings1 crmd[31192]:   notice: crm_add_logfile: Additional
logging available in /var/log/cluster/corosync.log
Oct 13 17:03:57 wings1 cib[26446]:  warning: qb_ipcs_event_sendv:
new_event_notification (26446-30603-11): Broken pipe (32)
Oct 13 17:03:57 wings1 cib[26446]:  warning: cib_notify_send_one:
Notification of client crmd/fe944296-b3a1-4177-a94c-650568e8ff0a failed

..................

So crmd keeps restarting; I even had to unmanage the resources and stop
pacemaker/cman.

Oct 13 17:04:13 wings1 lrmd[26448]:  warning: qb_ipcs_event_sendv:
new_event_notification (26448-32444-6): Bad file descriptor (9)
Oct 13 17:04:13 wings1 lrmd[26448]:  warning: send_client_notify:
Notification of client crmd/ea7ab099-1005-450b-9e46-d9d13ea266e4 failed
Oct 13 17:04:13 wings1 lrmd[26448]:  warning: send_client_notify:
Notification of client crmd/ea7ab099-1005-450b-9e46-d9d13ea266e4 failed
Oct 13 17:04:13 wings1 pacemakerd[26440]:   notice: pcmk_child_exit: Child
process crmd terminated with signal 13 (pid=32444, core=0)
Oct 13 17:04:13 wings1 pacemakerd[26440]:   notice: pcmk_process_exit:
Respawning failed child process: crmd
Oct 13 17:04:13 wings1 lrmd[26448]:  warning: send_client_notify:
Notification of client crmd/ea7ab099-1005-450b-9e46-d9d13ea266e4 failed
Oct 13 17:04:13 wings1 lrmd[26448]:  warning: send_client_notify:
Notification of client crmd/ea7ab099-1005-450b-9e46-d9d13ea266e4 failed
Oct 13 17:04:13 wings1 lrmd[26448]:  warning: send_client_notify:
Notification of client crmd/ea7ab099-1005-450b-9e46-d9d13ea266e4 failed
Oct 13 17:04:13 wings1 lrmd[26448]:  warning: send_client_notify:
Notification of client crmd/ea7ab099-1005-450b-9e46-d9d13ea266e4 failed
Oct 13 17:04:13 wings1 cib[26446]:  warning: qb_ipcs_event_sendv:
new_event_notification (26446-32444-11): Broken pipe (32)
Oct 13 17:04:13 wings1 cib[26446]:  warning: cib_notify_send_one:
Notification of client crmd/ef727424-ce2b-4b3b-8749-82136dc72af8 failed
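
For reference, the crmd exits reported in the excerpts above can be tallied
with a short script. This is only a sketch: the sample text is copied from
the log lines above, while the function name tally_child_exits and the
regular expression are illustrative assumptions, not part of Pacemaker. On
Linux, signal 13 is SIGPIPE, which fits the "Broken pipe" warnings from cib.

```python
import re
from collections import Counter

# Sample lines copied from the excerpts above, with the mail client's
# line wrapping left in place.
SAMPLE = """\
Oct 13 17:03:54 wings1 pacemakerd[26440]:   notice: pcmk_child_exit: Child
process crmd terminated with signal 13 (pid=30010, core=0)
Oct 13 17:03:57 wings1 pacemakerd[26440]:   notice: pcmk_child_exit: Child
process crmd terminated with signal 13 (pid=30603, core=0)
Oct 13 17:04:13 wings1 pacemakerd[26440]:   notice: pcmk_child_exit: Child
process crmd terminated with signal 13 (pid=32444, core=0)
"""

# Hypothetical pattern written for this message format only.
EXIT_RE = re.compile(
    r"pcmk_child_exit: Child process (\w+) terminated with signal (\d+) "
    r"\(pid=(\d+), core=(\d+)\)"
)


def tally_child_exits(log_text):
    """Count (process, signal) pairs reported by pacemakerd."""
    # Undo the line wrapping and collapse runs of whitespace first.
    unwrapped = " ".join(log_text.split())
    return Counter(
        (m.group(1), int(m.group(2))) for m in EXIT_RE.finditer(unwrapped)
    )


counts = tally_child_exits(SAMPLE)
for (process, signum), n in counts.items():
    print(f"{process}: {n} exits with signal {signum}")
```

Running the same function over the full /var/log/messages would show
whether any child other than crmd is being killed the same way.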



One more thing (probably not related, but who knows): I have CentOS 7.0 on
one of the VMs, and LRMD is unable to establish communication with
pacemaker_remote on that VM:

(node):
Oct 13 17:31:43 wings1 crmd[3844]:    error: lrmd_tls_send_recv: Remote
lrmd server disconnected while waiting for reply with id 6.
Oct 13 17:31:45 wings1 crmd[3844]:    error: lrmd_tls_send_recv: Remote
lrmd server disconnected while waiting for reply with id 7.
Oct 13 17:31:47 wings1 crmd[3844]:    error: lrmd_tls_send_recv: Remote
lrmd server disconnected while waiting for reply with id 8.
Oct 13 17:31:48 wings1 crmd[3844]:    error: lrmd_tls_send_recv: Remote
lrmd server disconnected while waiting for reply with id 9.
Oct 13 17:31:50 wings1 crmd[3844]:    error: lrmd_tls_send_recv: Remote
lrmd server disconnected while waiting for reply with id 10.
Oct 13 17:31:51 wings1 crmd[3844]:    error: lrmd_tls_send_recv: Remote
lrmd server disconnected while waiting for reply with id 11.
Oct 13 17:31:53 wings1 crmd[3844]:    error: lrmd_tls_send_recv: Remote
lrmd server disconnected while waiting for reply with id 12.

(VM):
Oct 13 21:27:32 bank systemd: Started Pacemaker Remote Service.
Oct 13 21:27:32 bank pacemaker_remoted: Cannot change active directory to
/var/lib/pacemaker/cores: No such file or directory (2)
Oct 13 21:27:32 bank pacemaker_remoted[1853]: notice:
lrmd_init_remote_tls_server: Starting a tls listener on port 3121.
Oct 13 21:27:32 bank pacemaker_remoted[1853]: notice: bind_and_listen:
Listening on address ::
Oct 13 21:31:39 bank pacemaker_remoted[1853]: notice: lrmd_remote_listen:
LRMD client connection established. 0x1c49d60 id:
de49ea57-e94c-45bf-9d2d-d0f36cb2c4f7
Oct 13 21:31:40 bank pacemaker_remoted[1853]: error: crm_abort:
crm_remote_header: Triggered assert at remote.c:118 : endian == ENDIAN_LOCAL
Oct 13 21:31:40 bank pacemaker_remoted[1853]: error: crm_remote_header:
Invalid message detected, endian mismatch: badadbbd is neither 6d726c3c nor
the swab'd 3c6c726d
Oct 13 21:31:40 bank pacemaker_remoted[1853]: error: crm_abort:
crm_remote_header: Triggered assert at remote.c:118 : endian == ENDIAN_LOCAL
Oct 13 21:31:40 bank pacemaker_remoted[1853]: error: crm_remote_header:
Invalid message detected, endian mismatch: badadbbd is neither 6d726c3c nor
the swab'd 3c6c726d
Oct 13 21:31:40 bank pacemaker_remoted[1853]: error: crm_abort:
crm_remote_header: Triggered assert at remote.c:118 : endian == ENDIAN_LOCAL
Oct 13 21:31:40 bank pacemaker_remoted[1853]: error: crm_remote_header:
Invalid message detected, endian mismatch: badadbbd is neither 6d726c3c nor
the swab'd 3c6c726d
Oct 13 21:31:40 bank pacemaker_remoted[1853]: notice:
lrmd_remote_client_destroy: LRMD client disconnecting remote client - name:
<unknown> id: de49ea57-e94c-45bf-9d2d-d0f36cb2c4f7
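
The three values in the "endian mismatch" message can be checked by hand. A
minimal sketch, assuming nothing about Pacemaker's internals beyond the
numbers printed in the log; the helper name swab32 is made up for
illustration. It confirms that 3c6c726d is the byte-swapped form of
6d726c3c and that the received value badadbbd matches neither.

```python
import struct


def swab32(value):
    """Reverse the byte order of a 32-bit unsigned integer."""
    return struct.unpack("<I", struct.pack(">I", value))[0]


expected = 0x6D726C3C  # value the remote daemon says it expects
received = 0xBADADBBD  # value it actually found in the header
swapped = swab32(expected)

matches = received in (expected, swapped)

print(f"swab'd expected value: {swapped:08x}")
print(f"received value matches either form: {matches}")
# The swapped form is printable ASCII.
print(bytes.fromhex(f"{swapped:08x}"))
```

So the header that arrived is not a byte-order variant of the expected
value at all, which suggests the two sides disagree on more than
endianness.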



-- 
Best regards,
Alexandr