[ClusterLabs] pacemaker after upgrade from wheezy to jessie

Toni Tschampke tt at halle.it
Thu Nov 10 03:47:46 EST 2016


> Did your upgrade documentation describe how to update the corosync
> configuration, and did that go well? crmd may be unable to function due
> to lack of quorum information.

Thanks for the tip; the corosync quorum configuration was indeed the cause.
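
For anyone hitting the same symptom: on corosync 2 the quorum provider
must be configured explicitly in corosync.conf, since the corosync 1
pacemaker plugin no longer exists. A minimal sketch (expected_votes is
illustrative for a three-node cluster):

```
quorum {
    provider: corosync_votequorum
    expected_votes: 3
}
```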

As we changed validate-with as well as the feature set manually in the 
cib, do we still need to issue the cibadmin --upgrade --force command, 
or is that command only for changing the schemas?
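
In case it helps others, a quick way to sanity-check such manual edits
(a sketch; assumes the cluster stack is running):

```
# Show the schema the live CIB currently declares:
cibadmin --query | head -n 1
# Validate the live CIB against that schema:
crm_verify --live-check -V
```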

--
Mit freundlichen Grüßen

Toni Tschampke | tt at halle.it
bcs kommunikationslösungen
Inh. Dipl. Ing. Carsten Burkhardt
Harz 51 | 06108 Halle (Saale) | Germany
tel +49 345 29849-0 | fax +49 345 29849-22
www.b-c-s.de | www.halle.it | www.wivewa.de


EINFACH ADRESSEN, TELEFONATE UND DOKUMENTE VERWALTEN - MIT WIVEWA -
IHREM WISSENSVERWALTER FUER IHREN BETRIEB!

Weitere Informationen erhalten Sie unter www.wivewa.de

Am 08.11.2016 um 22:51 schrieb Ken Gaillot:
> On 11/07/2016 09:08 AM, Toni Tschampke wrote:
>> We managed to change the validate-with option via workaround (cibadmin
>> export & replace) as setting the value with cibadmin --modify doesn't
>> write the changes to disk.
>>
>> After experimenting with various schemes (xml is correctly interpreted
>> by crmsh) we are still not able to communicate with local crmd.
>>
>> Can someone please help us determine why the local crmd is not
>> responding (we disabled our other nodes to rule out possible
>> corosync-related issues) and why crmsh and cibadmin commands run into
>> errors/timeouts?
>
> It occurs to me that wheezy used corosync 1. There were major changes
> from corosync 1 to 2 ... 1 relied on a "plugin" to provide quorum for
> pacemaker, whereas 2 has quorum built-in.
>
> Did your upgrade documentation describe how to update the corosync
> configuration, and did that go well? crmd may be unable to function due
> to lack of quorum information.
>
>> Examples of local commands that do not work:
>>
>> Timeout when running cibadmin (strace attached):
>>> cibadmin --upgrade --force
>>> Call cib_upgrade failed (-62): Timer expired
>>
>> Error when running a crm resource cleanup:
>>> crm resource cleanup $vm
>>> Error signing on to the CRMd service
>>> Error performing operation: Transport endpoint is not connected
>>
>> I attached the strace log from running cib_upgrade; does it help to
>> pinpoint the cause of the timeout?
>>
>> Here is the corosync log output when starting pacemaker locally:
>>
>>> Nov 07 16:01:59 [24339] nebel1 corosync notice  [MAIN  ] main.c:1256
>>> Corosync Cluster Engine ('2.3.6'): started and ready to provide service.
>>> Nov 07 16:01:59 [24339] nebel1 corosync info    [MAIN  ] main.c:1257
>>> Corosync built-in features: dbus rdma monitoring watchdog augeas
>>> systemd upstart xmlconf qdevices snmp pie relro bindnow
>>> Nov 07 16:01:59 [24339] nebel1 corosync notice  [TOTEM ]
>>> totemnet.c:248 Initializing transport (UDP/IP Multicast).
>>> Nov 07 16:01:59 [24339] nebel1 corosync notice  [TOTEM ]
>>> totemcrypto.c:579 Initializing transmit/receive security (NSS) crypto:
>>> none hash: none
>>> Nov 07 16:01:59 [24339] nebel1 corosync notice  [TOTEM ]
>>> totemnet.c:248 Initializing transport (UDP/IP Multicast).
>>> Nov 07 16:01:59 [24339] nebel1 corosync notice  [TOTEM ]
>>> totemcrypto.c:579 Initializing transmit/receive security (NSS) crypto:
>>> none hash: none
>>> Nov 07 16:01:59 [24339] nebel1 corosync notice  [TOTEM ]
>>> totemudp.c:671 The network interface [10.112.0.1] is now up.
>>> Nov 07 16:01:59 [24339] nebel1 corosync notice  [SERV  ] service.c:174
>>> Service engine loaded: corosync configuration map access [0]
>>> Nov 07 16:01:59 [24339] nebel1 corosync info    [QB    ]
>>> ipc_setup.c:536 server name: cmap
>>> Nov 07 16:01:59 [24339] nebel1 corosync notice  [SERV  ] service.c:174
>>> Service engine loaded: corosync configuration service [1]
>>> Nov 07 16:01:59 [24339] nebel1 corosync info    [QB    ]
>>> ipc_setup.c:536 server name: cfg
>>> Nov 07 16:01:59 [24339] nebel1 corosync notice  [SERV  ] service.c:174
>>> Service engine loaded: corosync cluster closed process group service
>>> v1.01 [2]
>>> Nov 07 16:01:59 [24339] nebel1 corosync info    [QB    ]
>>> ipc_setup.c:536 server name: cpg
>>> Nov 07 16:01:59 [24339] nebel1 corosync notice  [SERV  ] service.c:174
>>> Service engine loaded: corosync profile loading service [4]
>>> Nov 07 16:01:59 [24339] nebel1 corosync notice  [SERV  ] service.c:174
>>> Service engine loaded: corosync resource monitoring service [6]
>>> Nov 07 16:01:59 [24339] nebel1 corosync info    [WD    ] wd.c:669
>>> Watchdog /dev/watchdog is now been tickled by corosync.
>>> Nov 07 16:01:59 [24339] nebel1 corosync warning [WD    ] wd.c:625
>>> Could not change the Watchdog timeout from 10 to 6 seconds
>>> Nov 07 16:01:59 [24339] nebel1 corosync warning [WD    ] wd.c:464
>>> resource load_15min missing a recovery key.
>>> Nov 07 16:01:59 [24339] nebel1 corosync warning [WD    ] wd.c:464
>>> resource memory_used missing a recovery key.
>>> Nov 07 16:01:59 [24339] nebel1 corosync info    [WD    ] wd.c:581 no
>>> resources configured.
>>> Nov 07 16:01:59 [24339] nebel1 corosync notice  [SERV  ] service.c:174
>>> Service engine loaded: corosync watchdog service [7]
>>> Nov 07 16:01:59 [24339] nebel1 corosync notice  [SERV  ] service.c:174
>>> Service engine loaded: corosync cluster quorum service v0.1 [3]
>>> Nov 07 16:01:59 [24339] nebel1 corosync info    [QB    ]
>>> ipc_setup.c:536 server name: quorum
>>> Nov 07 16:01:59 [24339] nebel1 corosync notice  [TOTEM ]
>>> totemudp.c:671 The network interface [10.110.1.1] is now up.
>>> Nov 07 16:01:59 [24339] nebel1 corosync notice  [TOTEM ]
>>> totemsrp.c:2095 A new membership (10.112.0.1:348) was formed. Members
>>> joined: 1
>>> Nov 07 16:01:59 [24339] nebel1 corosync notice  [MAIN  ] main.c:310
>>> Completed service synchronization, ready to provide service.
>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd:   notice: main:
>>> Starting Pacemaker 1.1.15 | build=e174ec8 features: generated-manpages
>>> agent-manpages ascii-docs publican-docs ncurses libqb-logging
>>> libqb-ipc lha-fencing upstart systemd nagios  corosync-native
>>> atomic-attrd snmp libesmtp acls
>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd:     info: main:
>>> Maximum core file size is: 18446744073709551615
>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd:     info:
>>> qb_ipcs_us_publish:        server name: pacemakerd
>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd:     info:
>>> corosync_node_name:        Unable to get node name for nodeid 1
>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd:   notice:
>>> get_node_name:     Could not obtain a node name for corosync nodeid 1
>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd:     info:
>>> crm_get_peer:      Created entry
>>> 283a5061-34c2-4b81-bff9-738533f22277/0x7f8a151931a0 for node (null)/1
>>> (1 total)
>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd:     info:
>>> crm_get_peer:      Node 1 has uuid 1
>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd:     info:
>>> crm_update_peer_proc:      cluster_connect_cpg: Node (null)[1] -
>>> corosync-cpg is now online
>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd:    error:
>>> cluster_connect_quorum:    Corosync quorum is not configured
>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd:     info:
>>> corosync_node_name:        Unable to get node name for nodeid 1
>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd:   notice:
>>> get_node_name:     Defaulting to uname -n for the local corosync node
>>> name
>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd:     info:
>>> crm_get_peer:      Node 1 is now known as nebel1
>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd:     info:
>>> start_child:       Using uid=108 and group=114 for process cib
>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd:     info:
>>> start_child:       Forked child 24342 for process cib
>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd:     info:
>>> start_child:       Forked child 24343 for process stonith-ng
>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd:     info:
>>> start_child:       Forked child 24344 for process lrmd
>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd:     info:
>>> start_child:       Using uid=108 and group=114 for process attrd
>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd:     info:
>>> start_child:       Forked child 24345 for process attrd
>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd:     info:
>>> start_child:       Using uid=108 and group=114 for process pengine
>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd:     info:
>>> start_child:       Forked child 24346 for process pengine
>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd:     info:
>>> start_child:       Using uid=108 and group=114 for process crmd
>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd:     info:
>>> start_child:       Forked child 24347 for process crmd
>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd:     info: main:
>>> Starting mainloop
>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd:     info:
>>> pcmk_cpg_membership:       Node 1 joined group pacemakerd (counter=0.0)
>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd:     info:
>>> pcmk_cpg_membership:       Node 1 still member of group pacemakerd
>>> (peer=nebel1, counter=0.0)
>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd:     info:
>>> mcp_cpg_deliver:   Ignoring process list sent by peer for local node
>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd:     info:
>>> mcp_cpg_deliver:   Ignoring process list sent by peer for local node
>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd:     info:
>>> mcp_cpg_deliver:   Ignoring process list sent by peer for local node
>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd:     info:
>>> mcp_cpg_deliver:   Ignoring process list sent by peer for local node
>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd:     info:
>>> mcp_cpg_deliver:   Ignoring process list sent by peer for local node
>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd:     info:
>>> mcp_cpg_deliver:   Ignoring process list sent by peer for local node
>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd:     info:
>>> mcp_cpg_deliver:   Ignoring process list sent by peer for local node
>>> Nov 07 16:01:59 [24342] nebel1        cib:     info:
>>> crm_log_init:      Changed active directory to /var/lib/pacemaker/cores
>>> Nov 07 16:01:59 [24342] nebel1        cib:   notice: main:      Using
>>> legacy config location: /var/lib/heartbeat/crm
>>> Nov 07 16:01:59 [24342] nebel1        cib:     info:
>>> get_cluster_type:  Verifying cluster type: 'corosync'
>>> Nov 07 16:01:59 [24342] nebel1        cib:     info:
>>> get_cluster_type:  Assuming an active 'corosync' cluster
>>> Nov 07 16:01:59 [24342] nebel1        cib:     info:
>>> retrieveCib:       Reading cluster configuration file
>>> /var/lib/heartbeat/crm/cib.xml (digest:
>>> /var/lib/heartbeat/crm/cib.xml.sig)
>>> Nov 07 16:01:59 [24344] nebel1       lrmd:     info:
>>> crm_log_init:      Changed active directory to /var/lib/pacemaker/cores
>>> Nov 07 16:01:59 [24344] nebel1       lrmd:     info:
>>> qb_ipcs_us_publish:        server name: lrmd
>>> Nov 07 16:01:59 [24344] nebel1       lrmd:     info: main:      Starting
>>> Nov 07 16:01:59 [24346] nebel1    pengine:     info:
>>> crm_log_init:      Changed active directory to /var/lib/pacemaker/cores
>>> Nov 07 16:01:59 [24346] nebel1    pengine:     info:
>>> qb_ipcs_us_publish:        server name: pengine
>>> Nov 07 16:01:59 [24346] nebel1    pengine:     info: main:
>>> Starting pengine
>>> Nov 07 16:01:59 [24345] nebel1      attrd:     info:
>>> crm_log_init:      Changed active directory to /var/lib/pacemaker/cores
>>> Nov 07 16:01:59 [24345] nebel1      attrd:     info: main:
>>> Starting up
>>> Nov 07 16:01:59 [24345] nebel1      attrd:     info:
>>> get_cluster_type:  Verifying cluster type: 'corosync'
>>> Nov 07 16:01:59 [24345] nebel1      attrd:     info:
>>> get_cluster_type:  Assuming an active 'corosync' cluster
>>> Nov 07 16:01:59 [24345] nebel1      attrd:   notice:
>>> crm_cluster_connect:       Connecting to cluster infrastructure: corosync
>>> Nov 07 16:01:59 [24347] nebel1       crmd:     info:
>>> crm_log_init:      Changed active directory to /var/lib/pacemaker/cores
>>> Nov 07 16:01:59 [24347] nebel1       crmd:     info: main:      CRM
>>> Git Version: 1.1.15 (e174ec8)
>>> Nov 07 16:01:59 [24343] nebel1 stonith-ng:     info:
>>> crm_log_init:      Changed active directory to /var/lib/pacemaker/cores
>>> Nov 07 16:01:59 [24343] nebel1 stonith-ng:     info:
>>> get_cluster_type:  Verifying cluster type: 'corosync'
>>> Nov 07 16:01:59 [24343] nebel1 stonith-ng:     info:
>>> get_cluster_type:  Assuming an active 'corosync' cluster
>>> Nov 07 16:01:59 [24343] nebel1 stonith-ng:   notice:
>>> crm_cluster_connect:       Connecting to cluster infrastructure: corosync
>>> Nov 07 16:01:59 [24347] nebel1       crmd:     info: do_log:    Input
>>> I_STARTUP received in state S_STARTING from crmd_init
>>> Nov 07 16:01:59 [24347] nebel1       crmd:     info:
>>> get_cluster_type:  Verifying cluster type: 'corosync'
>>> Nov 07 16:02:00 [24342] nebel1        cib:     info:
>>> corosync_node_name:        Unable to get node name for nodeid 1
>>> Nov 07 16:02:00 [24343] nebel1 stonith-ng:     info:
>>> corosync_node_name:        Unable to get node name for nodeid 1
>>> Nov 07 16:02:00 [24342] nebel1        cib:   notice:
>>> get_node_name:     Could not obtain a node name for corosync nodeid 1
>>> Nov 07 16:02:00 [24343] nebel1 stonith-ng:   notice:
>>> get_node_name:     Defaulting to uname -n for the local corosync node
>>> name
>>> Nov 07 16:02:00 [24343] nebel1 stonith-ng:     info:
>>> crm_get_peer:      Node 1 is now known as nebel1
>>> Nov 07 16:02:00 [24342] nebel1        cib:     info:
>>> crm_get_peer:      Created entry
>>> f5df58e3-3848-440c-8f6b-d572f8fa9b9c/0x7f0ce1744570 for node (null)/1
>>> (1 total)
>>> Nov 07 16:02:00 [24342] nebel1        cib:     info:
>>> crm_get_peer:      Node 1 has uuid 1
>>> Nov 07 16:02:00 [24342] nebel1        cib:     info:
>>> crm_update_peer_proc:      cluster_connect_cpg: Node (null)[1] -
>>> corosync-cpg is now online
>>> Nov 07 16:02:00 [24342] nebel1        cib:   notice:
>>> crm_update_peer_state_iter:        Node (null) state is now member |
>>> nodeid=1 previous=unknown source=crm_update_peer_proc
>>> Nov 07 16:02:00 [24342] nebel1        cib:     info:
>>> init_cs_connection_once:   Connection to 'corosync': established
>>> Nov 07 16:02:00 [24345] nebel1      attrd:     info: main:
>>> Cluster connection active
>>> Nov 07 16:02:00 [24345] nebel1      attrd:     info:
>>> qb_ipcs_us_publish:        server name: attrd
>>> Nov 07 16:02:00 [24345] nebel1      attrd:     info: main:
>>> Accepting attribute updates
>>> Nov 07 16:02:00 [24342] nebel1        cib:     info:
>>> corosync_node_name:        Unable to get node name for nodeid 1
>>> Nov 07 16:02:00 [24342] nebel1        cib:   notice:
>>> get_node_name:     Defaulting to uname -n for the local corosync node
>>> name
>>> Nov 07 16:02:00 [24342] nebel1        cib:     info:
>>> crm_get_peer:      Node 1 is now known as nebel1
>>> Nov 07 16:02:00 [24342] nebel1        cib:     info:
>>> qb_ipcs_us_publish:        server name: cib_ro
>>> Nov 07 16:02:00 [24342] nebel1        cib:     info:
>>> qb_ipcs_us_publish:        server name: cib_rw
>>> Nov 07 16:02:00 [24342] nebel1        cib:     info:
>>> qb_ipcs_us_publish:        server name: cib_shm
>>> Nov 07 16:02:00 [24342] nebel1        cib:     info: cib_init:
>>> Starting cib mainloop
>>> Nov 07 16:02:00 [24342] nebel1        cib:     info:
>>> pcmk_cpg_membership:       Node 1 joined group cib (counter=0.0)
>>> Nov 07 16:02:00 [24342] nebel1        cib:     info:
>>> pcmk_cpg_membership:       Node 1 still member of group cib
>>> (peer=nebel1, counter=0.0)
>>> Nov 07 16:02:00 [24342] nebel1        cib:     info:
>>> cib_file_backup:   Archived previous version as
>>> /var/lib/heartbeat/crm/cib-72.raw
>>> Nov 07 16:02:00 [24342] nebel1        cib:     info:
>>> cib_file_write_with_digest:        Wrote version 0.8464.0 of the CIB
>>> to disk (digest: 5201c56641a95e5117df4184587c3e93)
>>> Nov 07 16:02:00 [24342] nebel1        cib:     info:
>>> cib_file_write_with_digest:        Reading cluster configuration file
>>> /var/lib/heartbeat/crm/cib.naRhNz (digest:
>>> /var/lib/heartbeat/crm/cib.hLaVCH)
>>> Nov 07 16:02:00 [24347] nebel1       crmd:     info:
>>> do_cib_control:    CIB connection established
>>> Nov 07 16:02:00 [24347] nebel1       crmd:   notice:
>>> crm_cluster_connect:       Connecting to cluster infrastructure: corosync
>>> Nov 07 16:02:00 [24347] nebel1       crmd:     info:
>>> corosync_node_name:        Unable to get node name for nodeid 1
>>> Nov 07 16:02:00 [24347] nebel1       crmd:   notice:
>>> get_node_name:     Could not obtain a node name for corosync nodeid 1
>>> Nov 07 16:02:00 [24347] nebel1       crmd:     info:
>>> crm_get_peer:      Created entry
>>> 43a3b98f-d81d-4cc7-b46e-4512f24db371/0x7f798ff40040 for node (null)/1
>>> (1 total)
>>> Nov 07 16:02:00 [24347] nebel1       crmd:     info:
>>> crm_get_peer:      Node 1 has uuid 1
>>> Nov 07 16:02:00 [24347] nebel1       crmd:     info:
>>> crm_update_peer_proc:      cluster_connect_cpg: Node (null)[1] -
>>> corosync-cpg is now online
>>> Nov 07 16:02:00 [24347] nebel1       crmd:     info:
>>> init_cs_connection_once:   Connection to 'corosync': established
>>> Nov 07 16:02:00 [24347] nebel1       crmd:     info:
>>> corosync_node_name:        Unable to get node name for nodeid 1
>>> Nov 07 16:02:00 [24347] nebel1       crmd:   notice:
>>> get_node_name:     Defaulting to uname -n for the local corosync node
>>> name
>>> Nov 07 16:02:00 [24347] nebel1       crmd:     info:
>>> crm_get_peer:      Node 1 is now known as nebel1
>>> Nov 07 16:02:00 [24347] nebel1       crmd:     info:
>>> peer_update_callback:      nebel1 is now in unknown state
>>> Nov 07 16:02:00 [24347] nebel1       crmd:    error:
>>> cluster_connect_quorum:    Corosync quorum is not configured
>>> Nov 07 16:02:01 [24347] nebel1       crmd:     info:
>>> corosync_node_name:        Unable to get node name for nodeid 1
>>> Nov 07 16:02:01 [24347] nebel1       crmd:     info:
>>> corosync_node_name:        Unable to get node name for nodeid 2
>>> Nov 07 16:02:01 [24347] nebel1       crmd:     info:
>>> corosync_node_name:        Unable to get node name for nodeid 2
>>> Nov 07 16:02:01 [24347] nebel1       crmd:   notice:
>>> get_node_name:     Could not obtain a node name for corosync nodeid 2
>>> Nov 07 16:02:01 [24347] nebel1       crmd:     info:
>>> crm_get_peer:      Created entry
>>> c790c642-6666-4022-bba9-f700e4773b03/0x7f79901428e0 for node (null)/2
>>> (2 total)
>>> Nov 07 16:02:01 [24347] nebel1       crmd:     info:
>>> crm_get_peer:      Node 2 has uuid 2
>>> Nov 07 16:02:01 [24347] nebel1       crmd:     info:
>>> corosync_node_name:        Unable to get node name for nodeid 3
>>> Nov 07 16:02:01 [24347] nebel1       crmd:     info:
>>> corosync_node_name:        Unable to get node name for nodeid 3
>>> Nov 07 16:02:01 [24347] nebel1       crmd:   notice:
>>> get_node_name:     Could not obtain a node name for corosync nodeid 3
>>> Nov 07 16:02:01 [24347] nebel1       crmd:     info:
>>> crm_get_peer:      Created entry
>>> 928f8124-4d29-4285-99de-50038d3c3b7e/0x7f7990142a20 for node (null)/3
>>> (3 total)
>>> Nov 07 16:02:01 [24347] nebel1       crmd:     info:
>>> crm_get_peer:      Node 3 has uuid 3
>>> Nov 07 16:02:01 [24347] nebel1       crmd:     info:
>>> do_ha_control:     Connected to the cluster
>>> Nov 07 16:02:01 [24347] nebel1       crmd:     info:
>>> lrmd_ipc_connect:  Connecting to lrmd
>>> Nov 07 16:02:01 [24342] nebel1        cib:     info:
>>> cib_process_request:       Forwarding cib_modify operation for section
>>> nodes to all (origin=local/crmd/3)
>>> Nov 07 16:02:01 [24347] nebel1       crmd:     info:
>>> do_lrm_control:    LRM connection established
>>> Nov 07 16:02:01 [24347] nebel1       crmd:     info:
>>> do_started:        Delaying start, no membership data (0000000000100000)
>>> Nov 07 16:02:01 [24342] nebel1        cib:     info:
>>> corosync_node_name:        Unable to get node name for nodeid 1
>>> Nov 07 16:02:01 [24342] nebel1        cib:   notice:
>>> get_node_name:     Defaulting to uname -n for the local corosync node
>>> name
>>> Nov 07 16:02:01 [24347] nebel1       crmd:     info:
>>> parse_notifications:       No optional alerts section in cib
>>> Nov 07 16:02:01 [24347] nebel1       crmd:     info:
>>> do_started:        Delaying start, no membership data (0000000000100000)
>>> Nov 07 16:02:01 [24347] nebel1       crmd:     info:
>>> pcmk_cpg_membership:       Node 1 joined group crmd (counter=0.0)
>>> Nov 07 16:02:01 [24347] nebel1       crmd:     info:
>>> pcmk_cpg_membership:       Node 1 still member of group crmd
>>> (peer=nebel1, counter=0.0)
>>> Nov 07 16:02:01 [24342] nebel1        cib:     info:
>>> cib_process_request:       Completed cib_modify operation for section
>>> nodes: OK (rc=0, origin=nebel1/crmd/3, version=0.8464.0)
>>> Nov 07 16:02:01 [24345] nebel1      attrd:     info:
>>> attrd_cib_connect: Connected to the CIB after 2 attempts
>>> Nov 07 16:02:01 [24345] nebel1      attrd:     info: main:      CIB
>>> connection active
>>> Nov 07 16:02:01 [24345] nebel1      attrd:     info:
>>> pcmk_cpg_membership:       Node 1 joined group attrd (counter=0.0)
>>> Nov 07 16:02:01 [24345] nebel1      attrd:     info:
>>> pcmk_cpg_membership:       Node 1 still member of group attrd
>>> (peer=nebel1, counter=0.0)
>>> Nov 07 16:02:01 [24343] nebel1 stonith-ng:     info: setup_cib:
>>> Watching for stonith topology changes
>>> Nov 07 16:02:01 [24343] nebel1 stonith-ng:     info:
>>> qb_ipcs_us_publish:        server name: stonith-ng
>>> Nov 07 16:02:01 [24343] nebel1 stonith-ng:     info: main:
>>> Starting stonith-ng mainloop
>>> Nov 07 16:02:01 [24343] nebel1 stonith-ng:     info:
>>> pcmk_cpg_membership:       Node 1 joined group stonith-ng (counter=0.0)
>>> Nov 07 16:02:01 [24343] nebel1 stonith-ng:     info:
>>> pcmk_cpg_membership:       Node 1 still member of group stonith-ng
>>> (peer=nebel1, counter=0.0)
>>> Nov 07 16:02:01 [24343] nebel1 stonith-ng:     info:
>>> init_cib_cache_cb: Updating device list from the cib: init
>>> Nov 07 16:02:01 [24343] nebel1 stonith-ng:     info:
>>> cib_devices_update:        Updating devices to version 0.8464.0
>>> Nov 07 16:02:01 [24343] nebel1 stonith-ng:   notice:
>>> unpack_config:     On loss of CCM Quorum: Ignore
>>> Nov 07 16:02:02 [24343] nebel1 stonith-ng:   notice:
>>> stonith_device_register:   Added 'stonith1Nebel2' to the device list
>>> (1 active devices)
>>> Nov 07 16:02:02 [24343] nebel1 stonith-ng:     info:
>>> cib_device_update: Device stonith1Nebel1 has been disabled on nebel1:
>>> score=-INFINITY
>>
>> Current cib settings:
>>> cibadmin -Q | grep validate
>>> <cib admin_epoch="0" epoch="8464" num_updates="0"
>>> validate-with="pacemaker-2.4" crm_feature_set="3.0.10" have-quorum="1"
>>> cib-last-written="Fri Nov  4 12:15:30 2016" update-origin="nebel3"
>>> update-client="crm_attribute" update-user="root">
>>
>> Any help is appreciated, thanks in advance
>>
>> Regards, Toni
>>
>> Am 03.11.2016 um 17:42 schrieb Toni Tschampke:
>>>   > I'm guessing this change should be instantly written into the xml
>>>   > file?
>>>   > If this is the case something is wrong, grepping for validate
>>>   > gives the old string back.
>>>
>>> We found some strange behavior when setting "validate-with" via
>>> cibadmin: corosync.log shows the successful transaction, and issuing
>>> cibadmin --query returns the correct value, but it is NOT written to
>>> cib.xml.
>>>
>>> After restarting Pacemaker, the value is reset to pacemaker-1.1.
>>> If the signatures for cib.xml are generated by pacemaker/cib, which
>>> algorithm is used? It looks like MD5 to me.
>>>
>>> Would it be possible to manually edit cib.xml and generate a valid
>>> cib.xml.sig, to get one step further in the debugging process?
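
If it is indeed MD5 (an assumption worth verifying before hand-editing
anything), the digest stored in the .sig file should simply match the
output of md5sum over the XML file: compare
md5sum /var/lib/heartbeat/crm/cib.xml against cat cib.xml.sig.
Demonstrated here on a scratch string:

```shell
# MD5 digest of a known string, as a stand-in for the cib.xml contents:
digest=$(printf 'hello' | md5sum | awk '{print $1}')
echo "$digest"   # -> 5d41402abc4b2a76b9719d911017c592
```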
>>>
>>> Regards, Toni
>>>
>>> Am 03.11.2016 um 16:39 schrieb Toni Tschampke:
>>>>   > I'm going to guess you were using the experimental 1.1 schema as the
>>>>   > "validate-with" at the top of /var/lib/pacemaker/cib/cib.xml. Try
>>>>   > changing the validate-with to pacemaker-next or pacemaker-1.2 and
>>>> see if
>>>>   > you get better results. Don't edit the file directly though; use the
>>>>   > cibadmin command so it signs the end result properly.
>>>>   >
>>>>   > After changing the validate-with, run:
>>>>   >
>>>>   >    crm_verify -x /var/lib/pacemaker/cib/cib.xml
>>>>   >
>>>>   > and fix any errors that show up.
>>>>
>>>> Strange, the location of our cib.xml differs from your path; our CIB
>>>> is located in /var/lib/heartbeat/crm/.
>>>>
>>>> Running cibadmin --modify --xml-text '<cib
>>>> validate-with="pacemaker-1.2"/>'
>>>>
>>>> produced no output, but the change was logged to corosync:
>>>>
>>>> cib:     info: cib_perform_op:    -- <cib num_updates="0"
>>>> validate-with="pacemaker-1.1"/>
>>>> cib:     info: cib_perform_op:    ++ <cib admin_epoch="0" epoch="8462"
>>>> num_updates="1" validate-with="pacemaker-1.2" crm_feature_set="3.0.6"
>>>>    have-quorum="1" cib-last-written="Thu Nov  3 10:05:52 2016"
>>>> update-origin="nebel1" update-client="cibadmin" update-user="root"/>
>>>>
>>>> I'm guessing this change should be written to the XML file
>>>> immediately? If so, something is wrong: grepping for validate still
>>>> returns the old string.
>>>>
>>>> <cib admin_epoch="0" epoch="8462" num_updates="0"
>>>> validate-with="pacemaker-1.1" crm_feature_set="3.0.6" have-quorum="1"
>>>> cib-last-written="Thu Nov  3 16:19:51 2016" update-origin="nebel1"
>>>> update-client="cibadmin" update-user="root">
>>>>
>>>> pacemakerd --features
>>>> Pacemaker 1.1.15 (Build: e174ec8)
>>>> Supporting v3.0.10:
>>>>
>>>> Should the crm_feature_set be updated this way too? I'm guessing
>>>> this is done when "cibadmin --upgrade" succeeds?
>>>>
>>>> We just get a timeout error when trying to upgrade it with cibadmin:
>>>> Call cib_upgrade failed (-62): Timer expired
>>>>
>>>> Have permissions changed from 1.1.7 to 1.1.15? Looking at our quite
>>>> large /var/lib/heartbeat/crm/ folder, some permissions have changed:
>>>>
>>>> -rw------- 1 hacluster root      80K Nov  1 16:56 cib-31.raw
>>>> -rw-r--r-- 1 hacluster root       32 Nov  1 16:56 cib-31.raw.sig
>>>> -rw------- 1 hacluster haclient  80K Nov  1 18:53 cib-32.raw
>>>> -rw------- 1 hacluster haclient   32 Nov  1 18:53 cib-32.raw.sig
>>>>
>>>> cib-31 is from before the upgrade; cib-32 is from after starting the
>>>> upgraded Pacemaker.
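
If the new packages expect the haclient group and mode 0600 on those
files (an assumption based on the cib-32 listing above), normalizing
the pre-upgrade files might look like the following sketch; verify
against a freshly written file first:

```
# Align pre-upgrade CIB backups with the post-upgrade ownership
# observed on cib-32 (hacluster:haclient, mode 0600). Paths and
# ownership here are assumptions, not confirmed requirements.
chown hacluster:haclient /var/lib/heartbeat/crm/cib-*.raw*
chmod 0600 /var/lib/heartbeat/crm/cib-*.raw*
```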
>>>>
>>>>
>>>> Am 03.11.2016 um 15:39 schrieb Ken Gaillot:
>>>>> On 11/03/2016 05:51 AM, Toni Tschampke wrote:
>>>>>> Hi,
>>>>>>
>>>>>> we just upgraded our nodes from wheezy 7.11 (pacemaker 1.1.7) to
>>>>>> jessie (pacemaker 1.1.15, corosync 2.3.6).
>>>>>> During the upgrade pacemaker was removed (rc) and afterwards
>>>>>> reinstalled from jessie-backports, same for crmsh.
>>>>>>
>>>>>> Now we are encountering multiple problems:
>>>>>>
>>>>>> First I checked the configuration on a single node running
>>>>>> pacemaker & corosync, which produced a strange error followed by
>>>>>> multiple lines stating the syntax is wrong. crm configure show then
>>>>>> displayed a mixed view of XML and crmsh single-line syntax.
>>>>>>
>>>>>>> ERROR: Cannot read schema file
>>>>>> '/usr/share/pacemaker/pacemaker-1.1.rng': [Errno 2] No such file or
>>>>>> directory: '/usr/share/pacemaker/pacemaker-1.1.rng'
>>>>>
>>>>> pacemaker-1.1.rng was renamed to pacemaker-next.rng in Pacemaker
>>>>> 1.1.12,
>>>>> as it was used to hold experimental new features rather than as the
>>>>> actual next version of the schema. So, the schema skipped to 1.2.
>>>>>
>>>>> I'm going to guess you were using the experimental 1.1 schema as the
>>>>> "validate-with" at the top of /var/lib/pacemaker/cib/cib.xml. Try
>>>>> changing the validate-with to pacemaker-next or pacemaker-1.2 and
>>>>> see if
>>>>> you get better results. Don't edit the file directly though; use the
>>>>> cibadmin command so it signs the end result properly.
>>>>>
>>>>> After changing the validate-with, run:
>>>>>
>>>>>     crm_verify -x /var/lib/pacemaker/cib/cib.xml
>>>>>
>>>>> and fix any errors that show up.
>>>>>
>>>>>> When we looked into that folder there were pacemaker-1.0.rng, 1.2
>>>>>> and so on. As a quick try we symlinked 1.2 to 1.1 and the syntax
>>>>>> errors were gone. When running crm resource show, all resources
>>>>>> showed up; when running crm_mon -1fA the output was unexpected, as
>>>>>> it showed all nodes offline, with no DC elected:
>>>>>>
>>>>>>> Stack: corosync
>>>>>>> Current DC: NONE
>>>>>>> Last updated: Thu Nov  3 11:11:16 2016
>>>>>>> Last change: Thu Nov  3 09:54:52 2016 by root via cibadmin on nebel1
>>>>>>>
>>>>>>>                *** Resource management is DISABLED ***
>>>>>>>    The cluster will not attempt to start, stop or recover services
>>>>>>>
>>>>>>> 3 nodes and 73 resources configured:
>>>>>>> 5 resources DISABLED and 0 BLOCKED from being started due to failures
>>>>>>>
>>>>>>> OFFLINE: [ nebel1 nebel2 nebel3 ]
>>>>>>
>>>>>> We tried to manually change dc-version.
>>>>>>
>>>>>> When issuing a simple cleanup command I got the following error:
>>>>>>
>>>>>>> crm resource cleanup DrbdBackuppcMs
>>>>>>> Error signing on to the CRMd service
>>>>>>> Error performing operation: Transport endpoint is not connected
>>>>>>
>>>>>> which looks like crmsh is unable to communicate with crmd; nothing
>>>>>> is logged in corosync.log in this case.
>>>>>>
>>>>>> We experimented with multiple config changes (corosync.conf:
>>>>>> pacemaker ver 0 -> 1; cib-bootstrap-options: cluster-infrastructure
>>>>>> from openais to corosync).
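
For reference, the stanza that "ver 0 -> 1" refers to is the corosync 1
plugin block below; on corosync 2 this service block should be removed
entirely rather than toggled (sketch):

```
service {
    # corosync 1 only: load pacemaker as a plugin (ver: 0) or run it
    # standalone (ver: 1). Not supported by corosync 2 at all.
    name: pacemaker
    ver: 1
}
```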
>>>>>>
>>>>>>> Package versions:
>>>>>>> cman 3.1.8-1.2+b1
>>>>>>> corosync 2.3.6-3~bpo8+1
>>>>>>> crmsh 2.2.0-1~bpo8+1
>>>>>>> csync2 1.34-2.3+b1
>>>>>>> dlm-pcmk 3.0.12-3.2+deb7u2
>>>>>>> libcman3 3.1.8-1.2+b1
>>>>>>> libcorosync-common4:amd64 2.3.6-3~bpo8+1
>>>>>>> munin-libvirt-plugins 0.0.6-1
>>>>>>> pacemaker 1.1.15-2~bpo8+1
>>>>>>> pacemaker-cli-utils 1.1.15-2~bpo8+1
>>>>>>> pacemaker-common 1.1.15-2~bpo8+1
>>>>>>> pacemaker-resource-agents 1.1.15-2~bpo8+1
>>>>>>
>>>>>>> Kernel: #1 SMP Debian 3.16.36-1+deb8u2 (2016-10-19) x86_64 GNU/Linux
>>>>>>
>>>>>> I attached our CIB from before and after the upgrade, as well as
>>>>>> the one with the mixed syntax, and our corosync.conf.
>>>>>>
>>>>>> When we tried to connect a second node to the cluster, pacemaker
>>>>>> started its daemons, started corosync, and died after 15 tries with
>>>>>> the following in the corosync log:
>>>>>>
>>>>>>> crmd: info: crm_timer_popped: Wait Timer (I_NULL) just popped
>>>>>>> (2000ms)
>>>>>>> crmd: info: do_cib_control: Could not connect to the CIB service:
>>>>>>> Transport endpoint is not connected
>>>>>>> crmd:  warning: do_cib_control:
>>>>>>> Couldn't complete CIB registration 15 times... pause and retry
>>>>>>> attrd: error: attrd_cib_connect: Signon to CIB failed:
>>>>>>> Transport endpoint is not connected (-107)
>>>>>>> attrd: info: main: Shutting down attribute manager
>>>>>>> attrd: info: qb_ipcs_us_withdraw: withdrawing server sockets
>>>>>>> attrd: info: crm_xml_cleanup: Cleaning up memory from libxml2
>>>>>>> crmd: info: crm_timer_popped: Wait Timer (I_NULL) just popped
>>>>>>> (2000ms)
>>>>>>> pacemakerd:  warning: pcmk_child_exit:
>>>>>>> The attrd process (12761) can no longer be respawned,
>>>>>>> shutting the cluster down.
>>>>>>> pacemakerd: notice: pcmk_shutdown_worker: Shutting down Pacemaker
>>>>>>
>>>>>> A third node joins without the above error, but crm_mon still
>>>>>> shows all nodes as offline.
>>>>>>
>>>>>> Thanks for any advice how to solve this, I'm out of ideas now.
>>>>>>
>>>>>> Regards, Toni
>
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>



