Consul configuration check failed during reconfigure
During our demo for repeatable DB creation, we added a new Omnibus-configured Patroni node to an existing Patroni cluster.
After the node was provisioned, the first gitlab-ctl reconfigure resulted in the following error:
[2021-09-08T13:53:34+00:00] INFO: execute[reload all sysctl conf] ran successfully
[2021-09-08T13:53:34+00:00] INFO: env_dir[/opt/gitlab/etc/node-exporter/env] sending restart action to runit_service[node-exporter] (delayed)
[2021-09-08T13:53:34+00:00] INFO: file[/var/opt/gitlab/consul/config.d/node-exporter-service.json] sending run action to execute[reload consul] (delayed)
[2021-09-08T13:53:36+00:00] INFO: execute[reload consul] ran successfully
[2021-09-08T13:53:36+00:00] INFO: env_dir[/opt/gitlab/etc/postgres-exporter/env] sending restart action to runit_service[postgres-exporter] (delayed)
[2021-09-08T13:53:37+00:00] INFO: file[/var/opt/gitlab/consul/config.json] sending run action to ruby_block[consul config change] (delayed)
[2021-09-08T13:53:37+00:00] WARN: You have made a change to the consul configuration, and the daemon was reloaded.
If the change isn't taking effect, restarting the consul agents may be required:
https://docs.gitlab.com/ee/administration/consul.html#restart-consul
[2021-09-08T13:53:37+00:00] INFO: ruby_block[consul config change] called
[2021-09-08T13:53:37+00:00] ERROR: Running exception handlers
[2021-09-08T13:53:37+00:00] ERROR: Exception handlers complete
[2021-09-08T13:53:37+00:00] FATAL: Stacktrace dumped to /opt/gitlab/embedded/cookbooks/cache/chef-stacktrace.out
[2021-09-08T13:53:37+00:00] FATAL: Please provide the contents of the stacktrace.out file if you file a bug report
[2021-09-08T13:53:37+00:00] FATAL: Errno::ECONNREFUSED: ruby_block[warn pending consul restart] (consul::enable_daemon line 43) had an error: Errno::ECONNREFUSED: Failed to open TCP connection to localhost:8500 (Connection refused - connect(2) for "localhost" port 8500)
It appears that although Consul was running, we were unable to connect to localhost:8500 to do this configuration check.
This failure occurred about a second after the reload. Here is the Consul log leading up to the failure:
2021-09-08_13:53:34.91144 2021-09-08T13:53:34.160Z [INFO] agent: (LAN) joined: number_of_nodes=3
2021-09-08_13:53:34.91144 2021-09-08T13:53:34.161Z [INFO] agent: Join cluster completed. Synced with initial agents: cluster=LAN num_agents=3
2021-09-08_13:53:36.49832 2021-09-08T13:53:36.498Z [WARN] agent: Check is now critical: check=service:postgresql
2021-09-08_13:53:36.69839 2021-09-08T13:53:36.696Z [INFO] agent: Caught: signal=hangup
2021-09-08_13:53:36.70049 2021-09-08T13:53:36.700Z [INFO] agent: Synced node info
2021-09-08_13:53:36.70819 2021-09-08T13:53:36.708Z [INFO] agent: Synced service: service=postgres-exporter
2021-09-08_13:53:36.71072 2021-09-08T13:53:36.710Z [WARN] agent.auto_config: Node name "ci-postgres-3.c.gitlab-sb-db-alpha.internal" will not be discoverable via DNS due to invalid characters. Valid characters include all alpha-numerics and dashes.
2021-09-08_13:53:36.71074 2021-09-08T13:53:36.710Z [WARN] agent.auto_config: using enable-script-checks without ACLs and without allow_write_http_from is DANGEROUS, use enable-local-script-checks instead, see https://www.hashicorp.com/blog/protecting-consul-from-rce-risk-in-specific-configurations/
2021-09-08_13:53:36.71075 2021-09-08T13:53:36.710Z [WARN] agent: DEPRECATED Backwards compatibility with pre-1.9 metrics enabled. These metrics will be removed in a future version of Consul. Set `telemetry { disable_compat_1.9 = true }` to disable them.
2021-09-08_13:53:36.71974 2021-09-08T13:53:36.719Z [INFO] agent: Synced service: service=postgresql
2021-09-08_13:53:36.73060 2021-09-08T13:53:36.730Z [INFO] agent: Synced service: service=node-exporter
2021-09-08_13:53:40.73297 2021-09-08T13:53:40.732Z [INFO] agent.client.serf.lan: serf: EventMemberJoin: ci-postgres-2.c.gitlab-sb-db-alpha.internal 10.142.0.42
2021-09-08_13:53:45.03814 2021-09-08T13:53:45.038Z [WARN] agent: Check is now critical: check=service:postgresql
2021-09-08_13:53:56.39795 2021-09-08T13:53:56.397Z [WARN] agent: Check is now critical: check=service:postgresql
My guess is that we probably have a race condition: Consul is told to reload, and we then immediately try to get the running version over the API using the Consul helper, before the agent is accepting connections on port 8500 again: https://gitlab.com/gitlab-org/omnibus-gitlab/-/blob/f9f19a4501fca236e27559c98f3d7449d7fdcd84/files/gitlab-cookbooks/consul/recipes/enable_daemon.rb#L43-54
A second reconfigure immediately following this error worked fine.
Maybe we can add a retry for this check?
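A rough sketch of what such a retry could look like, in plain Ruby. This is not the existing cookbook code: `with_connection_retry` and `running_consul_version` are hypothetical names, the attempt count and delay are arbitrary, and reading the version from `/v1/agent/self` is my assumption about what the helper does.

```ruby
require 'net/http'
require 'json'

# Hypothetical helper (not part of omnibus-gitlab): run the block and retry
# it while the Consul HTTP API is still refusing connections after a reload.
def with_connection_retry(attempts: 5, delay: 1)
  tries = 0
  begin
    tries += 1
    yield
  rescue Errno::ECONNREFUSED
    # Give up and re-raise the original error once attempts are exhausted.
    raise if tries >= attempts

    sleep(delay)
    retry
  end
end

# Assumption: the running version is read from the local agent's
# /v1/agent/self endpoint, under Config -> Version.
def running_consul_version
  body = Net::HTTP.get(URI('http://localhost:8500/v1/agent/self'))
  JSON.parse(body)['Config']['Version']
end

# Possible usage inside the version check:
#   version = with_connection_retry(attempts: 5, delay: 1) { running_consul_version }
```

Only `Errno::ECONNREFUSED` is retried here, so any other failure would still surface immediately. Another option might be setting Chef's common `retries` and `retry_delay` properties on the `ruby_block` itself, though that would retry on any exception raised in the block.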
cc @twk3