Consul configuration check failed during reconfigure
During our demo for repeatable DB creation, we added a new Omnibus-configured Patroni node to an existing Patroni cluster ([recording](https://youtu.be/KSK1S5MaEEk)).

After the node was provisioned, the first `gitlab-ctl reconfigure` resulted in the following error:

```
[2021-09-08T13:53:34+00:00] INFO: execute[reload all sysctl conf] ran successfully
[2021-09-08T13:53:34+00:00] INFO: env_dir[/opt/gitlab/etc/node-exporter/env] sending restart action to runit_service[node-exporter] (delayed)
[2021-09-08T13:53:34+00:00] INFO: file[/var/opt/gitlab/consul/config.d/node-exporter-service.json] sending run action to execute[reload consul] (delayed)
[2021-09-08T13:53:36+00:00] INFO: execute[reload consul] ran successfully
[2021-09-08T13:53:36+00:00] INFO: env_dir[/opt/gitlab/etc/postgres-exporter/env] sending restart action to runit_service[postgres-exporter] (delayed)
[2021-09-08T13:53:37+00:00] INFO: file[/var/opt/gitlab/consul/config.json] sending run action to ruby_block[consul config change] (delayed)
[2021-09-08T13:53:37+00:00] WARN: You have made a change to the consul configuration, and the daemon was reloaded.
If the change isn't taking effect, restarting the consul agents may be required: https://docs.gitlab.com/ee/administration/consul.html#restart-consul
[2021-09-08T13:53:37+00:00] INFO: ruby_block[consul config change] called
[2021-09-08T13:53:37+00:00] ERROR: Running exception handlers
[2021-09-08T13:53:37+00:00] ERROR: Exception handlers complete
[2021-09-08T13:53:37+00:00] FATAL: Stacktrace dumped to /opt/gitlab/embedded/cookbooks/cache/chef-stacktrace.out
[2021-09-08T13:53:37+00:00] FATAL: Please provide the contents of the stacktrace.out file if you file a bug report
[2021-09-08T13:53:37+00:00] FATAL: Errno::ECONNREFUSED: ruby_block[warn pending consul restart] (consul::enable_daemon line 43) had an error: Errno::ECONNREFUSED: Failed to open TCP connection to localhost:8500 (Connection refused - connect(2) for "localhost" port 8500)
```

It appears that although Consul was running, we were unable to connect to `localhost:8500` to do this configuration check. The failure occurred about a second after the reload. Here is the Consul log leading up to the failure:

```
2021-09-08_13:53:34.91144 2021-09-08T13:53:34.160Z [INFO] agent: (LAN) joined: number_of_nodes=3
2021-09-08_13:53:34.91144 2021-09-08T13:53:34.161Z [INFO] agent: Join cluster completed. Synced with initial agents: cluster=LAN num_agents=3
2021-09-08_13:53:36.49832 2021-09-08T13:53:36.498Z [WARN] agent: Check is now critical: check=service:postgresql
2021-09-08_13:53:36.69839 2021-09-08T13:53:36.696Z [INFO] agent: Caught: signal=hangup
2021-09-08_13:53:36.70049 2021-09-08T13:53:36.700Z [INFO] agent: Synced node info
2021-09-08_13:53:36.70819 2021-09-08T13:53:36.708Z [INFO] agent: Synced service: service=postgres-exporter
2021-09-08_13:53:36.71072 2021-09-08T13:53:36.710Z [WARN] agent.auto_config: Node name "ci-postgres-3.c.gitlab-sb-db-alpha.internal" will not be discoverable via DNS due to invalid characters. Valid characters include all alpha-numerics and dashes.
2021-09-08_13:53:36.71074 2021-09-08T13:53:36.710Z [WARN] agent.auto_config: using enable-script-checks without ACLs and without allow_write_http_from is DANGEROUS, use enable-local-script-checks instead, see https://www.hashicorp.com/blog/protecting-consul-from-rce-risk-in-specific-configurations/
2021-09-08_13:53:36.71075 2021-09-08T13:53:36.710Z [WARN] agent: DEPRECATED Backwards compatibility with pre-1.9 metrics enabled. These metrics will be removed in a future version of Consul. Set `telemetry { disable_compat_1.9 = true }` to disable them.
2021-09-08_13:53:36.71974 2021-09-08T13:53:36.719Z [INFO] agent: Synced service: service=postgresql
2021-09-08_13:53:36.73060 2021-09-08T13:53:36.730Z [INFO] agent: Synced service: service=node-exporter
2021-09-08_13:53:40.73297 2021-09-08T13:53:40.732Z [INFO] agent.client.serf.lan: serf: EventMemberJoin: ci-postgres-2.c.gitlab-sb-db-alpha.internal 10.142.0.42
2021-09-08_13:53:45.03814 2021-09-08T13:53:45.038Z [WARN] agent: Check is now critical: check=service:postgresql
2021-09-08_13:53:56.39795 2021-09-08T13:53:56.397Z [WARN] agent: Check is now critical: check=service:postgresql
```

My guess is that we have a race condition: Consul is told to reload, and then we immediately try to get the running version over the API using the Consul helper:

https://gitlab.com/gitlab-org/omnibus-gitlab/-/blob/f9f19a4501fca236e27559c98f3d7449d7fdcd84/files/gitlab-cookbooks/consul/recipes/enable_daemon.rb#L43-54

A second reconfigure immediately following this error worked fine. Maybe we can add a retry for this check?

cc @twk3
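
To make the retry idea concrete, here is a minimal sketch of what such a wrapper could look like. It is not the cookbook's actual code: `with_connection_retry`, its `attempts` and `delay` parameters, and the simulated check are all hypothetical names made up for illustration, and the real version lookup in the Consul helper would take the place of the block.

```ruby
# Hypothetical helper (not part of omnibus-gitlab): run the block, and if the
# connection is refused, wait and try again up to `attempts` times in total.
def with_connection_retry(attempts: 5, delay: 1)
  tries = 0
  begin
    tries += 1
    yield
  rescue Errno::ECONNREFUSED
    # Out of attempts: re-raise the original exception so the run still fails.
    raise if tries >= attempts

    sleep(delay)
    retry
  end
end

# Simulated check standing in for the API call to localhost:8500.
# It refuses the first two connections and succeeds on the third.
calls = 0
result = with_connection_retry(attempts: 3, delay: 0) do
  calls += 1
  raise Errno::ECONNREFUSED, 'localhost:8500' if calls < 3

  'running version fetched'
end

puts "#{result} after #{calls} attempts"
```

Since the failing step is a Chef `ruby_block`, setting the common `retries` and `retry_delay` resource properties on it might be a simpler alternative to a custom helper, assuming a plain re-run of the block is safe.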