Telemetry: Support multi-node setups
We decided that Prometheus is the most promising vehicle for collecting memory & topology data for self-managed customers and querying it back prior to submitting it via a [usage ping](https://docs.gitlab.com/ee/development/telemetry/usage_ping.html).
However, with our current setup this only works for single-node deployments, where we run an embedded Prometheus server that we know is reachable on `localhost`. For multi-node deployments (2k+ [reference architectures](https://docs.gitlab.com/ee/administration/reference_architectures/)), Prometheus runs as part of a dedicated `monitoring` node instead, so the Sidekiq job responsible for collecting usage ping data won't know how to locate it.
The discussions leading up to this issue have already identified several avenues for how to go about this, including:
- The ability to specify the address of an external Prometheus server in `gitlab.yml`. This is logged as https://gitlab.com/gitlab-org/gitlab/-/issues/30175.
- The possibility of querying Consul for the location of Prometheus, if Consul is available. (*)
(*) For the [2k reference architecture](https://docs.gitlab.com/ee/administration/reference_architectures/2k_users.html), Consul is not running, so we might have to find another way. Or maybe this is just a mistake in the documentation. Needs clarification.
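To make the Consul avenue concrete: Consul exposes registered services through its DNS interface (port 8600 by default), so locating Prometheus could be an SRV lookup. A minimal sketch in Ruby; the service name `prometheus.service.consul` and the DNS endpoint are assumptions about the deployment, not settled decisions:

```ruby
require 'resolv'

# Sketch: locate Prometheus through Consul's DNS interface.
# Assumes Consul's DNS endpoint listens on 127.0.0.1:8600 and the service
# is registered as 'prometheus' -- both are assumptions, not decisions.
def lookup_prometheus(nameservers: [['127.0.0.1', 8600]], name: 'prometheus.service.consul')
  resolver = Resolv::DNS.new(nameserver_port: nameservers)
  srv = resolver.getresource(name, Resolv::DNS::Resource::IN::SRV)
  format_srv(srv)
end

# Turn an SRV record into a "host:port" address string.
def format_srv(srv)
  "#{srv.target}:#{srv.port}"
end
```

If Consul is genuinely absent from the 2k architecture, this path only covers the larger reference architectures and we would still need the `gitlab.yml` setting as a fallback.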
#### Challenges
* **Location.** We would have to rely on users to set the new config option, or locate Prometheus via a Consul DNS request, which requires Consul to be running. Either way, we need to get quite creative here or rely on users to configure this so that Usage Ping can make use of it, which might take many months or might never happen.
* **Authentication.** We don't know whether an external Prometheus will be reachable without some sort of token. For .com, Prometheus servers run on GCP and require a GCP auth token to be queried. Do we know anything about how our customers run Prometheus today?
* **Configuration.** We currently rely heavily on the `instance` and `job` labels to assign metrics to a given service. If customers change their label or job setup, we might misreport metrics. It could make sense to create custom rules that pre-aggregate values in a way we control, as Ben suggested here: https://gitlab.com/gitlab-org/gitlab/-/issues/216660#note_358952794
* **Swappable components.** Users can substitute an alternative for NGINX in all scenarios and have it front all the services behind it (Workhorse (the gateway to Rails), Registry, MinIO, Pages, Grafana). Most of the time this is NGINX, but we (GitLab) intentionally give users the flexibility to swap this component on *all* platforms. https://gitlab.com/gitlab-org/gitlab/-/issues/217698#note_363647122 (This is a problem in the sense that we would not have metrics for those components.)
* **Data Privacy.** We need to be careful not to accidentally include any non-GitLab data, since an external Prometheus might be scraping other services we have no knowledge of.
* **Testing.** This has already been a pain for single-node setups. We need to be able to reliably test this on multi-node setups, such as review apps. We do run Prometheus as part of the perf-test deployments, though; for instance, the 10k Prometheus is available here: http://10k.testbed.gitlab.net:9090/
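And on the configuration concern, Ben's pre-aggregation idea could take the shape of recording rules we ship ourselves, so Usage Ping reads series under names we control rather than depending on customer `instance`/`job` setups. A hedged sketch; the group name, `gitlab_usage_ping:` prefix, and chosen metric are illustrative, not agreed conventions:

```yaml
# Illustrative recording rules GitLab could ship with its Prometheus
# config. Rule and metric names here are assumptions.
groups:
  - name: gitlab_usage_ping
    interval: 5m
    rules:
      - record: gitlab_usage_ping:node_memory_total_bytes:max
        expr: max(node_memory_MemTotal_bytes)
```

This would also help with the data-privacy concern, since the usage ping would only ever read the `gitlab_usage_ping:*` series rather than arbitrary scraped data.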
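On the authentication concern above: if the external Prometheus sits behind token auth, any query from the usage ping job would need to carry a bearer token. A rough sketch of what that could look like against Prometheus' HTTP query API; the `prometheus_url` and `token` parameters are hypothetical settings, nothing in `gitlab.yml` provides them today:

```ruby
require 'net/http'
require 'json'
require 'uri'

# Sketch: build a query against Prometheus' HTTP API, attaching a bearer
# token when one is configured. Where the URL and token come from is an
# open question (gitlab.yml? Consul?) -- hypothetical here.
def build_query_request(prometheus_url, promql, token: nil)
  uri = URI.join(prometheus_url, '/api/v1/query')
  uri.query = URI.encode_www_form(query: promql)
  request = Net::HTTP::Get.new(uri)
  request['Authorization'] = "Bearer #{token}" if token
  request
end

# Sketch: execute the request and parse the JSON response body.
def query_prometheus(prometheus_url, promql, token: nil)
  request = build_query_request(prometheus_url, promql, token: token)
  uri = request.uri
  response = Net::HTTP.start(uri.hostname, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(request)
  end
  JSON.parse(response.body)
end
```

A GCP-style auth flow (as used for .com) would need more than a static token, so this only covers the simplest case.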